Notation
This section provides a concise reference describing the notation used throughout
this book. If you are unfamiliar with any of the corresponding mathematical
concepts, this notation reference may seem intimidating. However, do not despair;
we describe most of these ideas in chapters 2-4.
Numbers and Arrays
$a$    A scalar (integer or real)
$\boldsymbol{a}$    A vector
$\boldsymbol{A}$    A matrix
$\boldsymbol{\mathsf{A}}$    A tensor
$\boldsymbol{I}_n$    Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$    Identity matrix with dimensionality implied by context
$\boldsymbol{e}^{(i)}$    Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\text{diag}(\boldsymbol{a})$    A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
$\mathrm{a}$    A scalar random variable
$\mathbf{a}$    A vector-valued random variable
$\mathbf{A}$    A matrix-valued random variable
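As a rough illustration of the array notation above, the following Python sketch builds an identity matrix, a standard basis vector, and a diagonal matrix from plain lists. The helper names `identity`, `basis_vector`, and `diag` are hypothetical and chosen only for this example; positions are 1-based to match the notation.

```python
def identity(n):
    """Return the n x n identity matrix I_n as a list of rows."""
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def basis_vector(i, n):
    """Return e^(i) of length n, with a 1 at position i (1-based, as in the notation)."""
    return [1 if k == i else 0 for k in range(1, n + 1)]


def diag(a):
    """Return the square diagonal matrix whose diagonal entries are given by a."""
    n = len(a)
    return [[a[r] if r == c else 0 for c in range(n)] for r in range(n)]


I3 = identity(3)          # 3 x 3 identity matrix
e2 = basis_vector(2, 4)   # 1 at position 2 of a length-4 vector
D = diag([5, 7])          # 2 x 2 matrix with 5 and 7 on the diagonal
```

Nested lists keep the sketch free of dependencies; a numerical library would normally be used for arrays of any real size.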
CONTENTS
Sets and Graphs
$\mathbb{A}$    A set
$\mathbb{R}$    The set of real numbers
$\{0, 1\}$    The set containing 0 and 1
$\{0, 1, \dots, n\}$    The set of all integers between 0 and $n$
$[a, b]$    The real interval including $a$ and $b$
$(a, b]$    The real interval excluding $a$ but including $b$
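$\mathbb{A} \backslash \mathbb{B}$    Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$    A graph
$Pa_{\mathcal{G}}(\mathrm{x}_i)$    The parents of $\mathrm{x}_i$ in $\mathcal{G}$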
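The set and graph notation above maps loosely onto Python's built-in `set` and `dict` types. The sketch below is only an illustration: the helper names `in_closed` and `in_half_open`, the node names, and the choice to store a graph as a dictionary of parent lists are all assumptions made for this example.

```python
n = 4
integers = set(range(n + 1))   # {0, 1, ..., n}

A = {1, 2, 3, 4}
B = {3, 4, 5}
difference = A - B             # A \ B: elements of A that are not in B


def in_closed(x, a, b):
    """True if x lies in the real interval [a, b]."""
    return a <= x <= b


def in_half_open(x, a, b):
    """True if x lies in the real interval (a, b]."""
    return a < x <= b


# A directed graph stored as a mapping from each node to its list of parents,
# so parents[node] plays the role of Pa_G(node). Node names are made up.
parents = {"x1": [], "x2": ["x1"], "x3": ["x1", "x2"]}
```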
Indexing
$a_i$    Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
$\boldsymbol{a}_{-i}$    All elements of vector $\boldsymbol{a}$ except for element $i$
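$A_{i,j}$    Element $i, j$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i,:}$    Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$    Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$    Element $(i, j, k)$ of a 3-D tensor $\boldsymbol{\mathsf{A}}$
$\boldsymbol{\mathsf{A}}_{:,:,i}$    2-D slice of a 3-D tensor
$\mathrm{a}_i$    Element $i$ of the random vector $\mathbf{a}$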
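Because the notation indexes from 1 while Python indexes from 0, translating the entries above into code takes some care. The following sketch shows one possible translation using nested lists; the helper names `element`, `all_but`, `row`, and `column` are hypothetical.

```python
a = [10, 20, 30, 40]
A = [[1, 2, 3],
     [4, 5, 6]]


def element(a, i):
    """a_i, with i counted from 1 as in the notation."""
    return a[i - 1]


def all_but(a, i):
    """a_{-i}: every element of a except element i (1-based)."""
    return a[:i - 1] + a[i:]


def row(A, i):
    """A_{i,:}: row i of the matrix A (1-based)."""
    return list(A[i - 1])


def column(A, j):
    """A_{:,j}: column j of the matrix A (1-based)."""
    return [r[j - 1] for r in A]


a_1 = element(a, 1)
a_without_2 = all_but(a, 2)
A_2_3 = A[2 - 1][3 - 1]    # A_{2,3}
```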
Linear Algebra Operations
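$\boldsymbol{A}^\top$    Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$    Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$    Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$    Determinant of $\boldsymbol{A}$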
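A small sketch of three of these operations on nested lists may help make the notation concrete. The function names are hypothetical, the determinant uses cofactor expansion (suitable only for small matrices), and the pseudoinverse is omitted because it is normally computed with a numerical library.

```python
def transpose(A):
    """Return the transpose of A: rows become columns."""
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Return the element-wise (Hadamard) product of equally shaped A and B."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    total = 0
    for j, entry in enumerate(A[0]):
        # Minor: drop the first row and column j.
        minor = [r[:j] + r[j + 1:] for r in A[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
At = transpose(A)
AB = hadamard(A, B)
dA = det(A)
```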
Calculus
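$\frac{dy}{dx}$    Derivative of $y$ with respect to $x$
$\frac{\partial y}{\partial x}$    Partial derivative of $y$ with respect to $x$
$\nabla_{\boldsymbol{x}} y$    Gradient of $y$ with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$    Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
$\nabla_{\boldsymbol{\mathsf{X}}} y$    Tensor containing derivatives of $y$ with respect to $\boldsymbol{\mathsf{X}}$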
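$\frac{\partial f}{\partial \boldsymbol{x}}$    Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$
$\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$    The Hessian matrix of $f$ at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x}) d\boldsymbol{x}$    Definite integral over the entire domain of $\boldsymbol{x}$
$\int_{\mathbb{S}} f(\boldsymbol{x}) d\boldsymbol{x}$    Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$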
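To make the shapes of the gradient and the Jacobian concrete, the sketch below approximates both with central finite differences. This is a numerical approximation chosen purely for illustration, not a method prescribed by the notation; the function names, the step size `h`, and the two test functions are assumptions.

```python
def gradient(f, x, h=1e-6):
    """Approximate the gradient of a scalar-valued f at the point x (list of floats)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def jacobian(f, x, h=1e-6):
    """Approximate J in R^{m x n} for f: R^n -> R^m, with J[r][c] = d f_r / d x_c."""
    columns = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        columns.append([(u - d) / (2 * h) for u, d in zip(f(up), f(down))])
    # Each entry of `columns` holds the derivatives with respect to one input,
    # so transposing gives one row per output.
    return [list(r) for r in zip(*columns)]


# y = x_1^2 + 3 x_2 has gradient [2 x_1, 3].
g = gradient(lambda x: x[0] ** 2 + 3 * x[1], [2.0, 1.0])

# f: R^2 -> R^3, so the Jacobian has m = 3 rows and n = 2 columns.
J = jacobian(lambda x: [x[0] * x[1], x[0] + x[1], x[0] ** 2], [2.0, 3.0])
```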
Probability and Information Theory
$\mathrm{a} \perp \mathrm{b}$    The random variables a and b are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$    They are conditionally independent given c
$P(\mathrm{a})$    A probability distribution over a discrete variable
$p(\mathrm{a})$    A probability distribution over a continuous variable, or over a variable whose type has not been specified
$\mathrm{a} \sim P$    Random variable a has distribution $P$
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$    Expectation of $f(x)$ with respect to $P(\mathrm{x})$
$\text{Var}(f(x))$    Variance of $f(x)$ under $P(\mathrm{x})$
$\text{Cov}(f(x), g(x))$    Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$
$H(\mathrm{x})$    Shannon entropy of the random variable x
$D_{\text{KL}}(P \| Q)$    Kullback-Leibler divergence of $P$ and $Q$
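$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$    Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$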
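For a discrete distribution, several of the quantities above reduce to short sums. The sketch below assumes a distribution given as a list of values with matching probabilities, and uses the natural logarithm so that entropy and divergence are measured in nats; the function names and the example numbers are made up for illustration.

```python
import math


def expectation(f, values, probs):
    """E_{x ~ P}[f(x)] for a discrete P given as values and their probabilities."""
    return sum(p * f(x) for x, p in zip(values, probs))


def variance(f, values, probs):
    """Var(f(x)) under P: the expected squared deviation of f(x) from its mean."""
    mean = expectation(f, values, probs)
    return expectation(lambda x: (f(x) - mean) ** 2, values, probs)


def entropy(probs):
    """Shannon entropy H(x) in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(P, Q):
    """D_KL(P || Q) in nats; assumes Q is positive wherever P is positive."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)


values = [0, 1, 2]
P = [0.25, 0.5, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]

mean = expectation(lambda x: x, values, P)
var = variance(lambda x: x, values, P)
H = entropy(P)
kl = kl_divergence(P, Q)
```

Note that `kl_divergence(P, Q)` and `kl_divergence(Q, P)` generally differ, which is why the order of the arguments in $D_{\text{KL}}(P \| Q)$ matters.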
Functions
$f : \mathbb{A} \to \mathbb{B}$    The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
$f \circ g$    Composition of the functions $f$ and $g$
$f(\boldsymbol{x}; \boldsymbol{\theta})$    A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. Sometimes we write $f(\boldsymbol{x})$ and omit the argument $\boldsymbol{\theta}$ to lighten notation.
$\log x$    Natural logarithm of $x$
$\sigma(x)$    Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$    Softplus, $\log(1 + \exp(x))$
$||\boldsymbol{x}||_p$    $L^p$ norm of $\boldsymbol{x}$
$||\boldsymbol{x}||$    $L^2$ norm of $\boldsymbol{x}$
$x^+$    Positive part of $x$, i.e., $\max(0, x)$
$\mathbf{1}_{\text{condition}}$    Is 1 if the condition is true, 0 otherwise
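The scalar functions in this table translate almost directly into code. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the direct formulas are not guarded against overflow for inputs of large magnitude.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Softplus, log(1 + exp(x))."""
    return math.log(1.0 + math.exp(x))


def lp_norm(x, p=2):
    """L^p norm of the vector x; the default p = 2 gives the L^2 norm."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def positive_part(x):
    """x^+ = max(0, x)."""
    return max(0, x)


def indicator(condition):
    """1 if the condition is true, 0 otherwise."""
    return 1 if condition else 0


s0 = sigmoid(0.0)
z0 = softplus(0.0)
l2 = lp_norm([3.0, 4.0])
l1 = lp_norm([3.0, -4.0], p=1)
```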
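Sometimes we use a function $f$ whose argument is a scalar, but apply it to a
vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or
$f(\boldsymbol{\mathsf{X}})$. This denotes the application of $f$ to the array
element-wise. For example, if $\boldsymbol{\mathsf{C}} = \sigma(\boldsymbol{\mathsf{X}})$,
then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of
$i$, $j$ and $k$.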
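The element-wise convention can be sketched with a small recursive helper that applies a scalar function to every entry of a nested list. The name `apply_elementwise` and the example tensor are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_elementwise(f, array):
    """Apply the scalar function f to every entry of a nested list of any depth."""
    if isinstance(array, list):
        return [apply_elementwise(f, item) for item in array]
    return f(array)


# A made-up 2 x 2 x 2 tensor stored as nested lists.
X = [[[0.0, 1.0], [-1.0, 2.0]],
     [[3.0, -3.0], [0.5, -0.5]]]

# C has the same shape as X, and C[i][j][k] == sigmoid(X[i][j][k]).
C = apply_elementwise(sigmoid, X)
```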
Datasets and Distributions
$p_{\text{data}}$    The data generating distribution
$\hat{p}_{\text{data}}$    The empirical distribution defined by the training set
$\mathbb{X}$    A set of training examples
$\boldsymbol{x}^{(i)}$    The $i$-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$    The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$    The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$
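As a loose illustration of the dataset notation, the sketch below stacks a few made-up examples into an $m \times n$ design matrix and computes an empirical probability over the targets. All data values and helper names are hypothetical.

```python
# Three made-up input examples x^(1), x^(2), x^(3), each with n = 2 features.
examples = [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]]
# Made-up targets y^(1), y^(2), y^(3) for supervised learning.
targets = [0, 0, 1]

# Design matrix X: example x^(i) is stored in row X_{i,:}.
X = [list(x) for x in examples]
m, n = len(X), len(X[0])


def example(X, i):
    """Return x^(i), the i-th example (1-based), i.e. row X_{i,:}."""
    return X[i - 1]


def empirical_probability(observations, value):
    """Empirical distribution: each observation carries probability mass 1/m."""
    return sum(1 for obs in observations if obs == value) / len(observations)


x2 = example(X, 2)
p_hat_zero = empirical_probability(targets, 0)
```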