Chapter 20
Deep Generative Models
In this chapter, we present several of the specific kinds of generative models that
can be built and trained using the techniques presented in Chapters 16, 17, 18 and
19. All of these models represent probability distributions over multiple variables
in some way. Some allow the probability distribution function to be evaluated
explicitly. Others do not allow the evaluation of the probability distribution
function, but support operations that implicitly require knowledge of it, such
as drawing samples from the distribution. Some of these models are structured
probabilistic models described in terms of graphs and factors, using the language
of graphical models presented in Chapter 16. Others cannot easily be described
in terms of factors, but represent probability distributions nonetheless.
20.1 Boltzmann Machines
Boltzmann machines were originally introduced as a general “connectionist” ap-
proach to learning arbitrary probability distributions over binary vectors (Fahlman
et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986).
Variants of the Boltzmann machine that include other kinds of variables have long
ago surpassed the popularity of the original. In this section we briefly introduce
the binary Boltzmann machine and discuss the issues that come up when trying to
train and perform inference in the model.
We define the Boltzmann machine over a $d$-dimensional binary random vector $x \in \{0, 1\}^d$. The Boltzmann machine is an energy-based model (Sec. 16.2.4),
meaning we define the joint probability distribution using an energy function:
\[ P(x) = \frac{\exp(-E(x))}{Z}, \tag{20.1} \]
where $E(x)$ is the energy function and $Z$ is the partition function that ensures that $\sum_x P(x) = 1$. The energy function of the Boltzmann machine is given by
\[ E(x) = -x^\top U x - b^\top x, \tag{20.2} \]
where $U$ is the "weight" matrix of model parameters and $b$ is the vector of bias parameters.
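To make this concrete, here is a small NumPy sketch (our own illustration, not from the original text) that evaluates Eq. 20.2 and Eq. 20.1 for a tiny model. The partition function is computed by brute-force enumeration over all $2^d$ states, which is exactly the intractable computation that the techniques of Chapter 18 are designed to avoid:

```python
import itertools
import numpy as np

def energy(x, U, b):
    # E(x) = -x^T U x - b^T x  (Eq. 20.2)
    return -x @ U @ x - b @ x

def partition_function(U, b):
    # Brute-force sum over all 2^d binary states; feasible only for tiny d.
    d = len(b)
    return sum(np.exp(-energy(np.array(s), U, b))
               for s in itertools.product([0, 1], repeat=d))

rng = np.random.default_rng(0)
d = 4
U = rng.normal(scale=0.1, size=(d, d))
U = (U + U.T) / 2                      # symmetric weights
b = rng.normal(scale=0.1, size=d)
x = rng.integers(0, 2, size=d)
Z = partition_function(U, b)
print(np.exp(-energy(x, U, b)) / Z)    # P(x), Eq. 20.1
```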
In the general setting of the Boltzmann machine, we are given a set of training examples, each of which is $n$-dimensional. Eq. 20.1 describes the joint probability distribution over the observed variables. While this scenario is certainly viable, it does limit the kinds of interactions between the observed variables to those described by the weight matrix. Specifically, it means that the probability of one unit being on is given by a linear model (logistic regression) from the values of the other units.
The Boltzmann machine becomes more powerful when not all the variables
are observed. In this case, the non-observed variables, or latent variables, can
act similarly to hidden units in a multi-layer perceptron and model higher-order
interactions among the visible units. Just as the addition of hidden units to
convert logistic regression into an MLP results in the MLP being a universal
approximator of functions, a Boltzmann machine with hidden units is no longer
limited to modeling linear relationships between variables. Instead, the Boltzmann
machine becomes a universal approximator of probability mass functions over
discrete variables (Le Roux and Bengio, 2008).
Formally, we decompose the units $x$ into two subsets: the visible units $v$ and the latent (or hidden) units $h$. The energy function becomes
\[ E(v, h) = -v^\top R v - v^\top W h - h^\top S h - b^\top v - c^\top h. \tag{20.3} \]
Boltzmann Machine Learning
Learning algorithms for Boltzmann machines
are usually based on maximum likelihood. All Boltzmann machines have an
intractable partition function, so the maximum likelihood gradient must be ap-
proximated using the techniques described in Chapter 18.
One interesting property of Boltzmann machines when trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected under different distributions: $P_{\text{model}}(v)$ and $\hat{P}_{\text{data}}(v) P_{\text{model}}(h \mid v)$. The rest of the
network participates in shaping those statistics, but the weight can be updated
without knowing anything about the rest of the network or how those statistics were
produced. This means that the learning rule is “local,” which makes Boltzmann
machine learning somewhat biologically plausible. It is conceivable that if each
neuron were a random variable in a Boltzmann machine, then the axons and
dendrites connecting two random variables could learn only by observing the firing
pattern of the cells that they actually physically touch. In particular, in the
positive phase, two units that frequently activate together have their connection
strengthened. This is an example of a Hebbian learning rule (Hebb, 1949) often
summarized with the mnemonic “fire together, wire together.” Hebbian learning
rules are among the oldest hypothesized explanations for learning in biological
systems and remain relevant today (Giudice et al., 2009).
Other learning algorithms that use more information than local statistics seem
to require us to hypothesize the existence of more machinery than this. For
example, for the brain to implement back-propagation in a multilayer perceptron,
it seems necessary for the brain to maintain a secondary communication network for
transmitting gradient information backwards through the network. Proposals for
biologically plausible implementations (and approximations) of back-propagation
have been made (Hinton, 2007a; Bengio, 2015) but remain to be validated, and
Bengio (2015) links back-propagation of gradients to inference in energy-based
models similar to the Boltzmann machine (but with continuous latent variables).
The negative phase of Boltzmann machine learning is somewhat harder to
explain from a biological point of view. As argued in Sec. 18.2, dream sleep may
be a form of negative phase sampling, though this idea remains more speculative.
20.2 Restricted Boltzmann Machines
Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann
machines are some of the most common building blocks of deep probabilistic models.
We have briefly described RBMs previously, in Sec. 16.7.1. Here we review the
previous information and go into more detail. RBMs are undirected probabilistic
graphical models containing a layer of observable variables and a single layer of
latent variables. RBMs may be stacked (one on top of the other) to form deeper
models. See Fig. 20.1 for some examples. In particular, Fig. 20.1a shows the graph
structure of the RBM itself. It is a bipartite graph, with no connections permitted
between any variables in the observed layer or between any units in the latent
layer.
Figure 20.1: Examples of models that may be built with restricted Boltzmann machines.
(a) The restricted Boltzmann machine itself is an undirected graphical model based on
a bipartite graph, with visible units in one part of the graph and hidden units in the
other part. There are no connections among the visible units, nor any connections among
the hidden units. Typically every visible unit is connected to every hidden unit but it
is possible to construct sparsely connected RBMs such as convolutional RBMs. (b) A
deep belief network is a hybrid graphical model involving both directed and undirected
connections. Like an RBM, it has no intra-layer connections. However, a DBN has
multiple hidden layers, and thus there are connections between hidden units that are in
separate layers. All of the local conditional probability distributions needed by the deep
belief network are copied directly from the local conditional probability distributions of
its constituent RBMs. Alternatively, we could also represent the deep belief network with
a completely undirected graph, but it would need intra-layer connections to capture the
dependencies between parents. (c) A deep Boltzmann machine is an undirected graphical
model with several layers of latent variables. Like RBMs and DBNs, DBMs lack intra-layer
connections. DBMs are less closely tied to RBMs than DBNs are. When initializing a
DBM from a stack of RBMs, it is necessary to modify the RBM parameters slightly. Some
kinds of DBMs may be trained without first training a set of RBMs.
We begin with the binary version of the restricted Boltzmann machine, but as
we see later there are extensions to other types of visible and hidden units.
More formally, let the observed layer consist of a set of $n_v$ binary random variables, which we refer to collectively with the vector $v$. We refer to the latent or hidden layer of $n_h$ binary random variables as $h$.
Like the general Boltzmann machine, the restricted Boltzmann machine is an
energy-based model with the joint probability distribution specified by its energy
function:
\[ P(\mathrm{v} = v, \mathrm{h} = h) = \frac{1}{Z} \exp(-E(v, h)). \tag{20.4} \]
The energy function for an RBM is given by
\[ E(v, h) = -b^\top v - c^\top h - v^\top W h, \tag{20.5} \]
and $Z$ is the normalizing constant known as the partition function:
\[ Z = \sum_v \sum_h \exp\{-E(v, h)\}. \tag{20.6} \]
It is apparent from the definition of the partition function $Z$ that the naive method of computing $Z$ (exhaustively summing over all states) could be computationally intractable, unless a cleverly designed algorithm could exploit regularities in the probability distribution to compute $Z$ faster. In the case of restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function $Z$ is intractable. The intractable partition function $Z$ implies that the normalized joint probability distribution $P(v)$ is also intractable to evaluate.
20.2.1 Conditional Distributions
Though $P(v)$ is intractable, the bipartite graph structure of the RBM has the very special property that its conditional distributions $P(h \mid v)$ and $P(v \mid h)$ are factorial and relatively simple to compute and to sample from.
Deriving the conditional distributions from the joint distribution is straightforward:
\begin{align}
P(h \mid v) &= \frac{P(h, v)}{P(v)} \tag{20.7} \\
&= \frac{1}{P(v)} \frac{1}{Z} \exp\left\{ b^\top v + c^\top h + v^\top W h \right\} \tag{20.8} \\
&= \frac{1}{Z'} \exp\left\{ c^\top h + v^\top W h \right\} \tag{20.9} \\
&= \frac{1}{Z'} \exp\left\{ \sum_{j=1}^{n_h} c_j h_j + \sum_{j=1}^{n_h} v^\top W_{:,j} h_j \right\} \tag{20.10} \\
&= \frac{1}{Z'} \prod_{j=1}^{n_h} \exp\left\{ c_j h_j + v^\top W_{:,j} h_j \right\}. \tag{20.11}
\end{align}
Since we are conditioning on the visible units $v$, we can treat these as constant with respect to the distribution $P(h \mid v)$. The factorial nature of the conditional $P(h \mid v)$ follows immediately from our ability to write the joint probability over the vector $h$ as the product of (unnormalized) distributions over the individual elements, $h_j$. It is now a simple matter of normalizing the distributions over the individual binary $h_j$:
\begin{align}
P(h_j = 1 \mid v) &= \frac{\tilde{P}(h_j = 1 \mid v)}{\tilde{P}(h_j = 0 \mid v) + \tilde{P}(h_j = 1 \mid v)} \tag{20.12} \\
&= \frac{\exp\left\{ c_j + v^\top W_{:,j} \right\}}{\exp\{0\} + \exp\left\{ c_j + v^\top W_{:,j} \right\}} \tag{20.13} \\
&= \sigma\left( c_j + v^\top W_{:,j} \right). \tag{20.14}
\end{align}
We can now express the full conditional over the hidden layer as the factorial distribution:
\[ P(h \mid v) = \prod_{j=1}^{n_h} \sigma\left( (2h - 1) \odot (c + W^\top v) \right)_j. \tag{20.15} \]
A similar derivation will show that the other conditional of interest to us, $P(v \mid h)$, is also a factorial distribution:
\[ P(v \mid h) = \prod_{i=1}^{n_v} \sigma\left( (2v - 1) \odot (b + W h) \right)_i. \tag{20.16} \]
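As an illustration, the factorial conditionals of Eqs. 20.14 and 20.16 can be sampled with a few lines of NumPy. This is a sketch with helper names of our own choosing, assuming binary units:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v, W, c, rng):
    # P(h_j = 1 | v) = sigma(c_j + v^T W_{:,j})  (Eq. 20.14)
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, b, rng):
    # P(v_i = 1 | h) = sigma(b_i + W_{i,:} h)    (from Eq. 20.16)
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float)
```

Alternating these two samplers is exactly the block Gibbs sampling referred to in the next section.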
20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of $\tilde{P}(v)$ and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in Chapter 18 for training models that have intractable partition functions. This includes CD, SML (PCD), ratio matching and so on. Compared to other undirected models used in deep learning, the RBM is relatively straightforward to train because we can compute $P(h \mid v)$ exactly in closed form. Some other deep models, such as the deep Boltzmann machine, combine both the difficulty of an intractable partition function and the difficulty of intractable inference.
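For example, a single CD-1 update, one of the simplest of these techniques, might be sketched as follows, reusing numpy as np and the sampling helpers from the previous sketch. This is an illustration, not a tuned implementation:

```python
def cd1_update(v0, W, b, c, lr, rng):
    # Positive phase statistics, driven by a data vector v0.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of block Gibbs sampling.
    v1 = sample_v_given_h(h0, W, b, rng)
    ph1 = sigmoid(c + v1 @ W)
    # Approximate maximum likelihood gradient step.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
```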
20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first non-convolutional models to
successfully admit training of deep architectures (Hinton et al., 2006; Hinton,
2007b). The introduction of deep belief networks in 2006 began the current deep
learning renaissance. Prior to the introduction of deep belief networks, deep models
were considered too difficult to optimize. Kernel machines with convex objective
functions dominated the research landscape. Deep belief networks demonstrated
that deep architectures can be successful, by outperforming kernelized support
vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief
networks have mostly fallen out of favor and are rarely used, even compared to
other unsupervised or generative learning algorithms, but they are still deservedly
recognized for their important role in deep learning history.
Deep belief networks are generative models with several layers of latent variables.
The latent variables are typically binary, while the visible units may be binary
or real. There are no intra-layer connections. Usually, every unit in each layer is
connected to every unit in each neighboring layer, though it is possible to construct
more sparsely connected DBNs. The connections between the top two layers are
undirected. The connections between all other layers are directed, with the arrows
pointed toward the layer that is closest to the data. See Fig. 20.1b for an example.
A DBN with $l$ hidden layers contains $l$ weight matrices: $W^{(1)}, \ldots, W^{(l)}$. It also contains $l + 1$ bias vectors: $b^{(0)}, \ldots, b^{(l)}$, with $b^{(0)}$ providing the biases for the visible layer. The probability distribution represented by the DBN is given by
\[ P(h^{(l)}, h^{(l-1)}) \propto \exp\left( b^{(l)\top} h^{(l)} + b^{(l-1)\top} h^{(l-1)} + h^{(l-1)\top} W^{(l)} h^{(l)} \right), \tag{20.17} \]
\[ P(h^{(k)}_i = 1 \mid h^{(k+1)}) = \sigma\left( b^{(k)}_i + W^{(k+1)\top}_{:,i} h^{(k+1)} \right) \;\; \forall i, \forall k \in 1, \ldots, l-2, \tag{20.18} \]
\[ P(v_i = 1 \mid h^{(1)}) = \sigma\left( b^{(0)}_i + W^{(1)\top}_{:,i} h^{(1)} \right) \;\; \forall i. \tag{20.19} \]
In the case of real-valued visible units, substitute
\[ \mathrm{v} \sim \mathcal{N}\left( v;\; b^{(0)} + W^{(1)\top} h^{(1)},\; \beta^{-1} \right), \tag{20.20} \]
with $\beta$ diagonal for tractability. Generalizations to other exponential family visible units are straightforward, at least in theory. A DBN with only one hidden layer is just an RBM.
To generate a sample from a DBN, we first run several steps of Gibbs sampling
on the top two hidden layers. This stage is essentially drawing a sample from
the RBM defined by the top two hidden layers. We can then use a single pass of
ancestral sampling through the rest of the model to draw a sample from the visible
units.
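A sketch of this two-stage sampling procedure, reusing the RBM samplers from the sketch in Sec. 20.2.1. Ws and bs are hypothetical lists holding $W^{(1)}, \ldots, W^{(l)}$ and $b^{(0)}, \ldots, b^{(l)}$:

```python
def sample_dbn(Ws, bs, n_gibbs, rng):
    """Ws = [W^(1), ..., W^(l)]; bs = [b^(0), ..., b^(l)]."""
    # Gibbs sampling in the top RBM, defined over h^(l-1) and h^(l).
    h_top = rng.integers(0, 2, size=Ws[-1].shape[1]).astype(float)
    for _ in range(n_gibbs):
        h_below = sample_v_given_h(h_top, Ws[-1], bs[-2], rng)
        h_top = sample_h_given_v(h_below, Ws[-1], bs[-1], rng)
    # One ancestral pass down the directed layers (Eqs. 20.18-20.19).
    h = h_below
    for W, b in zip(reversed(Ws[:-1]), reversed(bs[:-2])):
        h = sample_v_given_h(h, W, b, rng)
    return h   # a sample of the visible units
```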
Deep belief networks incur many of the problems associated with both directed
models and undirected models.
Inference in a deep belief network is intractable due to the explaining away
effect within each directed layer, and due to the interaction between the two hidden
layers that have undirected connections. Evaluating or maximizing the standard
evidence lower bound on the log-likelihood is also intractable, because the evidence
lower bound takes the expectation of cliques whose size is equal to the network
width.
Evaluating or maximizing the log-likelihood requires not just confronting the
problem of intractable inference to marginalize out the latent variables, but also
the problem of an intractable partition function within the undirected model of
the top two layers.
To train a deep belief network, one begins by training an RBM to maximize $\mathbb{E}_{\mathrm{v} \sim p_{\text{data}}} \log p(v)$ using contrastive divergence or stochastic maximum likelihood.
The parameters of the RBM then define the parameters of the first layer of the
DBN. Next, a second RBM is trained to approximately maximize
\[ \mathbb{E}_{\mathrm{v} \sim p_{\text{data}}} \mathbb{E}_{h^{(1)} \sim p^{(1)}(h^{(1)} \mid v)} \log p^{(2)}(h^{(1)}) \tag{20.21} \]
where $p^{(1)}$ is the probability distribution represented by the first RBM and $p^{(2)}$ is the probability distribution represented by the second RBM. In other words,
the second RBM is trained to model the distribution defined by sampling the
hidden units of the first RBM, when the first RBM is driven by the data. This
procedure can be repeated indefinitely, to add as many layers to the DBN as
desired, with each new RBM modeling the samples of the previous one. Each RBM
defines another layer of the DBN. This procedure can be justified as increasing a
variational lower bound on the log-likelihood of the data under the DBN (Hinton
et al., 2006).
In most applications, no effort is made to jointly train the DBN after the greedy
layer-wise procedure is complete. However, it is possible to perform generative
fine-tuning using the wake-sleep algorithm.
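The greedy stacking loop itself is short. In the sketch below, train_rbm is a hypothetical helper that fits one RBM with CD or SML and returns its parameters; each new RBM is driven by the mean hidden activations of the one below, as in Eq. 20.21 (sigmoid is reused from the earlier sketch):

```python
def greedy_pretrain_dbn(data, hidden_sizes, train_rbm):
    """train_rbm(X, n_hidden) -> (W, b, c) is a hypothetical helper
    that fits one RBM and returns its parameters."""
    Ws, cs = [], []
    layer_input = data                 # design matrix, one example per row
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        Ws.append(W)
        cs.append(c)
        # Drive the next RBM with this RBM's P(h = 1 | v)  (cf. Eq. 20.21).
        layer_input = sigmoid(c + layer_input @ W)
    return Ws, cs
```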
The trained DBN may be used directly as a generative model, but most of the
interest in DBNs arose from their ability to improve classification models. We can
take the weights from the DBN and use them to define an MLP:
\[ h^{(1)} = \sigma\left( b^{(1)} + v^\top W^{(1)} \right), \tag{20.22} \]
\[ h^{(l)} = \sigma\left( b^{(l)} + h^{(l-1)\top} W^{(l)} \right) \;\; \forall l \in 2, \ldots, m. \tag{20.23} \]
After initializing this MLP with the weights and biases learned via generative
training of the DBN, we may train the MLP to perform a classification task. This
additional training of the MLP is an example of discriminative fine-tuning.
This specific choice of MLP is somewhat arbitrary, compared to many of the
inference equations in Chapter 19 that are derived from first principles. This MLP
is a heuristic choice that seems to work well in practice and is used consistently
in the literature. Many approximate inference techniques are motivated by their
ability to find a maximally tight variational lower bound on the log-likelihood
under some set of constraints. One can construct a variational lower bound on the
log-likelihood using the hidden unit expectations defined by the DBN’s MLP, but
this is true of any probability distribution over the hidden units, and there is no
reason to believe that this MLP provides a particularly tight bound. In particular,
the MLP ignores many important interactions in the DBN graphical model. The
MLP propagates information upward from the visible units to the deepest hidden
units, but does not propagate any information downward or sideways. The DBN
graphical model has explaining away interactions between all of the hidden units
within the same layer as well as top-down interactions between layers.
While the log-likelihood of a DBN is intractable, it may be approximated with
AIS (Salakhutdinov and Murray, 2008). This permits evaluating its quality as a
generative model.
The term “deep belief network” is commonly used incorrectly to refer to any
kind of deep neural network, even networks without latent variable semantics.
The term “deep belief network” should refer specifically to models with undirected
connections in the deepest layer and directed connections pointing downward
between all other pairs of consecutive layers.
The term “deep belief network” may also cause some confusion because the
term “belief network” is sometimes used to refer to purely directed models, while
deep belief networks contain an undirected layer. Deep belief networks also share
the acronym DBN with dynamic Bayesian networks (Dean and Kanazawa, 1989),
which are Bayesian networks for representing Markov chains.
Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer
(bottom) and two hidden layers. Connections are only between units in neighboring layers.
There are no intra-layer connections.
20.4 Deep Boltzmann Machines
A deep Boltzmann machine or DBM (Salakhutdinov and Hinton, 2009a) is another
kind of deep, generative model. Unlike the deep belief network (DBN), it is an
entirely undirected model. Unlike the RBM, the DBM has several layers of latent
variables (RBMs have just one). But like the RBM, within each layer, the variables are mutually independent, conditioned on the variables in the neighboring
layers. See Fig. 20.2 for the graph structure. Deep Boltzmann machines have been
applied to a variety of tasks including document modeling (Srivastava et al., 2013).
Like RBMs and DBNs, DBMs typically contain only binary units—as we
assume for simplicity of our presentation of the model—but it is straightforward
to include real-valued visible units.
A DBM is an energy-based model, meaning that the joint probability distribution over the model variables is parametrized by an energy function $E$. In the case of a deep Boltzmann machine with one visible layer, $v$, and three hidden layers, $h^{(1)}$, $h^{(2)}$ and $h^{(3)}$, the joint probability is given by:
\[ P\left( v, h^{(1)}, h^{(2)}, h^{(3)} \right) = \frac{1}{Z(\theta)} \exp\left( -E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) \right). \tag{20.24} \]
To simplify our presentation, we omit the bias parameters below. The DBM energy function is then defined as follows:
\[ E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)} - h^{(2)\top} W^{(3)} h^{(3)}. \tag{20.25} \]
Figure 20.3: A deep Boltzmann machine, re-arranged to reveal its bipartite graph structure.
In comparison to the RBM energy function (Eq. 20.5), the DBM energy function includes connections between the hidden units (latent variables) in the form of the weight matrices ($W^{(2)}$ and $W^{(3)}$). As we will see, these connections have significant consequences both for the model behavior and for how we go about performing inference in the model.
In comparison to fully connected Boltzmann machines (with every unit con-
nected to every other unit), the DBM offers some advantages that are similar to
those offered by the RBM. Specifically, as illustrated in Fig. 20.3, the DBM layers
can be organized into a bipartite graph, with odd layers on one side and even layers
on the other. This immediately implies that when we condition on the variables in
the even layer, the variables in the odd layers become conditionally independent.
Of course, when we condition on the variables in the odd layers, the variables in
the even layers also become conditionally independent.
The bipartite structure of the DBM means that we can apply the same equa-
tions we have previously used for the conditional distributions of an RBM to
determine the conditional distributions in a DBM. The units within a layer are
conditionally independent from each other given the values of the neighboring
layers, so the distributions over binary variables can be fully described by the
Bernoulli parameters giving the probability of each unit being active. In our
example with two hidden layers, the activation probabilities are given by:
\[ P(v_i = 1 \mid h^{(1)}) = \sigma\left( W^{(1)}_{i,:} h^{(1)} \right), \tag{20.26} \]
\[ P(h^{(1)}_i = 1 \mid v, h^{(2)}) = \sigma\left( v^\top W^{(1)}_{:,i} + W^{(2)}_{i,:} h^{(2)} \right) \tag{20.27} \]
and
\[ P(h^{(2)}_k = 1 \mid h^{(1)}) = \sigma\left( h^{(1)\top} W^{(2)}_{:,k} \right). \tag{20.28} \]
The bipartite structure makes Gibbs sampling in a deep Boltzmann machine
efficient. The naive approach to Gibbs sampling is to update only one variable
at a time. RBMs allow all of the visible units to be updated in one block and all
of the hidden units to be updated in a second block. One might naively assume that a DBM with $l$ layers requires $l + 1$ updates, with each iteration updating a block consisting of one layer of units. Instead, it is possible to update all of the
units in only two iterations. Gibbs sampling can be divided into two blocks of
updates, one including all even layers (including the visible layer) and the other
including all odd layers. Due to the bipartite DBM connection pattern, given
the even layers, the distribution over the odd layers is factorial and thus can be
sampled simultaneously and independently as a block. Likewise, given the odd
layers, the even layers can be sampled simultaneously and independently as a
block. Efficient sampling is especially important for training with the stochastic
maximum likelihood algorithm.
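A sketch of one such two-block sweep for a DBM with any number of layers, reusing the sigmoid helper from earlier. Here layers[0] is the visible layer, Ws[k] connects layers k and k+1, and biases are omitted as in Eq. 20.25 (names are ours):

```python
def dbm_gibbs_step(layers, Ws, rng):
    def resample(k):
        # Conditional for layer k given its neighbors (cf. Eqs. 20.26-20.28).
        a = np.zeros_like(layers[k])
        if k > 0:                      # bottom-up input
            a += layers[k - 1] @ Ws[k - 1]
        if k < len(layers) - 1:        # top-down input
            a += Ws[k] @ layers[k + 1]
        p = sigmoid(a)
        layers[k] = (rng.random(p.shape) < p).astype(float)
    # Block 1: even layers (including the visible layer).
    for k in range(0, len(layers), 2):
        resample(k)
    # Block 2: odd layers.
    for k in range(1, len(layers), 2):
        resample(k)
```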
20.4.1 Interesting Properties
Deep Boltzmann machines have many interesting properties.
DBMs were developed after DBNs. Compared to DBNs, the posterior distribution $P(h \mid v)$ is simpler for DBMs. Somewhat counterintuitively, the simplicity of
this posterior distribution allows richer approximations of the posterior. In the case
of the DBN, we perform classification using a heuristically motivated approximate
inference procedure, in which we guess that a reasonable value for the mean field
expectation of the hidden units can be provided by an upward pass through the
network in an MLP that uses sigmoid activation functions and the same weights
as the original DBN. Any distribution $Q(h)$ may be used to obtain a variational lower bound on the log-likelihood. This heuristic procedure therefore allows us to
obtain such a bound. However, the bound is not explicitly optimized in any way, so
the bound may be far from tight. In particular, the heuristic estimate of $Q$ ignores
interactions between hidden units within the same layer as well as the top-down
feedback influence of hidden units in deeper layers on hidden units that are closer
to the input. Because the heuristic MLP-based inference procedure in the DBN
is not able to account for these interactions, the resulting $Q$ is presumably far from optimal. In DBMs, all of the hidden units within a layer are conditionally
independent given the other layers. This lack of intra-layer interaction makes it
possible to use fixed point equations to actually optimize the variational lower
bound and find the true optimal mean field expectations (to within some numerical
tolerance).
The use of proper mean field allows the approximate inference procedure for
DBMs to capture the influence of top-down feedback interactions. This makes
DBMs interesting from the point of view of neuroscience, because the human brain
is known to use many top-down feedback connections. Because of this property,
DBMs have been used as computational models of real neuroscientific phenomena
(Series et al., 2010; Reichert et al., 2011).
One unfortunate property of DBMs is that sampling from them is relatively
difficult. DBNs only need to use MCMC sampling in their top pair of layers. The
other layers are used only at the end of the sampling process, in one efficient
ancestral sampling pass. To generate a sample from a DBM, it is necessary to
use MCMC across all layers, with every layer of the model participating in every
Markov chain transition.
20.4.2 DBM Mean Field Inference
The conditional distribution over one DBM layer given the neighboring layers is factorial. In the example of the DBM with two hidden layers, these distributions are $P(v \mid h^{(1)})$, $P(h^{(1)} \mid v, h^{(2)})$ and $P(h^{(2)} \mid h^{(1)})$. The distribution over all hidden layers generally does not factorize because of interactions between layers. In the example with two hidden layers, $P(h^{(1)}, h^{(2)} \mid v)$ does not factorize due to the interaction weights $W^{(2)}$ between $h^{(1)}$ and $h^{(2)}$, which render these variables mutually dependent.
As was the case with the DBN, we are left to seek out methods to approximate
the DBM posterior distribution. However, unlike the DBN, the DBM posterior
distribution over its hidden units—while complicated—is easy to approximate
with a variational approximation (as discussed in Sec. 19.4), specifically a mean
field approximation. The mean field approximation is a simple form of variational
inference, where we restrict the approximating distribution to fully factorial distri-
butions. In the context of DBMs, the mean field equations capture the bidirectional
interactions between layers. In this section we derive the iterative approximate
inference procedure originally introduced in Salakhutdinov and Hinton (2009a).
In variational approximations to inference, we approach the task of approxi-
mating a particular target distribution—in our case, the posterior distribution over
the hidden units given the visible units—by some reasonably simple family of dis-
tributions. In the case of the mean field approximation, the approximating family
is the set of distributions where the hidden units are conditionally independent.
We now develop the mean field approach for the example with two hidden layers. Let $Q(h^{(1)}, h^{(2)} \mid v)$ be the approximation of $P(h^{(1)}, h^{(2)} \mid v)$. The mean field assumption implies that
\[ Q(h^{(1)}, h^{(2)} \mid v) = \prod_j Q(h^{(1)}_j \mid v) \prod_k Q(h^{(2)}_k \mid v). \tag{20.29} \]
The mean field approximation attempts to find a member of this family of distributions that best fits the true posterior $P(h^{(1)}, h^{(2)} \mid v)$. Importantly, the inference process must be run again to find a different distribution $Q$ every time we use a new value of $v$.
One can conceive of many ways of measuring how well $Q(h \mid v)$ fits $P(h \mid v)$. The mean field approach is to minimize
\[ \mathrm{KL}(Q \,\|\, P) = \sum_h Q(h^{(1)}, h^{(2)} \mid v) \log\left( \frac{Q(h^{(1)}, h^{(2)} \mid v)}{P(h^{(1)}, h^{(2)} \mid v)} \right). \tag{20.30} \]
In general, we do not have to provide a parametric form of the approximating
distribution beyond enforcing the independence assumptions. The variational
approximation procedure is generally able to recover a functional form of the
approximate distribution. However, in the case of a mean field assumption on
binary hidden units (the case we are developing here) there is no loss of generality
resulting from fixing a parametrization of the model in advance.
We parametrize $Q$ as a product of Bernoulli distributions; that is, we associate the probability of each element of $h^{(1)}$ with a parameter. Specifically, for each $j$, $\hat{h}^{(1)}_j = Q(h^{(1)}_j = 1 \mid v)$, where $\hat{h}^{(1)}_j \in [0, 1]$, and for each $k$, $\hat{h}^{(2)}_k = Q(h^{(2)}_k = 1 \mid v)$, where $\hat{h}^{(2)}_k \in [0, 1]$. Thus we have the following approximation to the posterior:
\begin{align}
Q(h^{(1)}, h^{(2)} \mid v) &= \prod_j Q(h^{(1)}_j \mid v) \prod_k Q(h^{(2)}_k \mid v) \tag{20.31} \\
&= \prod_j (\hat{h}^{(1)}_j)^{h^{(1)}_j} (1 - \hat{h}^{(1)}_j)^{(1 - h^{(1)}_j)} \times \prod_k (\hat{h}^{(2)}_k)^{h^{(2)}_k} (1 - \hat{h}^{(2)}_k)^{(1 - h^{(2)}_k)}. \tag{20.32}
\end{align}
Of course, for DBMs with more layers the approximate posterior parametrization
can be extended in the obvious way, exploiting the bipartite structure of the graph
to update all of the even layers simultaneously and then to update all of the odd
layers simultaneously, following the same schedule as Gibbs sampling.
Now that we have specified our family of approximating distributions $Q$, it remains to specify a procedure for choosing the member of this family that best fits $P$. The most straightforward way to do this is to use the mean field equations specified by Eq. 19.56. These equations were derived by solving for where the derivatives of the variational lower bound are zero. They describe in an abstract manner how to optimize the variational lower bound for any model, simply by taking expectations with respect to $Q$.
Applying these general equations, we obtain the update rules (again, ignoring bias terms):
\[ \hat{h}^{(1)}_j = \sigma\left( \sum_i v_i W^{(1)}_{i,j} + \sum_{k'} W^{(2)}_{j,k'} \hat{h}^{(2)}_{k'} \right) \;\; \forall j, \tag{20.33} \]
\[ \hat{h}^{(2)}_k = \sigma\left( \sum_{j'} W^{(2)}_{j',k} \hat{h}^{(1)}_{j'} \right) \;\; \forall k. \tag{20.34} \]
At a fixed point of this system of equations, we have a local maximum of the variational lower bound $\mathcal{L}(Q)$. Thus these fixed point update equations define an iterative algorithm where we alternate updates of $\hat{h}^{(1)}_j$ (using Eq. 20.33) and updates of $\hat{h}^{(2)}_k$ (using Eq. 20.34). On small problems such as MNIST, as few as ten iterations can be sufficient to find an approximate positive phase gradient for learning, and fifty usually suffice to obtain a high quality representation of a single specific example to be used for high-accuracy classification. Extending approximate variational inference to deeper DBMs is straightforward.
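A minimal sketch of this alternating fixed point iteration for the two-hidden-layer case, with bias terms omitted as in Eqs. 20.33 and 20.34 (assuming numpy and the sigmoid helper from the earlier sketches):

```python
def mean_field(v, W1, W2, n_iters=10):
    # Initialize the variational parameters, e.g. at 0.5.
    h1 = np.full(W1.shape[1], 0.5)
    h2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        h1 = sigmoid(v @ W1 + W2 @ h2)   # Eq. 20.33
        h2 = sigmoid(h1 @ W2)            # Eq. 20.34
    return h1, h2                        # hat{h}^(1), hat{h}^(2)
```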
20.4.3 DBM Parameter Learning
Learning in the DBM must confront both the challenge of an intractable partition function, using the techniques from Chapter 18, and the challenge of an intractable posterior distribution, using the techniques from Chapter 19.
As described in Sec. 20.4.2, variational inference allows the construction of a distribution $Q(h \mid v)$ that approximates the intractable $P(h \mid v)$. Learning then proceeds by maximizing $\mathcal{L}(v, Q, \theta)$, the variational lower bound on the intractable log-likelihood, $\log P(v; \theta)$.
For a deep Boltzmann machine with two hidden layers, $\mathcal{L}$ is given by
\[ \mathcal{L}(Q, \theta) = \sum_i \sum_{j'} v_i W^{(1)}_{i,j'} \hat{h}^{(1)}_{j'} + \sum_{j'} \sum_{k'} \hat{h}^{(1)}_{j'} W^{(2)}_{j',k'} \hat{h}^{(2)}_{k'} - \log Z(\theta) + \mathcal{H}(Q). \tag{20.35} \]
This expression still contains the log partition function, $\log Z(\theta)$. Because a deep
Boltzmann machine contains restricted Boltzmann machines as components, the
hardness results for computing the partition function and sampling that apply to
restricted Boltzmann machines also apply to deep Boltzmann machines. This means
that evaluating the probability mass function of a Boltzmann machine requires
approximate methods such as annealed importance sampling. Likewise, training
the model requires approximations to the gradient of the log partition function. See
Chapter 18 for a general description of these methods. DBMs are typically trained
using stochastic maximum likelihood. Many of the other techniques described in
Chapter 18 are not applicable. Techniques such as pseudolikelihood require the
ability to evaluate the unnormalized probabilities, rather than merely obtain a
variational lower bound on them. Contrastive divergence is slow for deep Boltzmann
machines because they do not allow efficient sampling of the hidden units given the
visible units—instead, contrastive divergence would require burning in a Markov
chain every time a new negative phase sample is needed.
The non-variational version of the stochastic maximum likelihood algorithm was discussed earlier, in Sec. 18.2. Variational stochastic maximum likelihood as applied to the DBM is given in Algorithm 20.1. Recall that we describe a simplified variant of the DBM that lacks bias parameters; including them is trivial.
20.4.4 Layer-Wise Pretraining
Unfortunately, training a DBM using stochastic maximum likelihood (as described
above) from a random initialization usually results in failure. In some cases, the
model fails to learn to represent the distribution adequately. In other cases, the
DBM may represent the distribution well, but with no higher likelihood than could
be obtained with just an RBM. A DBM with very small weights in all but the first
layer represents approximately the same distribution as an RBM.
Various techniques that permit joint training have been developed and are
described in Sec. 20.4.5. However, the original and most popular method for
overcoming the joint training problem of DBMs is greedy layer-wise pretraining.
In this method, each layer of the DBM is trained in isolation as an RBM. The
first layer is trained to model the input data. Each subsequent RBM is trained to
model samples from the previous RBM's posterior distribution. After all of the RBMs have been trained in this way, they can be combined to form a DBM.
Algorithm 20.1 The variational stochastic maximum likelihood algorithm for training a DBM with two hidden layers.

Set $\epsilon$, the step size, to a small positive number.
Set $k$, the number of Gibbs steps, high enough to allow a Markov chain of $p(v, h^{(1)}, h^{(2)}; \theta + \epsilon \Delta_\theta)$ to burn in, starting from samples from $p(v, h^{(1)}, h^{(2)}; \theta)$.
Initialize three matrices, $\tilde{V}$, $\tilde{H}^{(1)}$ and $\tilde{H}^{(2)}$, each with $m$ rows set to random values (e.g., from Bernoulli distributions, possibly with marginals matched to the model's marginals).
while not converged (learning loop) do
  Sample a minibatch of $m$ examples from the training data and arrange them as the rows of a design matrix $V$.
  Initialize matrices $\hat{H}^{(1)}$ and $\hat{H}^{(2)}$, possibly to the model's marginals.
  while not converged (mean field inference loop) do
    $\hat{H}^{(1)} \leftarrow \sigma\left( V W^{(1)} + \hat{H}^{(2)} W^{(2)\top} \right)$.
    $\hat{H}^{(2)} \leftarrow \sigma\left( \hat{H}^{(1)} W^{(2)} \right)$.
  end while
  $\Delta_{W^{(1)}} \leftarrow \frac{1}{m} V^\top \hat{H}^{(1)}$
  $\Delta_{W^{(2)}} \leftarrow \frac{1}{m} \hat{H}^{(1)\top} \hat{H}^{(2)}$
  for $l = 1$ to $k$ (Gibbs sampling) do
    Gibbs block 1:
      $\forall i, j$, sample $\tilde{V}_{i,j}$ from $P(\tilde{V}_{i,j} = 1) = \sigma\left( W^{(1)}_{j,:} \left( \tilde{H}^{(1)}_{i,:} \right)^\top \right)$.
      $\forall i, j$, sample $\tilde{H}^{(2)}_{i,j}$ from $P(\tilde{H}^{(2)}_{i,j} = 1) = \sigma\left( \tilde{H}^{(1)}_{i,:} W^{(2)}_{:,j} \right)$.
    Gibbs block 2:
      $\forall i, j$, sample $\tilde{H}^{(1)}_{i,j}$ from $P(\tilde{H}^{(1)}_{i,j} = 1) = \sigma\left( \tilde{V}_{i,:} W^{(1)}_{:,j} + \tilde{H}^{(2)}_{i,:} W^{(2)\top}_{j,:} \right)$.
  end for
  $\Delta_{W^{(1)}} \leftarrow \Delta_{W^{(1)}} - \frac{1}{m} \tilde{V}^\top \tilde{H}^{(1)}$
  $\Delta_{W^{(2)}} \leftarrow \Delta_{W^{(2)}} - \frac{1}{m} \tilde{H}^{(1)\top} \tilde{H}^{(2)}$
  $W^{(1)} \leftarrow W^{(1)} + \epsilon \Delta_{W^{(1)}}$ (this is a cartoon illustration; in practice use a more effective algorithm, such as momentum with a decaying learning rate)
  $W^{(2)} \leftarrow W^{(2)} + \epsilon \Delta_{W^{(2)}}$
end while
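For readers who prefer code to pseudocode, one learning step of Algorithm 20.1 might be rendered in NumPy roughly as follows. This is a sketch under the same simplifications (no biases, plain gradient ascent), with names of our choosing and sigmoid reused from the earlier sketches:

```python
def vsml_step(V, W1, W2, Vt, H1t, H2t, eps, k, rng, n_mf=10):
    """One variational SML step for a two-hidden-layer DBM.
    V: minibatch (m x n_v); Vt, H1t, H2t: persistent fantasy particles.
    Updates W1, W2 in place and returns the new particles."""
    m = V.shape[0]
    # Mean field inference (positive phase), cf. Eqs. 20.33-20.34.
    H1 = np.full((m, W1.shape[1]), 0.5)
    H2 = np.full((m, W2.shape[1]), 0.5)
    for _ in range(n_mf):
        H1 = sigmoid(V @ W1 + H2 @ W2.T)
        H2 = sigmoid(H1 @ W2)
    dW1 = V.T @ H1 / m
    dW2 = H1.T @ H2 / m
    # Persistent Gibbs sampling (negative phase), two blocks per sweep.
    for _ in range(k):
        Vt = (rng.random(Vt.shape) < sigmoid(H1t @ W1.T)).astype(float)
        H2t = (rng.random(H2t.shape) < sigmoid(H1t @ W2)).astype(float)
        H1t = (rng.random(H1t.shape)
               < sigmoid(Vt @ W1 + H2t @ W2.T)).astype(float)
    dW1 -= Vt.T @ H1t / m
    dW2 -= H1t.T @ H2t / m
    W1 += eps * dW1   # in practice: momentum, decaying step size, etc.
    W2 += eps * dW2
    return Vt, H1t, H2t
```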
The
DBM may then be trained with PCD. Typically PCD training will make only a
small change in the model’s parameters and its performance as measured by the
log-likelihood it assigns to the data, or its ability to classify inputs. See Fig. 20.4
for an illustration of the training procedure.
This greedy layer-wise training procedure is not just coordinate ascent. It bears
some passing resemblance to coordinate ascent because we optimize one subset of
the parameters at each step. However, in the case of the greedy layer-wise training
procedure, we actually use a different objective function at each step.
Greedy layer-wise pretraining of a DBM differs from greedy layer-wise pre-
training of a DBN. The parameters of each individual RBM may be copied to
the corresponding DBN directly. In the case of the DBM, the RBM parameters
must be modified before inclusion in the DBM. A layer in the middle of the stack
of RBMs is trained with only bottom-up input, but after the stack is combined
to form the DBM, the layer will have both bottom-up and top-down input. To
account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the
weights of all but the top and bottom RBM in half before inserting them into the
DBM. Additionally, the bottom RBM must be trained using two “copies” of each
visible unit and the weights tied to be equal between the two copies. This means
that the weights are effectively doubled during the upward pass. Similarly, the top
RBM should be trained with two copies of the topmost layer.
Obtaining state-of-the-art results with the deep Boltzmann machine requires
a modification of the standard SML algorithm, which is to use a small amount of
mean field during the negative phase of the joint PCD training step (Salakhutdinov
and Hinton, 2009a). Specifically, the expectation of the energy gradient should
be computed with respect to the mean field distribution in which all of the units
are independent from each other. The parameters of this mean field distribution
should be obtained by running the mean field fixed point equations for just one
step. See Goodfellow et al. (2013b) for a comparison of the performance of centered
DBMs with and without the use of partial mean field in the negative phase.
20.4.5 Jointly Training Deep Boltzmann Machines
Classic DBMs require greedy unsupervised pretraining, and to perform classification
well, require a separate MLP-based classifier on top of the hidden features they
extract. This has some undesirable properties. It is hard to track performance
during training because we cannot evaluate properties of the full DBM while
training the first RBM. Thus, it is hard to tell how well our hyperparameters
Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by using CD to approximately maximize $\log P(v)$. (b) Train a second RBM that models $h^{(1)}$ and target class $y$ by using CD-$k$ to approximately maximize $\log P(h^{(1)}, y)$, where $h^{(1)}$ is drawn from the first RBM's posterior conditioned on the data. Increase $k$ from 1 to 20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately maximize $\log P(v, y)$ using stochastic maximum likelihood with $k = 5$. (d) Delete $y$ from the model. Define a new set of features $h^{(1)}$ and $h^{(2)}$ that are obtained by running mean field inference in the model lacking $y$. Use these features as input to an MLP whose structure is the same as an additional pass of mean field, with an additional output layer for the estimate of $y$. Initialize the MLP's weights to be the same as the DBM's weights. Train the MLP to approximately maximize $\log P(y \mid v)$ using stochastic gradient descent and dropout. Figure reprinted from Goodfellow et al. (2013b).
of DBMs need to have many different components for CD training of individual
RBMs, PCD training of the full DBM, and training based on back-propagation
through the MLP. Finally, the MLP on top of the Boltzmann machine loses many
of the advantages of the Boltzmann machine probabilistic model, such as being
able to perform inference when some input values are missing.
There are two main ways to resolve the joint training problem of the deep
Boltzmann machine. The first is the centered deep Boltzmann machine (Montavon
and Muller, 2012), which reparametrizes the model in order to make the Hessian of
the cost function better-conditioned at the beginning of the learning process. This
yields a model that can be trained without a greedy layer-wise pretraining stage.
The resulting model obtains excellent test set log-likelihood and produces high
quality samples. Unfortunately, it remains unable to compete with appropriately
regularized MLPs as a classifier. The second way to jointly train a deep Boltzmann
machine is to use a multi-prediction deep Boltzmann machine (Goodfellow et al.,
2013b). This model uses an alternative training criterion that allows the use
of the back-propagation algorithm in order to avoid the problems with MCMC
estimates of the gradient. Unfortunately, the new criterion does not lead to good
likelihood or samples, but, compared to the MCMC approach, it does lead to
superior classification performance and ability to reason well about missing inputs.
The centering trick for the Boltzmann machine is easiest to describe if we return to the general view of a Boltzmann machine as consisting of a set of units $x$ with a weight matrix $U$ and biases $b$. Recall from Eq. 20.2 that the energy function is given by
\[ E(x) = -x^\top U x - b^\top x. \tag{20.36} \]
Using different sparsity patterns in the weight matrix $U$, we can implement structures of Boltzmann machines, such as RBMs, or DBMs with different numbers of layers. This is accomplished by partitioning $x$ into visible and hidden units and zeroing out elements of $U$ for units that do not interact. The centered Boltzmann machine introduces a vector $\mu$ that is subtracted from all of the states:
\[ E'(x; U, b) = -(x - \mu)^\top U (x - \mu) - (x - \mu)^\top b. \tag{20.37} \]
Typically $\mu$ is a hyperparameter fixed at the beginning of training. It is usually chosen to make sure that $x - \mu \approx 0$ when the model is initialized. This
reparametrization does not change the set of probability distributions that the
model can represent, but it does change the dynamics of stochastic gradient descent
applied to the likelihood. Specifically, in many cases, this reparametrization results
in a Hessian matrix that is better conditioned. Melchior et al. (2013) experimentally
confirmed that the conditioning of the Hessian matrix improves, and observed
that the centering trick is equivalent to another Boltzmann machine learning
technique, the enhanced gradient (Cho et al., 2011). The improved conditioning of
the Hessian matrix allows learning to succeed, even in difficult cases like training a
deep Boltzmann machine with multiple layers.
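A sketch of the centered energy of Eq. 20.37; here mu is a fixed vector, for example the data marginals for visible units and 0.5 for hidden units (our illustrative choice, not prescribed by the text):

```python
def centered_energy(x, U, b, mu):
    # E'(x; U, b) = -(x - mu)^T U (x - mu) - (x - mu)^T b   (Eq. 20.37)
    d = x - mu
    return -d @ U @ d - d @ b
```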
The other approach to jointly training deep Boltzmann machines is the multi-
prediction deep Boltzmann machine (MP-DBM) which works by viewing the mean
field equations as defining a family of recurrent networks for approximately solving
every possible inference problem (Goodfellow et al., 2013b). Rather than training
the model to maximize the likelihood, the model is trained to make each recurrent
network obtain an accurate answer to the corresponding inference problem. The
training process is illustrated in Fig. 20.5. It consists of randomly sampling a
training example, randomly sampling a subset of inputs to the inference network,
and then training the inference network to predict the values of the remaining
units.
This general principle of back-propagating through the computational graph
for approximate inference has been applied to other models (Stoyanov et al., 2011;
Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not
the lower bound on the likelihood. Instead, the final loss is typically based on
the approximate conditional distribution that the approximate inference network
imposes over the missing values. This means that the training of these models
is somewhat heuristically motivated. If we inspect the $p(v)$ represented by the
Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective,
in the sense that Gibbs sampling yields poor samples.
Back-propagation through the inference graph has two main advantages. First,
it trains the model as it is really used—with approximate inference. This means
that approximate inference, for example, to fill in missing inputs, or to perform
classification despite the presence of missing inputs, is more accurate in the MP-
DBM than in the original DBM. The original DBM does not make an accurate
classifier on its own; the best classification results with the original DBM were
based on training a separate classifier to use features extracted by the DBM,
rather than by using inference in the DBM to compute the distribution over the
class labels. Mean field inference in the MP-DBM performs well as a classifier
without special modifications. The other advantage of back-propagating through
approximate inference is that back-propagation computes the exact gradient of
the loss. This is better for optimization than the approximate gradients of SML
training, which suffer from both bias and variance. This probably explains why MP-
DBMs may be trained jointly while DBMs require a greedy layer-wise pretraining.
Figure 20.5: An illustration of the multi-prediction training process for a deep Boltzmann
machine. Each row indicates a different example within a minibatch for the same training
step. Each column represents a time step within the mean field inference process. For
each example, we sample a subset of the data variables to serve as inputs to the inference
process. These variables are shaded black to indicate conditioning. We then run the
mean field inference process, with arrows indicating which variables influence which other
variables in the process. In practical applications, we unroll mean field for several steps.
In this illustration, we unroll for only two steps. Dashed arrows indicate how the process
could be unrolled for more steps. The data variables that were not used as inputs to the
inference process become targets, shaded in gray. We can view the inference process for
each example as a recurrent network. We use gradient descent and back-propagation to
train these recurrent networks to produce the correct targets given their inputs. This
trains the mean field process for the MP-DBM to produce accurate estimates. Figure
adapted from Goodfellow et al. (2013b).
The disadvantage of back-propagating through the approximate inference graph is
that it does not provide a way to optimize the log-likelihood, but rather a heuristic
approximation of the generalized pseudolikelihood.
The MP-DBM inspired the NADE-$k$ (Raiko et al., 2014) extension to the NADE framework, which is described in Sec. 20.10.10.
The MP-DBM has some connections to dropout. Dropout shares the same pa-
rameters among many different computational graphs, with the difference between
each graph being whether it includes or excludes each unit. The MP-DBM also
shares parameters across many computational graphs. In the case of the MP-DBM,
the difference between the graphs is whether each input unit is observed or not.
When a unit is not observed, the MP-DBM does not delete it entirely as in the
case of dropout. Instead, the MP-DBM treats it as a latent variable to be inferred.
One could imagine applying dropout to the MP-DBM by additionally removing
some units rather than making them latent.
20.5 Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data,
many applications such as image and audio modeling seem to require the ability
to represent probability distributions over real values. In some cases, it is possible
to treat real-valued data in the interval [0, 1] as representing the expectation of a
binary variable. For example, Hinton (2000) treats grayscale images in the training
set as defining [0,1] probability values. Each pixel defines the probability of a
binary value being 1, and the binary pixels are all sampled independently from
each other. This is a common procedure for evaluating binary models on grayscale
image datasets. However, it is not a particularly theoretically satisfying approach,
and binary images sampled independently in this way have a noisy appearance. In
this section, we present Boltzmann machines that define a probability density over
real-valued data.
20.5.1 Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family
conditional distributions (Welling et al., 2005). Of these, the most common is the
RBM with binary hidden units and real-valued visible units, with the conditional
distribution over the visible units being a Gaussian distribution whose mean is a
function of the hidden units.
There are many ways of parametrizing Gaussian-Bernoulli RBMs. First, we may
choose whether to use a covariance matrix or a precision matrix for the Gaussian
distribution. Here we present the precision formulation. The modification to obtain
the covariance formulation is straightforward. We wish to have the conditional
distribution
\[ p(v \mid h) = \mathcal{N}(v; W h, \beta^{-1}). \tag{20.38} \]
We can find the terms we need to add to the energy function by expanding the
unnormalized log conditional distribution:
\[ \log \mathcal{N}(v; W h, \beta^{-1}) = -\frac{1}{2} (v - W h)^\top \beta (v - W h) + f(\beta). \tag{20.39} \]
Here $f$ encapsulates all the terms that are a function only of the parameters and not the random variables in the model. We can discard $f$ because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.
If we include all of the terms (with their sign flipped) involving $v$ from Eq. 20.39 in our energy function and do not add any other terms involving $v$, then our energy function will represent the desired conditional $p(v \mid h)$.
We have some freedom regarding the other conditional distribution, $p(h \mid v)$. Note that Eq. 20.39 contains a term
\[ \frac{1}{2} h^\top W^\top \beta W h. \tag{20.40} \]
This term cannot be included in its entirety because it includes $h_i h_j$ terms. These correspond to edges between the hidden units. If we included these terms, we would have a linear factor model instead of a restricted Boltzmann machine. When designing our Boltzmann machine, we simply omit these $h_i h_j$ cross terms.
Omitting them does not change the conditional $p(v \mid h)$, so Eq. 20.39 is still respected. However, we still have a choice about whether to include the terms involving only a single $h_i$. If we assume a diagonal precision matrix, we find that for each hidden unit $h_i$ we have a term
\[ \frac{1}{2} h_i \sum_j \beta_j W^2_{j,i}. \tag{20.41} \]
In the above, we used the fact that $h_i^2 = h_i$ because $h_i \in \{0, 1\}$. If we include this term (with its sign flipped) in the energy function, then it will naturally bias $h_i$ to be turned off when the weights for that unit are large and connected to visible
units with high precision. The choice of whether or not to include this bias term
does not affect the family of distributions the model can represent (assuming that
we include bias parameters for the hidden units) but it does affect the learning
dynamics of the model. Including the term may help the hidden unit activations
remain reasonable even when the weights rapidly increase in magnitude.
One way to define the energy function on a Gaussian-Bernoulli RBM is thus
\[ E(v, h) = \frac{1}{2} v^\top (\beta \odot v) - (v \odot \beta)^\top W h - b^\top h, \tag{20.42} \]
but we may also add extra terms or parametrize the energy in terms of the variance
rather than precision if we choose.
In this derivation, we have not included a bias term on the visible units, but one
could easily be added. One final source of variability in the parametrization of a
Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix. It may
either be fixed to a constant (perhaps estimated based on the marginal precision
of the data) or learned. It may also be a scalar times the identity matrix, or it
may be a diagonal matrix. Typically we do not allow the precision matrix to be
non-diagonal in this context, because some operations would then require inverting
the matrix. In the sections ahead, we will see that other forms of Boltzmann
machines permit modeling the covariance structure, using various techniques to
avoid inverting the precision matrix.
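As an illustration, the energy of Eq. 20.42 and the Gaussian conditional of Eq. 20.38 with a diagonal precision stored as a vector beta might be sketched as follows (names are ours; numpy as np assumed from the earlier sketches):

```python
def gb_rbm_energy(v, h, W, b, beta):
    # E(v, h) = 1/2 v^T (beta * v) - (v * beta)^T W h - b^T h   (Eq. 20.42)
    return 0.5 * v @ (beta * v) - (v * beta) @ (W @ h) - b @ h

def sample_v_given_h_gaussian(h, W, beta, rng):
    # p(v | h) = N(v; W h, beta^{-1}), diagonal precision   (Eq. 20.38)
    return rng.normal(W @ h, np.sqrt(1.0 / beta))
```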
20.5.2 Undirected Models of Conditional Covariance
While the Gaussian RBM has been the canonical energy model for real-valued
data, Ranzato et al. (2010a) argue that the Gaussian RBM inductive bias is not
well suited to the statistical variations present in some types of real-valued data,
especially natural images. The problem is that much of the information content
present in natural images is embedded in the covariance between pixels rather than
in the raw pixel values. In other words, it is the relationships between pixels and
not their absolute values where most of the useful information in images resides.
Since the Gaussian RBM only models the conditional mean of the input given the
hidden units, it cannot capture conditional covariance information. In response
to these criticisms, alternative models have been proposed that attempt to better
account for the covariance of real-valued data. These models include the mean and covariance RBM (mcRBM¹), the mean-product of $t$-distributions (mPoT) model, and the spike and slab RBM (ssRBM).

¹ The term "mcRBM" is pronounced by saying the name of the letters M-C-R-B-M; the "mc" is not pronounced like the "Mc" in "McDonald's."
Mean and Covariance RBM
The mcRBM uses its hidden units to indepen-
dently encode the conditional mean and covariance of all observed units. The
mcRBM hidden layer is divided into two groups of units: mean units and covariance
units. The group that models the conditional mean is simply a Gaussian RBM.
The other half is a covariance RBM (Ranzato et al., 2010a), also called a cRBM,
whose components model the conditional covariance structure, as described below.
Specifically, with binary mean units $h^{(m)}$ and binary covariance units $h^{(c)}$, the mcRBM model is defined as the combination of two energy functions:
\[ E_{\text{mc}}(x, h^{(m)}, h^{(c)}) = E_m(x, h^{(m)}) + E_c(x, h^{(c)}), \tag{20.43} \]
where $E_m$ is the standard Gaussian-Bernoulli RBM energy function:²
\[ E_m(x, h^{(m)}) = \frac{1}{2} x^\top x - \sum_j x^\top W_{:,j} h^{(m)}_j - \sum_j b^{(m)}_j h^{(m)}_j, \tag{20.44} \]
and $E_c$ is the cRBM energy function that models the conditional covariance information:
\[ E_c(x, h^{(c)}) = \frac{1}{2} \sum_j h^{(c)}_j \left( x^\top r^{(j)} \right)^2 - \sum_j b^{(c)}_j h^{(c)}_j. \tag{20.45} \]
The parameter $r^{(j)}$ corresponds to the covariance weight vector associated with $h^{(c)}_j$, and $b^{(c)}$ is a vector of covariance offsets. The combined energy function defines a joint distribution:
\[ p_{\text{mc}}(x, h^{(m)}, h^{(c)}) = \frac{1}{Z} \exp\left\{ -E_{\text{mc}}(x, h^{(m)}, h^{(c)}) \right\}, \tag{20.46} \]
and a corresponding conditional distribution over the observations given $h^{(m)}$ and $h^{(c)}$ as a multivariate Gaussian distribution:
\[ p_{\text{mc}}(x \mid h^{(m)}, h^{(c)}) = \mathcal{N}\left( C^{\text{mc}}_{x \mid h} \sum_j W_{:,j} h^{(m)}_j,\;\; C^{\text{mc}}_{x \mid h} \right). \tag{20.47} \]
Note that the covariance matrix
C
mc
x|h
=
P
j
h
j
r
(j)
r
(j)>
+ I
1
is non-diagonal
and that
W
is the weight matrix associated with the Gaussian RBM modeling the
² This version of the Gaussian-Bernoulli RBM energy function assumes the image data has zero mean, per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.
It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of its non-diagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of x, h^{(m)}, h^{(c)}, which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. However, in the mcRBM, sampling from p^{mc}(x \mid h^{(m)}, h^{(c)}) requires computing (C^{mc})^{-1} at every iteration of learning. This can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional p^{mc}(x \mid h^{(m)}, h^{(c)}) by sampling directly from the marginal p(x) using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.
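The following NumPy sketch makes the cost explicit: it assembles the conditional covariance of Eq. 20.47 and draws one sample of x given the hidden units. The dimensions and parameter values are illustrative, but the cubic-cost inversion per sample is exactly the burden described above.

import numpy as np

rng = np.random.default_rng(0)
d, n_mean, n_cov = 100, 32, 32                    # illustrative sizes
W = 0.01 * rng.normal(size=(d, n_mean))           # mean-unit weights
R = 0.01 * rng.normal(size=(d, n_cov))            # columns are r^(j)
h_m = rng.integers(0, 2, size=n_mean)             # binary mean units
h_c = rng.integers(0, 2, size=n_cov)              # binary covariance units

# C^{mc}_{x|h} = (sum_j h^(c)_j r^(j) r^(j)T + I)^{-1}: an O(d^3) inversion.
precision = (R * h_c) @ R.T + np.eye(d)
C = np.linalg.inv(precision)                      # the expensive step

mean = C @ (W @ h_m)                              # conditional mean, Eq. 20.47
x = rng.multivariate_normal(mean, C)              # one Gibbs-style draw of x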
Mean-Product of Student’s t-distributions
The mean-product of Student's t-distributions (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM. This is achieved by including nonzero Gaussian means via the addition of Gaussian RBM-like hidden units. Like the mcRBM, the mPoT conditional distribution over the observation is a multivariate Gaussian (with non-diagonal covariance) distribution; however, unlike the mcRBM, the complementary conditional distribution over the hidden variables is given by conditionally independent Gamma distributions. The Gamma distribution \mathcal{G}(k, \theta) is a probability distribution over positive real numbers, with mean k\theta. It is not necessary to have a more detailed understanding of the Gamma distribution to understand the basic ideas underlying the mPoT model.

The mPoT energy function is

E^{mPoT}(x, h^{(m)}, h^{(c)})  (20.48)
= E^{m}(x, h^{(m)}) + \sum_j \left( h^{(c)}_j \left( 1 + \frac{1}{2} \left( r^{(j)\top} x \right)^2 \right) + (1 - \gamma_j) \log h^{(c)}_j \right),  (20.49)

where r^{(j)} is the covariance weight vector associated with unit h^{(c)}_j and E^{m}(x, h^{(m)}) is as defined in Eq. 20.44.
Just as with the mcRBM, the mPoT model energy function specifies a multivariate Gaussian conditional distribution over x with non-diagonal covariance. The covariance units h^{(c)} are conditionally Gamma-distributed:

p^{mPoT}\left( h^{(c)}_j \mid x \right) = \mathcal{G}\left( \gamma_j ,\; 1 + \frac{1}{2} \left( r^{(j)\top} x \right)^2 \right).  (20.50)
Learning in the mPoT model, as in the mcRBM, is complicated by the inability to sample from the non-diagonal Gaussian conditional p^{mPoT}(x \mid h^{(m)}, h^{(c)}), so Ranzato et al. (2010b) also advocate direct sampling of p(x) via Hamiltonian (hybrid) Monte Carlo.
Spike and Slab Restricted Boltzmann Machines
Spike and slab restricted Boltzmann machines (Courville et al., 2011), or ssRBMs, provide another means of modeling the covariance structure of real-valued data. Compared to mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods. As a model of natural images, the ssRBM is interesting in that, like the mcRBM and the mPoT model, its binary hidden units encode the conditional covariance across pixels through the use of auxiliary real-valued variables.

The spike and slab RBM has two sets of hidden units: binary spike units h, and real-valued slab units s. The mean of the visible units conditioned on the hidden units is given by (h \odot s) W^\top. In other words, each column W_{:,i} defines a component that can appear in the input when h_i = 1. The corresponding spike variable h_i determines whether that component is present at all. The corresponding slab variable s_i determines the intensity of that component, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by W_{:,i}. This allows us to model the covariance of the inputs. Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable. There is no need to invert any matrix.
Formally, the ssRBM model is defined via its energy function:

E^{ss}(x, s, h) = - \sum_i x^\top W_{:,i} s_i h_i + \frac{1}{2} x^\top \left( \Lambda + \sum_i \Phi_i h_i \right) x  (20.51)
+ \frac{1}{2} \sum_i \alpha_i s_i^2 - \sum_i \alpha_i \mu_i s_i h_i - \sum_i b_i h_i + \sum_i \alpha_i \mu_i^2 h_i,  (20.52)
where b_i is the offset of the spike h_i and \Lambda is a diagonal precision matrix on the observations x. The parameter \alpha_i > 0 is a scalar precision parameter for the real-valued slab variable s_i. The parameter \Phi_i is a non-negative diagonal matrix that defines an h-modulated quadratic penalty on x. Each \mu_i is a mean parameter for the slab variable s_i.
With the joint distribution defined via the energy function, it is relatively straightforward to derive the ssRBM conditional distributions. For example, by marginalizing out the slab variables s, the conditional distribution over the observations given the binary spike variables h is given by

p^{ss}(x \mid h) = \frac{1}{P(h)} \frac{1}{Z} \int \exp\left\{ -E(x, s, h) \right\} ds  (20.53)
= \mathcal{N}\left( x ;\; C^{ss}_{x \mid h} \sum_i W_{:,i} \mu_i h_i ,\; C^{ss}_{x \mid h} \right),  (20.54)
where C^{ss}_{x \mid h} = \left( \Lambda + \sum_i \Phi_i h_i - \sum_i \alpha_i^{-1} h_i W_{:,i} W_{:,i}^\top \right)^{-1}. The last equality holds only if the covariance matrix C^{ss}_{x \mid h} is positive definite.
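A minimal NumPy sketch, with illustrative parameter values and Φ_i simplified to a scalar times the identity, shows how active spikes pinch the precision matrix along the columns of W and how a bad parameter setting can violate positive definiteness:

import numpy as np

rng = np.random.default_rng(1)
d, n_hidden = 20, 10
W = rng.normal(size=(d, n_hidden))
Lam = np.eye(d)                                  # diagonal precision Λ
phi = 0.1 * np.ones(n_hidden)                    # Φ_i taken as φ_i I for simplicity
h = rng.integers(0, 2, size=n_hidden)            # spike activations

# Precision of x | h: Λ + Σ_i Φ_i h_i − Σ_i α_i^{-1} h_i W_{:,i} W_{:,i}ᵀ.
# Active spikes subtract rank-one terms, pinching the precision along W_{:,i}.
for alpha in (50.0, 2.0):                        # slab precision α_i, shared here
    pinch = (W * (h / alpha)) @ W.T
    precision = Lam + (phi @ h) * np.eye(d) - pinch
    ok = np.linalg.eigvalsh(precision).min() > 0
    print(f"alpha={alpha}: positive definite = {ok}")
# Low slab precision (small α_i) can make the matrix indefinite, in which case
# the model distribution is no longer normalizable (the pitfall noted below).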
Gating by the spike variables means that the true marginal distribution over h \odot s is sparse. This is different from sparse coding, where samples from the model “almost never” (in the measure-theoretic sense) contain zeros in the code, and MAP inference is required to impose sparsity.
Comparing the ssRBM to the mcRBM and the mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as \left( \sum_j h^{(c)}_j r^{(j)} r^{(j)\top} + I \right)^{-1}, using the activation of the hidden units h_j > 0 to enforce constraints on the conditional covariance in the direction r^{(j)}. In contrast, the ssRBM specifies the conditional covariance of the observations using the hidden spike activations h_i = 1 to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is very similar to that given by a different model: the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by \Lambda^{-1}) only in the selected directions of the sparsely activated h_i. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction in observation space requires removing potentially all constraints with positive projection in that direction. This suggests that these models are less well suited to the overcomplete setting.
The primary disadvantage of the spike and slab restricted Boltzmann machine
is that some settings of the parameters can correspond to a covariance matrix
that is not positive definite. Such a covariance matrix places more unnormalized
probability on values that are farther from the mean, causing the integral over
all possible outcomes to diverge. Generally this issue can be avoided with simple
heuristic tricks. There is not yet any theoretically satisfying solution. Using
constrained optimization to explicitly avoid the regions where the probability is
undefined is difficult to do without being overly conservative and also preventing
the model from accessing high-performing regions of parameter space.
Qualitatively, convolutional variants of the ssRBM produce excellent samples
of natural images. Some examples are shown in Fig. 16.1.
The ssRBM allows for several extensions. Including higher-order interactions
and average-pooling of the slab variables (Courville et al., 2014) enables the model
to learn excellent features for a classifier when labeled data is scarce. Adding a
term to the energy function that prevents the partition function from becoming
undefined results in a sparse coding model, spike and slab sparse coding (Goodfellow
et al., 2013d), also known as S3C.
20.6 Convolutional Boltzmann Machines
As seen in Chapter 9, extremely high dimensional inputs such as images place
great strain on the computation, memory and statistical requirements of machine
learning models. Replacing matrix multiplication by discrete convolution with a
small kernel is the standard way of solving these problems for inputs that have
translation invariant spatial or temporal structure. Desjardins and Bengio (2008)
showed that this approach works well when applied to RBMs.
Deep convolutional networks usually require a pooling operation so that the
spatial size of each successive layer decreases. Feedforward convolutional networks
often use a pooling function such as the maximum of the elements to be pooled.
It is unclear how to generalize this to the setting of energy-based models. We could introduce a binary pooling unit p over n binary detector units d and enforce p = \max_i d_i by setting the energy function to be \infty whenever that constraint is violated. This does not scale well though, as it requires evaluating 2^n different energy configurations to compute the normalization constant. For a small 3 × 3 pooling region this requires 2^9 = 512 energy function evaluations per pooling unit!
Lee et al. (2009) developed a solution to this problem called probabilistic max
pooling (not to be confused with “stochastic pooling,” which is a technique for
implicitly constructing ensembles of convolutional feedforward networks). The
strategy behind probabilistic max pooling is to constrain the detector units so at most one may be active at a time. This means there are only n + 1 total states (one state for each of the n detector units being on, and an additional state corresponding to all of the detector units being off). The pooling unit is on if and only if one of the detector units is on. The state with all units off is assigned energy zero. We can think of this as describing a model with a single variable that has n + 1 states, or equivalently as a model that has n + 1 variables and assigns energy \infty to all but n + 1 joint assignments of variables.
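As a concreteness check, the short sketch below enumerates the allowed joint states for one pooling unit and contrasts their count with the 2^n configurations the naive pooled energy model would need:

import itertools

n = 9  # detector units in a 3 × 3 pooling region

# Naive pooled energy model: the normalization constant requires summing
# over every joint configuration of the detector units.
naive_states = list(itertools.product([0, 1], repeat=n))
print(len(naive_states))                 # 2**9 = 512 configurations

# Probabilistic max pooling: at most one detector unit may be on, so the
# only permitted states are "all off" plus the n one-hot configurations.
valid_states = [tuple(int(i == j) for i in range(n)) for j in range(n)]
valid_states.append((0,) * n)
print(len(valid_states))                 # n + 1 = 10 states
pool = [max(d) for d in valid_states]    # pooling unit p = max_i d_i per state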
While efficient, probabilistic max pooling does force the detector units to be
mutually exclusive, which may be a useful regularizing constraint in some contexts
or a harmful limit on model capacity in other contexts. It also does not support
overlapping pooling regions. Overlapping pooling regions are usually required
to obtain the best performance from feedforward convolutional networks, so this
constraint probably greatly reduces the performance of convolutional Boltzmann
machines.
Lee et al. (2009) demonstrated that probabilistic max pooling could be used to build convolutional deep Boltzmann machines.³ This model is able to perform operations such as filling in missing portions of its input. While intellectually appealing, this model is challenging to make work in practice, and usually does not perform as well, when used as a classifier, as traditional convolutional networks trained with supervised learning.
Many convolutional models work equally well with inputs of many different
spatial sizes. For Boltzmann machines, it is difficult to change the input size
for a variety of reasons. The partition function changes as the size of the input
changes. Moreover, many convolutional networks achieve size invariance by scaling
up the size of their pooling regions proportional to the size of the input, but scaling
Boltzmann machine pooling regions is awkward. Traditional convolutional neural
networks can use a fixed number of pooling units and dynamically increase the
size of their pooling regions in order to obtain a fixed-size representation of a
variable-sized input. For Boltzmann machines, large pooling regions become too
expensive for the naive approach. The approach of Lee et al. (2009) of making
each of the detector units in the same pooling region mutually exclusive solves
the computational problems, but still does not allow variable-size pooling regions.
For example, suppose we learn a model with 2 × 2 probabilistic max pooling over detector units that learn edge detectors. This enforces the constraint that only one of these edges may appear in each 2 × 2 region. If we then increase the size of the input image by 50% in each direction, we would expect the number of edges to increase correspondingly. Instead, if we increase the size of the pooling regions by 50% in each direction to 3 × 3, then the mutual exclusivity constraint specifies that each of these edges may appear only once in a 3 × 3 region. As we grow a model's input image in this way, the model generates edges with less density.
Of course, these issues only arise when the model must use variable amounts of
pooling in order to emit a fixed-size output vector. Models that use probabilistic
max pooling may still accept variable-sized input images so long as the output of
the model is a feature map that can scale in size proportional to the input image.
³ The publication describes the model as a “deep belief network,” but because it can be described as a purely undirected model with tractable layer-wise mean field fixed point updates, it best fits the definition of a deep Boltzmann machine.

Pixels at the boundary of the image also pose some difficulty, which is exacerbated by the fact that connections in a Boltzmann machine are symmetric. If
we do not implicitly zero-pad the input, then there are fewer hidden units than
visible units, and the visible units at the boundary of the image are not modeled
well because they lie in the receptive field of fewer hidden units. However, if we do
implicitly zero-pad the input, then the hidden units at the boundary are driven by
fewer input pixels, and may fail to activate when needed.
20.7 Boltzmann Machines for Structured or Sequential Outputs

In the structured output scenario, we wish to train a model that can map from some input x to some output y, and the different entries of y are related to each other and must obey some constraints. For example, in the speech synthesis task, y is a waveform, and the entire waveform must sound like a coherent utterance.

A natural way to represent the relationships between the entries in y is to use a probability distribution p(y \mid x). Boltzmann machines, extended to model conditional distributions, can supply this probabilistic model.

The same tool of conditional modeling with a Boltzmann machine can be used not just for structured output tasks, but also for sequence modeling. In the latter case, rather than mapping an input x to an output y, the model must estimate a probability distribution over a sequence of variables, p(x^{(1)}, \ldots, x^{(\tau)}). Conditional Boltzmann machines can represent factors of the form p(x^{(t)} \mid x^{(1)}, \ldots, x^{(t-1)}) in order to accomplish this task.
An important sequence modeling task for the video game and film industry is modeling sequences of joint angles of skeletons used to render 3-D characters. These sequences are often collected using motion capture systems to record the movements of actors. A probabilistic model of a character's movement allows the generation of new, previously unseen, but realistic animations. To solve this sequence modeling task, Taylor et al. (2007) introduced a conditional RBM modeling p(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(t-m)}) for small m. The model is an RBM over p(x^{(t)}) whose bias parameters are a linear function of the preceding m values of x. When we condition on different values of x^{(t-1)} and earlier variables, we get a new RBM over x. The weights in the RBM over x never change, but by conditioning on different past values, we can change the probability of different hidden units in the RBM being active. By activating and deactivating different subsets of hidden units, we can make large changes to the probability distribution induced on x. Other variants of conditional RBMs (Mnih et al., 2011) and other variants of sequence modeling using conditional RBMs are possible (Taylor and Hinton, 2009; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012).
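The defining trick is easy to state in code. The sketch below is a schematic of the idea rather than the exact parametrization of Taylor et al. (2007): the dynamic biases of the RBM at time t are its static biases plus a learned linear function of the m previous frames, and everything else about the RBM is unchanged.

import numpy as np

rng = np.random.default_rng(0)
d, n_hidden, m = 12, 30, 3                     # frame size, hidden units, history
b_v = np.zeros(d)                              # static visible biases
b_h = np.zeros(n_hidden)                       # static hidden biases
A = [0.01 * rng.normal(size=(d, d)) for _ in range(m)]        # past -> visible bias
B = [0.01 * rng.normal(size=(n_hidden, d)) for _ in range(m)] # past -> hidden bias

def dynamic_biases(history):
    """history[k] is frame x^(t-1-k); returns the biases of the RBM over x^(t)."""
    bv_t = b_v + sum(A[k] @ history[k] for k in range(m))
    bh_t = b_h + sum(B[k] @ history[k] for k in range(m))
    return bv_t, bh_t

history = [rng.normal(size=d) for _ in range(m)]
bv_t, bh_t = dynamic_biases(history)  # plug into the usual RBM energy for x^(t)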
Another sequence modeling task is to model the distribution over sequences of musical notes used to compose songs. Boulanger-Lewandowski et al. (2012) introduced the RNN-RBM sequence model and applied it to this task. The RNN-RBM is a generative model of a sequence of frames x^{(t)} consisting of an RNN that emits the RBM parameters for each time step. Unlike the model described above, the RNN emits all of the parameters of the RBM, including the weights. To train the model, we need to be able to back-propagate the gradient of the loss function through the RNN. The loss function is not applied directly to the RNN outputs. Instead, it is applied to the RBM. This means that we must approximately differentiate the loss with respect to the RBM parameters using contrastive divergence or a related algorithm. This approximate gradient may then be back-propagated through the RNN using the usual back-propagation through time algorithm.
20.8 Other Boltzmann Machines
Many other variants of Boltzmann machines are possible.
Boltzmann machines may be extended with different training criteria. We have focused on Boltzmann machines trained to approximately maximize the generative criterion \log p(v). It is also possible to train discriminative RBMs that aim to maximize \log p(y \mid v) instead (Larochelle and Bengio, 2008). This approach often performs best when using a linear combination of both the generative and discriminative criteria. Unfortunately, RBMs do not seem to be as powerful supervised learners as MLPs, at least using existing methodology.
Most Boltzmann machines used in practice have only second-order interactions in their energy functions, meaning that their energy functions are sums of many terms, each of which includes the product of only two random variables. An example of such a term is v_i W_{i,j} h_j. It is also possible to train higher-order Boltzmann machines (Sejnowski, 1987) whose energy function terms involve products of many variables. Three-way interactions between a hidden unit and two different images can model spatial transformations from one frame of video to the next (Memisevic and Hinton, 2007, 2010). Multiplication by a one-hot class variable can change the relationship between visible and hidden units depending on which class is present (Nair and Hinton, 2009). One recent example of the use of higher-order interactions is a Boltzmann machine with two groups of hidden units, one group that interacts with both the visible units v and the class label y, and another group that interacts only with the v input values (Luo et al., 2011). This can be interpreted as encouraging
some hidden units to learn to model the input using features that are relevant to the class, but also to learn extra hidden units that explain the nuisance details necessary for the samples of v to be realistic without determining the class of the example. Another use of higher-order interactions is to gate some features. Sohn et al. (2013) introduced a Boltzmann machine with third-order interactions with binary mask variables associated with each visible unit. When these masking variables are set to zero, they remove the influence of a visible unit on the hidden units. This allows visible units that are not relevant to the classification problem to be removed from the inference pathway that estimates the class.
More generally, the Boltzmann machine framework is a rich space of models
permitting many more model structures than have been explored so far. Developing
a new form of Boltzmann machine requires some more care and creativity than
developing a new neural network layer, because it is often difficult to find an energy
function that maintains tractability of all of the different conditional distributions
needed to use the Boltzmann machine, but despite this required effort the field
remains open to innovation.
20.9 Back-Propagation through Random Operations
Traditional neural networks implement a deterministic transformation of some input variables x. When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. One straightforward way to do this is to augment the neural network with extra inputs z that are sampled from some simple probability distribution, such as a uniform or Gaussian distribution. The neural network can then continue to perform deterministic computation internally, but the function f(x, z) will appear stochastic to an observer who does not have access to z. Provided that f is continuous and differentiable, we can then compute the gradients necessary for training using back-propagation as usual.

As an example, let us consider the operation consisting of drawing samples y from a Gaussian distribution with mean \mu and variance \sigma^2:

y \sim \mathcal{N}(\mu, \sigma^2).  (20.55)
Because an individual sample of y is produced not by a function but by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of y with respect to the parameters of its distribution, \mu and \sigma^2. However, we can rewrite the sampling process as transforming an underlying random value z \sim \mathcal{N}(z; 0, 1) to obtain a sample from the desired distribution:

y = \mu + \sigma z.  (20.56)

We are now able to back-propagate through the sampling operation by regarding it as a deterministic operation with an extra input z. Crucially, the extra input is a random variable whose distribution is not a function of any of the variables whose derivatives we want to calculate. The result tells us how an infinitesimal change in \mu or \sigma would change the output if we could repeat the sampling operation again with the same value of z.
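A few lines of NumPy illustrate the trick. Because y = \mu + \sigma z, the derivatives \partial y / \partial \mu = 1 and \partial y / \partial \sigma = z are available analytically, so we can form a Monte Carlo estimate of the gradient of an expected loss; the quadratic loss below is just a stand-in.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
z = rng.normal(size=100000)        # fixed source of randomness
y = mu + sigma * z                 # reparametrized samples of N(mu, sigma^2)

# Example loss J(y) = (y - 3)^2, with dJ/dy = 2 (y - 3).
dJ_dy = 2.0 * (y - 3.0)

# Chain rule through the deterministic map: dy/dmu = 1, dy/dsigma = z.
grad_mu = np.mean(dJ_dy * 1.0)     # estimates dE[J]/dmu = 2 (mu - 3)
grad_sigma = np.mean(dJ_dy * z)    # estimates dE[J]/dsigma = 2 sigma

print(grad_mu, 2 * (mu - 3.0))     # each pair of numbers should be close
print(grad_sigma, 2 * sigma)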
Being able to back-propagate through this sampling operation allows us to incorporate it into a larger graph. We can build elements of the graph on top of the output of the sampling distribution. For example, we can compute the derivatives of some loss function J(y). We can also build elements of the graph whose outputs are the inputs or the parameters of the sampling operation. For example, we could build a larger graph with \mu = f(x; \theta) and \sigma = g(x; \theta). In this augmented graph, we can use back-propagation through these functions to derive \nabla_\theta J(y).
The principle used in this Gaussian sampling example is more generally applicable. We can express any probability distribution of the form p(y; \theta) or p(y \mid x; \theta) as p(y \mid \omega), where \omega is a variable containing both the parameters \theta and, if applicable, the inputs x. Given a value y sampled from distribution p(y \mid \omega), where \omega may in turn be a function of other variables, we can rewrite

y \sim p(y \mid \omega)  (20.57)

as

y = f(z; \omega),  (20.58)

where z is a source of randomness. We may then compute the derivatives of y with respect to \omega using traditional tools such as the back-propagation algorithm applied to f, so long as f is continuous and differentiable almost everywhere. Crucially, \omega must not be a function of z, and z must not be a function of \omega. This technique is often called the reparametrization trick, stochastic back-propagation, or perturbation analysis.
In neural network applications, we typically choose z to be drawn from some simple distribution, such as a unit uniform or unit Gaussian distribution, and achieve more complex distributions by allowing the deterministic portion of the network to reshape its input.
The idea of propagating gradients or optimizing through stochastic operations
dates back to the mid-twentieth century (Price, 1958; Bonnet, 1964) and was
first used for machine learning in the context of reinforcement learning (Williams,
1992). More recently, it has been applied to variational approximations (Opper
and Archambeau, 2009) and stochastic or generative neural networks (Bengio
et al., 2013b; Kingma, 2013; Kingma and Welling, 2014b,a; Rezende et al., 2014;
Goodfellow et al., 2014c). Many networks, such as denoising autoencoders or
networks regularized with dropout, are also naturally designed to take noise
as an input without requiring any special reparametrization to make the noise
independent from the model.
20.9.1 Back-Propagating through Discrete Stochastic Operations
When a model emits a discrete variable y, the reparametrization trick is not applicable. Suppose that the model takes inputs x and parameters \theta, both encapsulated in the vector \omega, and combines them with random noise z to produce y:

y = f(z; \omega).  (20.59)

Because y is discrete, f must be a step function. The derivatives of a step function are not useful at any point. Right at each step boundary, the derivatives are undefined, but that is a small problem. The large problem is that the derivatives are zero almost everywhere, on the regions between step boundaries. The derivatives of any cost function J(y) therefore give no information about how to update the model parameters \theta.
The REINFORCE algorithm (REward Increment = Non-negative Factor × Offset Reinforcement × Characteristic Eligibility) provides a framework defining a family of simple but powerful solutions (Williams, 1992). The core idea is that even though J(f(z; \omega)) is a step function with useless derivatives, the expected cost E_{z \sim p(z)} J(f(z; \omega)) is often a smooth function amenable to gradient descent. Although that expectation is typically not tractable when y is high-dimensional (or is the result of the composition of many discrete stochastic decisions), it can be estimated without bias using a Monte Carlo average. The stochastic estimate of the gradient can be used with SGD or other stochastic gradient-based optimization techniques.
The simplest version of REINFORCE can be derived by simply differentiating the expected cost:

E_z[J(y)] = \sum_y J(y) p(y),  (20.60)

\frac{\partial E[J(y)]}{\partial \omega} = \sum_y J(y) \frac{\partial p(y)}{\partial \omega}  (20.61)

= \sum_y J(y) p(y) \frac{\partial \log p(y)}{\partial \omega}  (20.62)

\approx \frac{1}{m} \sum_{y^{(i)} \sim p(y),\; i=1}^{m} J(y^{(i)}) \frac{\partial \log p(y^{(i)})}{\partial \omega}.  (20.63)

Eq. 20.61 relies on the assumption that J does not reference \omega directly. It is trivial to extend the approach to relax this assumption. Eq. 20.62 exploits the derivative rule for the logarithm, \frac{\partial \log p(y)}{\partial \omega} = \frac{1}{p(y)} \frac{\partial p(y)}{\partial \omega}. Eq. 20.63 gives an unbiased Monte Carlo estimator of the gradient.
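The estimator of Eq. 20.63 is straightforward to implement. The sketch below applies it to a toy problem, a Bernoulli variable with parameter sigmoid(\omega) and a cost that prefers y = 1; the distribution and cost are illustrative choices, not part of the algorithm.

import numpy as np

rng = np.random.default_rng(0)

def J(y):                      # toy cost: prefers y = 1
    return np.where(y == 1, 0.0, 1.0)

omega = 0.0                    # p(y = 1) = sigmoid(omega)
for step in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-omega))
    y = (rng.random(64) < p1).astype(float)      # m = 64 samples of y
    # d log p(y) / d omega for a Bernoulli(sigmoid(omega)) variable:
    dlogp = y - p1
    grad = np.mean(J(y) * dlogp)                 # Eq. 20.63
    omega -= 0.5 * grad                          # SGD on the expected cost

print(1.0 / (1.0 + np.exp(-omega)))              # p(y = 1) approaches 1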
Anywhere we write p(y) in this section, one could equally write p(y \mid x). This is because p(y) is parametrized by \omega, and \omega contains both \theta and x, if x is present.
One issue with the simple REINFORCE estimator above is that it has very high variance, so that many samples of y must be drawn to obtain a good estimate of the gradient; equivalently, if only one sample is drawn, SGD converges very slowly and requires a smaller learning rate. The variance of the estimator can be reduced considerably using variance reduction methods (Wilson, 1984; L'Ecuyer, 1994). The idea is to modify the estimator so that its expected value remains unchanged but its variance is reduced. In the context of REINFORCE, the proposed variance reduction methods involve the computation of a baseline that is used to offset J(y). Note that any offset b(\omega) that does not depend on y would not change the expectation of the estimated gradient, because

E_{p(y)}\left[ \frac{\partial \log p(y)}{\partial \omega} \right] = \sum_y p(y) \frac{\partial \log p(y)}{\partial \omega}  (20.64)

= \sum_y \frac{\partial p(y)}{\partial \omega}  (20.65)

= \frac{\partial}{\partial \omega} \sum_y p(y) = \frac{\partial}{\partial \omega} 1 = 0,  (20.66)
which means that

E_{p(y)}\left[ (J(y) - b(\omega)) \frac{\partial \log p(y)}{\partial \omega} \right] = E_{p(y)}\left[ J(y) \frac{\partial \log p(y)}{\partial \omega} \right] - b(\omega) \, E_{p(y)}\left[ \frac{\partial \log p(y)}{\partial \omega} \right]  (20.67)

= E_{p(y)}\left[ J(y) \frac{\partial \log p(y)}{\partial \omega} \right].  (20.68)
Furthermore, we can obtain the optimal b(\omega) by computing the variance of (J(y) - b(\omega)) \frac{\partial \log p(y)}{\partial \omega} under p(y) and minimizing with respect to b(\omega). What we find is that this optimal baseline b^*(\omega)_i is different for each element \omega_i of the vector \omega:

b^*(\omega)_i = \frac{ E_{p(y)}\left[ J(y) \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right] }{ E_{p(y)}\left[ \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right] }.  (20.69)

The gradient estimator with respect to \omega_i then becomes

(J(y) - b(\omega)_i) \frac{\partial \log p(y)}{\partial \omega_i},  (20.70)
where b(\omega)_i estimates the above b^*(\omega)_i. The estimate b is usually obtained by adding extra outputs to the neural network and training the new outputs to estimate E_{p(y)}\left[ J(y) \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right] and E_{p(y)}\left[ \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 \right] for each element of \omega. These extra outputs can be trained with the mean squared error objective, using respectively J(y) \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 and \left( \frac{\partial \log p(y)}{\partial \omega_i} \right)^2 as targets when y is sampled from p(y), for a given \omega. The estimate b may then be recovered by substituting these estimates into Eq. 20.69. Mnih and Gregor (2014) preferred to use a single shared output (across all elements i of \omega) trained with the target J(y), using as baseline b(\omega) \approx E_{p(y)}[J(y)].
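Extending the earlier toy example with the simple shared baseline b(\omega) \approx E[J(y)] shows the variance reduction directly; this is a sketch of the idea, not the network-based estimator just described.

import numpy as np

rng = np.random.default_rng(0)
p1 = 0.8                                  # p(y = 1) under the current omega
y = (rng.random((5000, 64)) < p1).astype(float)
J = 1.0 - y                               # same toy cost as before
dlogp = y - p1                            # d log p(y) / d omega (Bernoulli)

plain = (J * dlogp).mean(axis=1)          # 5000 REINFORCE estimates, m = 64
b = J.mean()                              # shared baseline, roughly E[J(y)]
centered = ((J - b) * dlogp).mean(axis=1)

print(plain.mean(), centered.mean())      # the expectations agree ...
print(plain.var(), centered.var())        # ... but the baseline cuts variance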
Variance reduction methods have been introduced in the reinforcement learning context (Sutton et al., 2000; Weaver and Tao, 2001), generalizing previous work on the case of binary reward by Dayan (1990). See Bengio et al. (2013b), Mnih and Gregor (2014), Ba et al. (2014), Mnih et al. (2014), or Xu et al. (2015) for examples of modern uses of the REINFORCE algorithm with reduced variance in the context of deep learning. In addition to the use of an input-dependent baseline b(\omega), Mnih and Gregor (2014) found it helpful to rescale (J(y) - b(\omega)) during training by dividing it by its standard deviation, as estimated by a moving average. This acts as a kind of adaptive learning rate, countering the large variations in the magnitude of this quantity that occur over the course of training. Mnih and Gregor (2014) called this heuristic variance normalization.
REINFORCE-based estimators can be understood as estimating the gradient by correlating choices of y with corresponding values of J(y). If a good value of y is unlikely under the current parametrization, it might take a long time to obtain it by chance and receive the signal that this configuration should be reinforced.
20.10 Directed Generative Nets
As discussed in Chapter 16, directed graphical models make up a prominent class
of graphical models. While directed graphical models have been very popular
within the greater machine learning community, within the smaller deep learning
community they have until roughly 2013 been overshadowed by undirected models
such as the RBM.
In this section we review some of the standard directed graphical models that
have traditionally been associated with the deep learning community.
We have already described deep belief networks, which are a partially directed
model. We have also already described sparse coding models, which can be thought
of as shallow directed generative models. They are often used as feature learners
in the context of deep learning, though they tend to perform poorly at sample
generation and density estimation. We now describe a variety of deep, fully directed
models.
20.10.1 Sigmoid Belief Nets
Sigmoid belief networks (Neal, 1990) are a simple form of directed graphical model with a specific kind of conditional probability distribution. In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:

p(s_i) = \sigma\left( \sum_{j < i} W_{j,i} s_j + b_i \right).  (20.71)
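Ancestral sampling from Eq. 20.71 is a single pass through the units in topological order, as the NumPy sketch below shows for a fully connected weight matrix of illustrative size.

import numpy as np

rng = np.random.default_rng(0)
d = 8
W = np.triu(rng.normal(size=(d, d)), k=1)   # W[j, i] nonzero only for j < i
b = rng.normal(size=d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

s = np.zeros(d)
for i in range(d):                           # ancestral sampling, Eq. 20.71
    p_i = sigmoid(W[:i, i] @ s[:i] + b[i])
    s[i] = float(rng.random() < p_i)
print(s)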
The most common structure of sigmoid belief network is one that is divided
into many layers, with ancestral sampling proceeding through a series of many
hidden layers and then ultimately generating the visible layer. This structure is
very similar to the deep belief network, except that the units at the beginning of
the sampling process are independent from each other, rather than sampled from
a restricted Boltzmann machine. Such a structure is interesting for a variety of
reasons. One reason is that the structure is a universal approximator of probability
distributions over the visible units, in the sense that it can approximate any
probability distribution over binary variables arbitrarily well, given enough depth,
even if the width of the individual layers is restricted to the dimensionality of the
visible layer (Sutskever and Hinton, 2008).
While generating a sample of the visible units is very efficient in a sigmoid
belief network, most other operations are not. Inference over the hidden units given
the visible units is intractable. Mean field inference is also intractable because the
variational lower bound involves taking expectations of cliques that encompass
entire layers. This problem has remained difficult enough to restrict the popularity
of directed discrete networks.
One approach for performing inference in a sigmoid belief network is to construct
a different lower bound that is specialized for sigmoid belief networks (Saul et al.,
1996). This approach has only been applied to very small networks. Another
approach is to use learned inference mechanisms as described in Sec. 19.5. The
Helmholtz machine (Dayan et al., 1995; Dayan and Hinton, 1996) is a sigmoid belief
network combined with an inference network that predicts the parameters of the
mean field distribution over the hidden units. Modern approaches (Gregor et al.,
2014; Mnih and Gregor, 2014) to sigmoid belief networks still use this inference
network approach. These techniques remain difficult due to the discrete nature of
the latent variables. One cannot simply back-propagate through the output of the
inference network, but instead must use the relatively unreliable machinery for back-
propagating through discrete sampling processes, described in Sec. 20.9.1. Recent
approaches based on importance sampling, reweighted wake-sleep (Bornschein
and Bengio, 2015) and bidirectional Helmholtz machines (Bornschein et al., 2015)
make it possible to quickly train sigmoid belief networks and reach state-of-the-art
performance on benchmark tasks.
A special case of sigmoid belief networks is the case where there are no latent
variables. Learning in this case is efficient, because there is no need to marginalize
latent variables out of the likelihood. A family of models called auto-regressive
networks generalize this fully visible belief network to other kinds of variables
besides binary variables and other structures of conditional distributions besides log-
linear relationships. Auto-regressive networks are described later, in Sec. 20.10.7.
20.10.2 Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z to samples x, or to distributions over samples x, using a differentiable function g(z; \theta^{(g)}), which is typically represented by a neural network. This model class includes variational autoencoders, which pair the generator net with an inference net; generative adversarial networks, which pair the generator network with a discriminator network; and techniques that train generator networks in isolation.

Generator networks are essentially just parametrized computational procedures for generating samples, where the architecture provides the family of possible distributions to sample from and the parameters select a distribution from within that family.

As an example, the standard procedure for drawing samples from a normal distribution with mean \mu and covariance \Sigma is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network. This generator network contains just one affine layer:

x = g(z) = \mu + L z,  (20.72)

where L is given by the Cholesky decomposition of \Sigma.
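This one-layer generator is short enough to write out completely; the sketch below draws correlated Gaussian samples via Eq. 20.72 and checks the empirical covariance.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)          # Sigma = L L^T

z = rng.normal(size=(100000, 2))       # z ~ N(0, I)
x = mu + z @ L.T                       # x = mu + L z, Eq. 20.72

print(np.cov(x, rowvar=False))         # approaches Sigma
print(x.mean(axis=0))                  # approaches mu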
Pseudorandom number generators can also use nonlinear transformations of simple distributions. For example, inverse transform sampling (Devroye, 2013) draws a scalar z from U(0, 1) and applies a nonlinear transformation to obtain a scalar x. In this case g(z) is given by the inverse of the cumulative distribution function F(x) = \int_{-\infty}^{x} p(v) \, dv. If we are able to specify p(x), integrate over x, and invert the resulting function, we can sample from p(x) without using machine learning.
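As a concrete instance, for the exponential distribution p(x) = \lambda e^{-\lambda x} the CDF F(x) = 1 - e^{-\lambda x} inverts in closed form, giving the classic sampler sketched here.

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
z = rng.random(100000)                 # z ~ U(0, 1)
x = -np.log(1.0 - z) / lam             # g(z) = F^{-1}(z) for Exponential(lam)

print(x.mean())                        # approaches 1 / lam = 0.5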
To generate samples from more complicated distributions that are difficult to specify directly, difficult to integrate over, or whose resulting integrals are difficult to invert, we use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function.

We can think of g as providing a nonlinear change of variables that transforms the distribution over z into the desired distribution over x. Recall from Eq. 3.47 that, for invertible, differentiable, continuous g,

p_z(z) = p_x(g(z)) \left| \det\left( \frac{\partial g}{\partial z} \right) \right|.  (20.73)

This implicitly imposes a probability distribution over x:

p_x(x) = \frac{ p_z(g^{-1}(x)) }{ \left| \det\left( \frac{\partial g}{\partial z} \right) \right| }.  (20.74)
Of course, this formula may be difficult to evaluate, depending on the choice of g, so we often use indirect means of learning g, rather than trying to maximize \log p(x) directly.

In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions:

p(x_i = 1 \mid z) = g(z)_i.  (20.75)

In this case, when we use g to define p(x \mid z), we impose a distribution over x by marginalizing z:

p(x) = E_z \, p(x \mid z).  (20.76)
Both approaches define a distribution p_g(x) and allow us to train various criteria of p_g using the reparametrization trick of Sec. 20.9.
The two approaches to formulating generator nets, emitting the parameters of a conditional distribution versus directly emitting samples, have complementary strengths and weaknesses. When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data (we could introduce discretization in the forward propagation, but doing so would lose the ability to learn the model using back-propagation). The advantage of direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.
Approaches based on differentiable generator networks are motivated by the
success of gradient descent applied to differentiable feedforward networks for
classification. In the context of supervised learning, deep feedforward networks
trained with gradient-based learning seem practically guaranteed to succeed given
enough hidden units and enough training data. Can this same recipe for success
transfer to generative modeling?
Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria. In the context of differentiable generator nets, the criteria are intractable because the data does not specify both the inputs z and the outputs x of the generator net. In the case of supervised learning, both the inputs x and the outputs y were given, and the optimization procedure needs only to learn how to produce the specified mapping. In the case of generative modeling, the learning procedure needs to determine how to arrange z space in a useful way and additionally how to map from z to x.
Dosovitskiy et al. (2015) studied a simplified problem, where the correspondence between z and x is given. Specifically, the training data is computer-rendered imagery of chairs. The latent variables z are parameters given to the rendering engine describing the choice of which chair model to use, the position of the chair, and other configuration details that affect the rendering of the image. Using this synthetically generated data, a convolutional network is able to learn to map z descriptions of the content of an image to x approximations of rendered images. This suggests that contemporary differentiable generator networks have sufficient model capacity to be good generative models, and that contemporary optimization algorithms have the ability to fit them. The difficulty lies in determining how to train generator networks when the value of z for each x is not fixed and known ahead of time.

The following sections describe several approaches to training differentiable generator nets given only training samples of x.
20.10.3 Variational Autoencoders
The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

To generate a sample from the model, the VAE first draws a sample z from the code distribution p_{model}(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_{model}(x; g(z)) = p_{model}(x \mid z). However, during training, the approximate inference network (or encoder) q(z \mid x) is used to obtain z, and p_{model}(x \mid z) is then viewed as a decoder network.
The key insight behind variational autoencoders is that they may be trained by maximizing the variational lower bound L(q) associated with data point x:

L(q) = E_{z \sim q(z \mid x)} \log p_{model}(z, x) + H(q(z \mid x))  (20.77)

= E_{z \sim q(z \mid x)} \log p_{model}(x \mid z) - D_{KL}(q(z \mid x) \| p_{model}(z))  (20.78)

\leq \log p_{model}(x).  (20.79)

In Eq. 20.77, we recognize the first term as the joint log-likelihood of the visible and hidden variables under the approximate posterior over the latent variables (just as with EM, except that we use an approximate rather than the exact posterior). We recognize also a second term, the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, with noise added to a predicted mean value, maximizing this entropy term encourages increasing the standard deviation of this noise. More generally, this entropy term encourages the variational posterior to place high probability mass on many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value. In Eq. 20.78, we recognize the first term as the reconstruction log-likelihood found in other autoencoders. The second term tries to make the approximate posterior distribution q(z \mid x) and the model prior p_{model}(z) approach each other.
Traditional approaches to variational inference and learning infer q via an optimization algorithm, typically iterated fixed point equations (Sec. 19.4). These approaches are slow and often require the ability to compute E_{z \sim q} \log p_{model}(z, x) in closed form. The main idea behind the variational autoencoder is to train a parametric encoder (also sometimes called an inference network or recognition model) that produces the parameters of q. So long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z \mid x) = q(z; f(x; \theta)) in order to obtain a gradient with respect to \theta. Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All of the expectations in L may be approximated by Monte Carlo sampling.
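To make the training objective concrete, the sketch below computes a single-sample Monte Carlo estimate of the bound in Eq. 20.78 for a Gaussian encoder and a Bernoulli decoder, both taken here to be one-layer linear networks for brevity; a real implementation would differentiate this quantity with a framework's automatic differentiation.

import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 20, 4
W_enc = 0.1 * rng.normal(size=(2 * d_z, d_x))   # encoder: x -> (mu, log sigma)
W_dec = 0.1 * rng.normal(size=(d_x, d_z))       # decoder: z -> Bernoulli logits
x = rng.integers(0, 2, size=d_x).astype(float)  # one binary data point

h = W_enc @ x
mu, log_sigma = h[:d_z], h[d_z:]
z = mu + np.exp(log_sigma) * rng.normal(size=d_z)      # reparametrized sample

logits = W_dec @ z
# E_q log p(x | z), single-sample estimate (Bernoulli log-likelihood):
log_px_z = np.sum(x * logits - np.logaddexp(0.0, logits))
# KL(q(z | x) || N(0, I)), available in closed form for diagonal Gaussians:
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)
elbo = log_px_z - kl                                   # Eq. 20.78
print(elbo)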
The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains excellent results and is among the state-of-the-art approaches to generative modeling. Its main drawback is that samples from variational autoencoders trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet known. One possibility is that the blurriness is an intrinsic effect of maximum likelihood, which minimizes D_{KL}(p_{data} \| p_{model}). As illustrated in Fig. 3.6, this means that the model will assign high probability to points that occur in the training set, but may also assign high probability to other points. These other points may include blurry images. Part of the reason that the model would choose to put probability mass on blurry images rather than some other part of the space is that the variational autoencoders used in practice usually have a Gaussian distribution for p_{model}(x; g(z)). Maximizing a lower bound on the likelihood of such a distribution is similar to training a traditional autoencoder with mean squared error, in the sense that it has a tendency to ignore features of the input that occupy few pixels or that cause only a small change in the brightness of the pixels that they occupy. This issue is not specific to VAEs and is shared with generative models that optimize a log-likelihood, or equivalently, D_{KL}(p_{data} \| p_{model}), as argued by Theis et al. (2015) and by Huszar (2015). Another troubling issue with contemporary VAE models is that they tend to use only a small subset of the dimensions of z, as if the encoder were not able to transform enough of the local directions in input space to a space where the marginal distribution matches the factorized prior.
The VAE framework is very straightforward to extend to a wide range of model
architectures. This is a key advantage over Boltzmann machines, which require
extremely careful model design to maintain tractability. VAEs work very well
with a diverse family of differentiable operators. One particularly sophisticated
VAE is the deep recurrent attention writer or DRAW model (Gregor et al., 2015).
DRAW uses a recurrent encoder and recurrent decoder combined with an attention
mechanism. The generation process for the DRAW model consists of sequentially
visiting different small image patches and drawing the values of the pixels at those
points. VAEs can also be extended to generate sequences by defining variational
RNNs (Chung et al., 2015b) by using a recurrent encoder and decoder within
the VAE framework. Generating a sample from a traditional RNN involves only
non-deterministic operations at the output space. Variational RNNs also have
random variability at the potentially more abstract level captured by the VAE
latent variables.
The VAE framework has been extended to maximize not just the traditional variational lower bound, but instead the importance weighted autoencoder (Burda et al., 2015) objective:

L_k(x, q) = E_{z^{(1)}, \ldots, z^{(k)} \sim q(z \mid x)} \left[ \log \frac{1}{k} \sum_{i=1}^{k} \frac{ p_{model}(x, z^{(i)}) }{ q(z^{(i)} \mid x) } \right].  (20.80)
This new objective is equivalent to the traditional lower bound L when k = 1. However, it may also be interpreted as forming an estimate of the true \log p_{model}(x) using importance sampling of z from proposal distribution q(z \mid x). The importance weighted autoencoder objective is also a lower bound on \log p_{model}(x) and becomes tighter as k increases.
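Numerically, the bound of Eq. 20.80 is a log of an average of importance weights, usually computed with the log-sum-exp trick for stability. A sketch, assuming that functions log_p_model(x, z) and log_q(z, x) plus a sampler for q are available from the model at hand:

import numpy as np

def iwae_bound(x, sample_q, log_p_model, log_q, k):
    """Single-datapoint estimate of L_k from Eq. 20.80.

    sample_q(x)       -> one z drawn from q(z | x)       (assumed given)
    log_p_model(x, z) -> log p_model(x, z)               (assumed given)
    log_q(z, x)       -> log q(z | x)                    (assumed given)
    """
    log_w = np.array([
        log_p_model(x, z) - log_q(z, x)
        for z in (sample_q(x) for _ in range(k))
    ])
    # log (1/k) sum_i exp(log_w_i), computed stably:
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))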
Variational autoencoders have some interesting connections to the MP-DBM
and other approaches that involve back-propagation through the approximate
inference graph (Goodfellow et al., 2013b; Stoyanov et al., 2011; Brakel et al., 2013).
These previous approaches required an inference procedure such as mean field fixed
point equations to provide the computational graph. The variational autoencoder
is defined for arbitrary computational graphs, which makes it applicable to a wider
range of probabilistic model families because there is no need to restrict the choice
of models to those with tractable mean field fixed point equations. The variational
autoencoder also has the advantage that it increases a bound on the log-likelihood
of the model, while the criteria for the MP-DBM and related models are more
heuristic and have little probabilistic interpretation beyond making the results of
approximate inference accurate. One disadvantage of the variational autoencoder
is that it learns an inference network for only one problem: inferring z given x. The older methods are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean field fixed point equations specify how to share parameters between the computational graphs for all of these different problems.
One very nice property of the variational autoencoder is that simultaneously
training a parametric encoder in combination with the generator network forces
the model to learn a predictable coordinate system that the encoder can capture.
This makes it an excellent manifold learning algorithm. See Fig. 20.6 for examples
of low-dimensional manifolds learned by the variational autoencoder. In one of the
cases demonstrated in the figure, the algorithm discovered two independent factors
of variation present in images of faces: angle of rotation and emotional expression.
Figure 20.6: Examples of two-dimensional coordinate systems for high-dimensional manifolds, learned by a variational autoencoder (Kingma and Welling, 2014a). Two dimensions may be plotted directly on the page for visualization, so we can gain an understanding of how the model works by training a model with a 2-D latent code, even if we believe the intrinsic dimensionality of the data manifold is much higher. The images shown are not examples from the training set but images x actually generated by the model p(x \mid z), simply by changing the 2-D “code” z (each image corresponds to a different choice of “code” z on a 2-D uniform grid). (Left) The two-dimensional map of the Frey faces manifold. One dimension that has been discovered (horizontal) mostly corresponds to a rotation of the face, while the other (vertical) corresponds to the emotional expression. (Right) The two-dimensional map of the MNIST manifold.
20.10.4 Generative Adversarial Networks
Generative adversarial networks or GANs (Goodfellow et al., 2014c) are another generative modeling approach based on differentiable generator networks.

Generative adversarial networks are based on a game theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples x = g(z; \theta^{(g)}). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; \theta^{(d)}), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function v(\theta^{(g)}, \theta^{(d)}) determines the payoff of the discriminator. The generator receives -v(\theta^{(g)}, \theta^{(d)}) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence

g^* = \arg\min_g \max_d v(g, d).  (20.81)
The default choice for v is

v(\theta^{(g)}, \theta^{(d)}) = E_{x \sim p_{data}} \log d(x) + E_{x \sim p_{model}} \log (1 - d(x)).  (20.82)

This drives the discriminator to attempt to learn to correctly classify samples as real or fake. Simultaneously, the generator attempts to fool the classifier into believing its samples are real. At convergence, the generator's samples are indistinguishable from real data, and the discriminator outputs \frac{1}{2} everywhere. The discriminator may then be discarded.
The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient. When \max_d v(g, d) is convex in \theta^{(g)} (as it is when optimization is performed directly in the space of probability density functions), the procedure is guaranteed to converge and is asymptotically consistent.
Unfortunately, learning in GANs can be difficult in practice when g and d are represented by neural networks and \max_d v(g, d) is not convex. Goodfellow (2014) identified non-convergence as an issue that may cause GANs to underfit. In general, simultaneous gradient descent on two players' costs is not guaranteed to reach an equilibrium. Consider for example the value function v(a, b) = ab, where one player controls a and incurs cost ab, while the other player controls b and incurs cost -ab. If we model each player as making infinitesimally small gradient steps, each player reducing their own cost at the expense of the other player, then a and b go into a stable, circular orbit, rather than arriving at the equilibrium point at the origin. Note that the equilibria for a minimax game are not local minima of v. Instead, they are points that are simultaneously minima for both players' costs. This means that they are saddle points of v that are local minima with respect to the first player's parameters and local maxima with respect to the second player's parameters. It is possible for the two players to take turns increasing then decreasing v forever, rather than landing exactly on the saddle point where neither player is capable of reducing their cost. It is not known to what extent this non-convergence problem affects GANs.
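The circular orbit is easy to reproduce. The sketch below runs simultaneous gradient steps on v(a, b) = ab; the distance from the origin fails to shrink (with finite step sizes it actually grows slowly), illustrating why simultaneous gradient descent need not find the equilibrium.

import numpy as np

a, b = 1.0, 0.0            # start away from the equilibrium at the origin
lr = 0.01
for t in range(10000):
    grad_a = b             # player 1 minimizes v = ab, so dv/da = b
    grad_b = -a            # player 2 minimizes -v = -ab, so d(-v)/db = -a
    a, b = a - lr * grad_a, b - lr * grad_b   # simultaneous gradient steps

print(np.hypot(a, b))      # the radius never decays toward 0 (here it has
                           # slowly grown), so the players orbit the saddle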
Goodfellow (2014) identified an alternative formulation of the payoffs, in which the game is no longer zero-sum, that has the same expected gradient as maximum likelihood learning whenever the discriminator is optimal. Because maximum likelihood training converges, this reformulation of the GAN game should also converge, given enough samples. Unfortunately, this alternative formulation does not seem to perform well in practice, possibly due to suboptimality of the discriminator, or possibly due to high variance around the expected gradient.
In practice, the best-performing formulation of the GAN game is a different for-
mulation that is neither zero-sum nor equivalent to maximum likelihood, introduced
by Goodfellow et al. (2014c) with a heuristic motivation. In this best-performing
formulation, the generator aims to increase the log probability that the discrimina-
tor makes a mistake, rather than aiming to decrease the log probability that the
discriminator makes the correct prediction. This reformulation is motivated solely
by the observation that it causes the derivative of the generator’s cost function
with respect to the discriminator’s logits to remain large even in the situation
where the discriminator confidently rejects all generator samples.
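The difference between the two generator losses is visible in a two-line derivative computation. With d = sigmoid(l) for discriminator logit l, the minimax generator loss log(1 - d) has gradient -d with respect to l, which vanishes when the discriminator confidently rejects a sample (d near 0), while the heuristic loss -log d has gradient -(1 - d), which stays large in exactly that regime:

import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

for logit in [-6.0, -2.0, 0.0, 2.0]:        # discriminator logit on a fake sample
    d = sigmoid(logit)
    grad_minimax = -d                        # d/dl of log(1 - d): saturates as d -> 0
    grad_heuristic = -(1.0 - d)              # d/dl of -log d: stays near -1 as d -> 0
    print(f"d={d:.4f}  minimax grad={grad_minimax:.4f}  "
          f"heuristic grad={grad_heuristic:.4f}")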
Stabilization of GAN learning remains an open problem. Fortunately, GAN
learning performs well when the model architecture and hyperparameters are care-
fully selected. Radford et al. (2015) crafted a deep convolutional GAN (DCGAN)
that performs very well for image synthesis tasks, and showed that its latent
representation space captures important factors of variation, as shown in Fig. 15.9.
See Fig. 20.7 for examples of images generated by a DCGAN generator.
The GAN learning problem can also be simplified by breaking the generation process into many levels of detail. It is possible to train conditional GANs (Mirza and Osindero, 2014) that learn to sample from a distribution p(x \mid y) rather than simply sampling from a marginal distribution p(x). Denton et al. (2015) showed that a series of conditional GANs can be trained to first generate a very low-resolution version of an image, then incrementally add details to the image.
Figure 20.7: Images generated by GANs trained on the LSUN dataset. (Left) Images
of bedrooms generated by a DCGAN model, reproduced with permission from Radford
et al. (2015). (Right) Images of churches generated by a LAPGAN model, reproduced
with permission from Denton et al. (2015).
This technique is called the LAPGAN model, due to the use of a Laplacian pyramid
to generate the images containing varying levels of detail. LAPGAN generators
are able to fool not only discriminator networks but also human observers, with
experimental subjects identifying up to 40% of the outputs of the network as being
real data. See Fig. 20.7 for examples of images generated by a LAPGAN generator.
One unusual capability of the GAN training procedure is that it can fit proba-
bility distributions that assign zero probability to the training points. Rather than
maximizing the log probability of specific points, the generator net learns to trace
out a manifold whose points resemble training points in some way. Somewhat para-
doxically, this means that the model may assign a log-likelihood of negative infinity
to the test set, while still representing a manifold that a human observer judges
to capture the essence of the generation task. This is not clearly an advantage or
a disadvantage, and one may also guarantee that the generator network assigns
non-zero probability to all points simply by making the last layer of the generator
network add Gaussian noise to all of the generated values. Generator networks
that add Gaussian noise in this manner sample from the same distribution that one
obtains by using the generator network to parametrize the mean of a conditional
Gaussian distribution.
Dropout seems to be important in the discriminator network. In particular,
units should be stochastically dropped while computing the gradient for the
generator network to follow. Following the gradient of the deterministic version of
the discriminator with its weights divided by two does not seem to be as effective.
Likewise, never using dropout seems to yield poor results.
While the GAN framework is designed for differentiable generator networks,
similar principles can be used to train other kinds of models. For example, self-
supervised boosting can be used to train an RBM generator to fool a logistic
regression discriminator (Welling et al., 2002).
20.10.5 Generative Moment Matching Networks
Generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015)
are another form of generative model based on differentiable generator networks.
Unlike VAEs and GANs, they do not need to pair the generator network with any
other network—neither an inference network as used with VAEs nor a discriminator
network as used with GANs.
These networks are trained with a technique called moment matching. The
basic idea behind moment matching is to train the generator in such a way that
many of the statistics of samples generated by the model are as similar as possible
to those of the statistics of the examples in the training set. In this context, a
moment is an expectation of different powers of a random variable. For example,
the first moment is the mean, the second moment is the mean of the squared
values, and so on. In multiple dimensions, each element of the random vector may
be raised to different powers, so that a moment may be any quantity of the form
\[
\mathbb{E}_{\mathbf{x}}\Big[\prod_i x_i^{n_i}\Big]
\tag{20.83}
\]
where n = [n_1, n_2, . . . , n_d]^⊤ is a vector of non-negative integers.
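As a concrete illustration, a moment of this form can be estimated from samples
by averaging the corresponding product of powers. The following is a minimal
numpy sketch; the function name and array shapes are our own choices:

```python
import numpy as np

def empirical_moment(samples, n):
    # samples: (m, d) array holding m samples of a d-dimensional vector.
    # n: length-d sequence of non-negative integer powers.
    # Monte Carlo estimate of E_x[ prod_i x_i^{n_i} ].
    return np.mean(np.prod(samples ** np.asarray(n), axis=1))

# Example: estimate the cross moment E[x_1 x_2] from Gaussian samples.
rng = np.random.default_rng(0)
samples = rng.standard_normal((10000, 2))
print(empirical_moment(samples, [1, 1]))  # near 0 for independent x_1, x_2
```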
Upon first examination, this approach seems to be computationally infeasible.
For example, if we want to match all the moments of the form x_i x_j, then we need
to minimize the difference between a number of values that is quadratic in the
dimension of x. Moreover, even matching all of the first and second moments
would only be sufficient to fit a multivariate Gaussian distribution, which captures
only linear relationships between values. Our ambitions for neural networks are to
capture complex nonlinear relationships, which would require far more moments.
GANs avoid this problem of exhaustively enumerating all moments by using a
dynamically updated discriminator that automatically focuses its attention on
whichever statistic the generator network is matching the least effectively.
Instead, generative moment matching networks can be trained by minimizing
a cost function called maximum mean discrepancy (Schölkopf and Smola, 2002;
Gretton et al., 2012), or MMD. This cost function measures the error in the first
moments in an infinite-dimensional space, using an implicit mapping to a feature
space defined by a kernel function in order to make computations on infinite-
dimensional vectors tractable. The MMD cost is zero if and only if the two
distributions being compared are equal.
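To show the shape of the computation, here is a minimal numpy sketch of a
(biased) squared-MMD estimator with a Gaussian kernel; the kernel choice,
bandwidth, and function names are our own assumptions rather than details fixed
by the papers cited above:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel values k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimator of the squared MMD between a batch of data x
    # and a batch of model samples y (both of shape (batch, dim)).
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())
```

Because the estimator is an average over a batch, its quality depends on the batch
size, a point discussed below.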
Visually, the samples from generative moment matching networks are somewhat
disappointing. Fortunately, they can be improved by combining the generator
network with an autoencoder. First, an autoencoder is trained to reconstruct the
training set. Next, the encoder of the autoencoder is used to transform the entire
training set into code space. The generator network is then trained to generate
code samples, which may be mapped to visually pleasing samples via the decoder.
Unlike GANs, the cost function is defined only with respect to a batch of
examples from both the training set and the generator network. It is not possible
to make a training update as a function of only one training example or only
one sample from the generator network. This is because the moments must be
computed as an empirical average across many samples. When the batch size is too
small, MMD can underestimate the true amount of variation in the distributions
being sampled. No finite batch size is sufficiently large to eliminate this problem
entirely, but larger batches reduce the amount of underestimation. When the batch
size is too large, the training procedure becomes infeasibly slow, because many
examples must be processed in order to compute a single small gradient step.
As with GANs, it is possible to train a generator net using MMD even if that
generator net assigns zero probability to the training points.
20.10.6 Convolutional Generative Networks
When generating images, it is often useful to use a generator network that includes
a convolutional structure (see for example Goodfellow et al. (2014c) or Dosovitskiy
et al. (2015)). To do so, we use the “transpose” of the convolution operator,
described in Sec. 9.5. This approach often yields more realistic images and does
so using fewer parameters than using fully connected layers without parameter
sharing.
Convolutional networks for recognition tasks have information flow from the
image to some summarization layer at the top of the network, often a class label.
As this image flows upward through the network, information is discarded as the
representation of the image becomes more invariant to nuisance transformations.
In a generator network, the opposite is true. Rich details must be added as
the representation of the image to be generated propagates through the network,
culminating in the final representation of the image, which is of course the image
itself, in all of its detailed glory, with object positions and poses and textures and
lighting. The primary mechanism for discarding information in a convolutional
recognition network is the pooling layer. The generator network seems to need to
add information. We cannot put the inverse of a pooling layer into the generator
network because most pooling functions are not invertible. A simpler operation is
to merely increase the spatial size of the representation. An approach that seems
to perform acceptably is to use the “un-pooling” operation introduced by Dosovitskiy
et al. (2015). This layer corresponds to the inverse of the max-pooling operation under
certain simplifying conditions. First, the stride of the max-pooling operation is
constrained to be equal to the width of the pooling region. Second, the maximum
input within each pooling region is assumed to be the input in the upper-left
corner. Finally, all non-maximal inputs within each pooling region are assumed to
be zero. These are very strong and unrealistic assumptions, but they do allow the
max-pooling operator to be inverted. The inverse un-pooling operation allocates
a tensor of zeros, then copies each value from spatial coordinate i of the input
to spatial coordinate i × k of the output. The integer value k defines the size
of the pooling region. Even though the assumptions motivating the definition of
the un-pooling operator are unrealistic, the subsequent layers are able to learn to
compensate for its unusual output, so the samples generated by the model as a
whole are visually pleasing.
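A minimal numpy sketch of this un-pooling operation for a single 2-D feature
map (the function name and shapes are our own simplifications):

```python
import numpy as np

def unpool(x, k):
    # Copies each input value at spatial coordinate (i, j) to coordinate
    # (i * k, j * k) of an output tensor of zeros, where k is the width
    # of the (square, stride-k) pooling region being inverted.
    h, w = x.shape
    out = np.zeros((h * k, w * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out
```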
20.10.7 Auto-Regressive Networks
Auto-regressive networks are directed probabilistic models with no latent random
variables. The conditional probability distributions in these models are represented
by neural networks (sometimes extremely simple neural networks such as logistic
regression). The graph structure of these models is the complete graph. They
decompose a joint probability over the observed variables using the chain rule of
probability to obtain a product of conditionals of the form P(x_d | x_{d−1}, . . . , x_1).
Such models have been called fully-visible Bayes networks (FVBNs) and used
successfully in many forms, first with logistic regression for each conditional
distribution (Frey, 1998) and then with neural networks with hidden units (Bengio
and Bengio, 2000b; Larochelle and Murray, 2011). In some forms of auto-
regressive networks, such as NADE (Larochelle and Murray, 2011), described in
Sec. 20.10.10 below, we can introduce a form of parameter sharing that brings both
a statistical advantage (fewer unique parameters) and a computational advantage
(less computation). This is one more instance of the recurring deep learning motif
of reuse of features.
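The chain-rule decomposition translates directly into ancestral sampling: draw
x_1, then each subsequent x_i from its conditional given the values drawn so far.
A minimal sketch for binary variables, assuming the conditionals are supplied as
callables (an interface of our own invention, for illustration):

```python
import numpy as np

def sample_autoregressive(conditionals, d, rng=None):
    # conditionals[i](x[:i]) is assumed to return P(x_i = 1 | x_1, ..., x_{i-1}).
    rng = rng or np.random.default_rng()
    x = np.zeros(d, dtype=int)
    for i in range(d):
        x[i] = rng.random() < conditionals[i](x[:i])
    return x
```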
Figure 20.8: A fully visible belief network predicts the i-th variable from the i − 1
previous ones. (Top) The directed graphical model for an FVBN. (Bottom) Corresponding
computational graph, in the case of the logistic FVBN, where each prediction is made by
a linear predictor.
20.10.8 Linear Auto-Regressive Networks
The simplest form of auto-regressive network has no hidden units and no sharing
of parameters or features. Each P(x_i | x_{i−1}, . . . , x_1) is parametrized as a linear
model (linear regression for real-valued data, logistic regression for binary data,
softmax regression for discrete data). This model was introduced by Frey (1998)
and has O(d^2) parameters when there are d variables to model. It is illustrated in
Fig. 20.8.
If the variables are continuous, a linear auto-regressive model is merely another
way to formulate a multivariate Gaussian distribution, capturing linear pairwise
interactions between the observed variables.
Linear auto-regressive networks are essentially the generalization of linear
classification methods to generative modeling. They therefore have the same
advantages and disadvantages as linear classifiers. Like linear classifiers, they may
be trained with convex loss functions, and sometimes admit closed form solutions
(as in the Gaussian case). Like linear classifiers, the model itself does not offer
a way of increasing its capacity, so capacity must be raised using techniques like
basis expansions of the input or the kernel trick.
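A minimal numpy sketch of the log-likelihood of a logistic (binary) linear
auto-regressive network; the strictly lower-triangular weight matrix makes the
O(d^2) parameter count explicit (the names and shapes are our own):

```python
import numpy as np

def logistic_fvbn_logprob(x, W, b):
    # x: binary vector of length d.
    # W: (d, d) strictly lower-triangular weights, so row i parametrizes
    #    P(x_i = 1 | x_1, ..., x_{i-1}) -- O(d^2) parameters in total.
    # b: (d,) biases.
    logits = W @ x + b
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```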
Figure 20.9: A neural auto-regressive network predicts the i-th variable x_i from the
i − 1 previous ones, but is parametrized so that features (groups of hidden units denoted
h_i) that are functions of x_1, . . . , x_i can be reused in predicting all of the subsequent
variables x_{i+1}, x_{i+2}, . . . , x_d.
20.10.9 Neural Auto-Regressive Networks
Neural auto-regressive networks (Bengio and Bengio, 2000a,b) have the same
left-to-right graphical model as logistic auto-regressive networks (Fig. 20.8) but
employ a different parametrization of the conditional distributions within that
graphical model structure. The new parametrization is more powerful in the sense
that its capacity can be increased as much as needed, allowing approximation of
any joint distribution. The new parametrization can also improve generalization
by introducing a parameter sharing and feature sharing principle common to deep
learning in general. The models were motivated by the objective of avoiding the
curse of dimensionality arising out of traditional tabular graphical models, sharing
the same structure as Fig. 20.8. In tabular discrete probabilistic models, each
conditional distribution is represented by a table of probabilities, with one entry
and one parameter for each possible configuration of the variables involved. By
using a neural network instead, two advantages are obtained:
1. The parametrization of each P(x_i | x_{i−1}, . . . , x_1) by a neural network with
(i − 1) × k inputs and k outputs (if the variables are discrete and take k
values, encoded one-hot) allows one to estimate the conditional probability
without requiring an exponential number of parameters (and examples), yet
still is able to capture high-order dependencies between the random variables.

2. Instead of having a different neural network for the prediction of each x_i,
a left-to-right connectivity illustrated in Fig. 20.9 allows one to merge all
the neural networks into one. Equivalently, it means that the hidden layer
features computed for predicting x_i can be reused for predicting x_{i+k} (k > 0).
The hidden units are thus organized in groups that have the particularity
that all the units in the i-th group only depend on the input values x_1, . . . , x_i.
The parameters used to compute these hidden units are jointly optimized
to improve the prediction of all the variables in the sequence. This is
an instance of the reuse principle that recurs throughout deep learning in
scenarios ranging from recurrent and convolutional network architectures to
multi-task and transfer learning.
Each P(x_i | x_{i−1}, . . . , x_1) can represent a conditional distribution by having
outputs of the neural network predict parameters of the conditional distribution
of x_i, as discussed in Sec. 6.2.1.1. Although the original neural auto-regressive
networks were initially evaluated in the context of purely discrete multivariate
data (with a sigmoid output for a Bernoulli variable or softmax output for a
multinoulli variable), it is natural to extend such models to continuous variables or
joint distributions involving both discrete and continuous variables.
20.10.10 NADE
The neural autoregressive density estimator (NADE) is a very successful recent form
of neural auto-regressive network (Larochelle and Murray, 2011). The connectivity
is the same as for the original neural auto-regressive network of Bengio and
Bengio (2000b), but NADE introduces an additional parameter sharing scheme, as
illustrated in Fig. 20.10. The parameters of the hidden units of different groups j
are shared. The weights W′_{j,k,i} from the i-th input x_i to the k-th element of the
j-th group of hidden units h^{(j)}_k (for j ≥ i) are shared among the groups:
\[
W'_{j,k,i} = W_{k,i}.
\tag{20.84}
\]
The remaining weights, where j < i, are zero.
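To make the sharing concrete, here is a minimal numpy sketch of the NADE
forward pass for binary inputs (the names and shapes are our own); note how the
running pre-activation lets each shared column W_{:,i} be folded in once and then
reused by every later group:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_logprob(x, W, V, b, c):
    # x: binary vector of length d.
    # W: (n_hidden, d) shared input-to-hidden weights (columns W[:, i]).
    # V: (d, n_hidden) output weights, b: (d,) output biases,
    # c: (n_hidden,) hidden biases.
    d = x.shape[0]
    a = c.copy()  # pre-activation of the current hidden group
    log_p = 0.0
    for i in range(d):
        h = sigmoid(a)  # h^(i) depends only on x_1, ..., x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h)
        log_p += x[i] * np.log(p_i) + (1 - x[i]) * np.log(1 - p_i)
        a += W[:, i] * x[i]  # shared weights reused by all later groups
    return log_p
```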
Figure 20.10: An illustration of the neural autoregressive density estimator (NADE). The
hidden units are organized in groups h^{(j)} so that only the inputs x_1, . . . , x_i participate
in computing h^{(i)} and predicting P(x_j | x_{j−1}, . . . , x_1), for j > i. NADE is differentiated
from earlier neural auto-regressive networks by the use of a particular weight sharing
pattern: W′_{j,k,i} = W_{k,i} is shared (indicated in the figure by the use of the same line pattern
for every instance of a replicated weight) for all the weights going out from x_i to the k-th
unit of any group j ≥ i. Recall that the vector (W_{1,i}, W_{2,i}, . . . , W_{n,i}) is denoted W_{:,i}.

Larochelle and Murray (2011) chose this sharing scheme so that forward
propagation in a NADE model loosely resembles the computations performed in
mean field inference to fill in missing inputs in an RBM. This mean field inference
corresponds to running a recurrent network with shared weights and the first step
of that inference is the same as in NADE. The only difference is that with NADE,
the output weights connecting the hidden units to the output are parametrized
independently from the weights connecting the input units to the hidden units. In
the RBM, the hidden-to-output weights are the transpose of the input-to-hidden
weights. The NADE architecture can be extended to mimic not just one time step
of the mean field recurrent inference but to mimic k steps. This approach is called
NADE-k (Raiko et al., 2014).
As mentioned previously, auto-regressive networks may be extended to process
continuous-valued data. A particularly powerful and generic way of parametrizing a
continuous density is as a Gaussian mixture (introduced in Sec. 3.9.6) with mixture
weights α_i (the coefficient or prior probability for component i), per-component
conditional mean µ_i and per-component conditional variance σ²_i. A model called
RNADE (Uria et al., 2013) uses this parametrization to extend NADE to real
values. As with other mixture density networks, the parameters of this distribution
are outputs of the network, with the mixture weight probabilities produced by a
softmax unit, and the variances parametrized so that they are positive. Stochastic
gradient descent can be numerically ill-behaved due to the interactions between the
conditional means µ_i and the conditional variances σ²_i. To reduce this difficulty,
Uria et al. (2013) use a pseudo-gradient that replaces the gradient on the mean in
the back-propagation phase.
Another very interesting extension of the neural auto-regressive architectures
gets rid of the need to choose an arbitrary order for the observed variables (Murray
and Larochelle, 2014). In auto-regressive networks, the idea is to train the network
to be able to cope with any order by randomly sampling orders and providing the
information to hidden units specifying which of the inputs are observed (on the
right side of the conditioning bar) and which are to be predicted and are thus
considered missing (on the left side of the conditioning bar). This is nice because
it allows one to use a trained auto-regressive network to perform any inference
problem (i.e. predict or sample from the probability distribution over any subset
of variables given any subset) extremely efficiently. Finally, since many orders of
variables are possible (n! for n variables) and each order o of variables yields a
different p(x | o), we can form an ensemble of models for many values of o:
\[
p_{\text{ensemble}}(\mathbf{x}) = \frac{1}{k}\sum_{i=1}^{k} p(\mathbf{x} \mid o^{(i)}).
\tag{20.85}
\]
This ensemble model usually generalizes better and assigns higher probability to
the test set than does an individual model defined by a single ordering.
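Concretely, the ensemble log-probability can be computed stably from the
per-ordering log-probabilities with a log-sum-exp; a minimal sketch, assuming a
routine that evaluates log p(x | o) is available:

```python
import numpy as np

def ensemble_logprob(x, logprob_given_order, orders):
    # logprob_given_order(x, o) is assumed to return log p(x | o) for one
    # ordering o; `orders` holds the k sampled orderings of the ensemble.
    log_ps = np.array([logprob_given_order(x, o) for o in orders])
    m = log_ps.max()  # log of the mean via a stable log-sum-exp
    return m + np.log(np.mean(np.exp(log_ps - m)))
```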
In the same paper, the authors propose deep versions of the architecture, but
unfortunately that immediately makes computation as expensive as in the original
neural auto-regressive network (Bengio and Bengio, 2000b). The first layer
and the output layer can still be computed in O(nh) multiply-add operations,
as in the regular NADE, where h is the number of hidden units (the size of the
groups h_i, in Fig. 20.10 and Fig. 20.9), whereas it is O(n^2 h) in Bengio and Bengio
(2000b). However, for the other hidden layers, the computation is O(n^2 h^2) if every
“previous” group at layer l participates in predicting the “next” group at layer l + 1,
assuming n groups of h hidden units at each layer. Making the i-th group at layer
l + 1 only depend on the i-th group at layer l, as in Murray and Larochelle (2014),
reduces the cost to O(nh^2), which is still h times worse than the regular NADE.
20.11 Drawing Samples from Autoencoders
In Chapter 14, we saw that many kinds of autoencoders learn the data distribution.
There are close connections between score matching, denoising autoencoders, and
contractive autoencoders. These connections demonstrate that some kinds of
autoencoders learn the data distribution in some way. We have not yet seen how
to draw samples from such models.
Some kinds of autoencoders, such as the variational autoencoder, explicitly
represent a probability distribution and admit straightforward ancestral sampling.
Most other kinds of autoencoders require MCMC sampling.
Contractive autoencoders are designed to recover an estimate of the tangent
plane of the data manifold. This means that repeated encoding and decoding with
injected noise will induce a random walk along the surface of the manifold (Rifai
et al., 2012; Mesnil et al., 2012). This manifold diffusion technique is a kind of
Markov chain.
There is also a more general Markov chain that can sample from any denoising
autoencoder.
20.11.1 Markov Chain Associated with any Denoising Autoencoder
The above discussion left open the question of what noise to inject and where, in
order to obtain a Markov chain that would generate from the distribution estimated
by the autoencoder. Bengio et al. (2013c) showed how to construct such a Markov
chain for generalized denoising autoencoders. Generalized denoising autoencoders
are specified by a denoising distribution for sampling an estimate of the clean input
given the corrupted input.
Each step of the Markov chain that generates from the estimated distribution
consists of the following sub-steps, illustrated in Fig. 20.11:
1. Starting from the previous state x, inject corruption noise, sampling x̃ from
C(x̃ | x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of p(x | ω = g(h)) = p(x | x̃).

4. Sample the next state x from p(x | ω = g(h)) = p(x | x̃). (These four
sub-steps are sketched in code below.)
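The four sub-steps map directly onto code. A minimal sketch, where the encoder f,
decoder g, corruption sampler, and reconstruction sampler are assumed to be
supplied by the trained model:

```python
def dae_markov_step(x, f, g, sample_corruption, sample_reconstruction, rng):
    # One step of the generative Markov chain of a denoising autoencoder.
    x_tilde = sample_corruption(x, rng)        # 1. draw x~ from C(x~ | x)
    h = f(x_tilde)                             # 2. encode
    omega = g(h)                               # 3. decode to parameters
    return sample_reconstruction(omega, rng)   # 4. draw next state from p(x | omega)
```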
Bengio et al. (2014) showed that if the autoencoder p(x | x̃) forms a consistent
estimator of the corresponding true conditional distribution, then the stationary
distribution of the above Markov chain forms a consistent estimator (albeit an
implicit one) of the data generating distribution of x.
20.11.2 Clamping and Conditional Sampling
Similarly to Boltzmann machines, denoising autoencoders and their generalizations
(such as GSNs, described below) can be used to sample from a conditional distri-
bution p(x_f | x_o), simply by clamping the observed units x_o and only resampling
Figure 20.11: Each step of the Markov chain associated with a trained denoising autoen-
coder, that generates the samples from the probabilistic model implicitly trained by the
denoising log-likelihood criterion. Each step consists in (a) injecting noise via corruption
process C in state x, yielding x̃, (b) encoding it with function f, yielding h = f(x̃),
(c) decoding the result with function g, yielding parameters ω for the reconstruction
distribution, and (d) given ω, sampling a new state from the reconstruction distribution
p(x | ω = g(f(x̃))). In the typical squared reconstruction error case, g(h) = x̂, which
estimates E[x | x̃], corruption consists in adding Gaussian noise, and sampling from
p(x | ω) consists in adding Gaussian noise a second time, to the reconstruction x̂. The
latter noise level should correspond to the mean squared error of reconstructions, whereas
the injected noise is a hyperparameter that controls the mixing speed as well as the
extent to which the estimator smooths the empirical distribution (Vincent, 2011). In the
example illustrated here, only the C and p conditionals are stochastic steps (f and g are
deterministic computations), although noise can also be injected inside the autoencoder,
as in generative stochastic networks (Bengio et al., 2014).
the free units x_f given x_o and the sampled latent variables (if any). For example,
MP-DBMs can be interpreted as a form of denoising autoencoder, and are able
to sample missing inputs. GSNs later generalized some of the ideas present in
MP-DBMs to perform the same operation (Bengio et al., 2014). Alain et al. (2015)
identified a missing condition from Proposition 1 of Bengio et al. (2014), which is
that the transition operator (defined by the stochastic mapping going from one
state of the chain to the next) should satisfy a property called detailed balance,
which specifies that a Markov chain at equilibrium will remain in equilibrium
whether the transition operator is run forward or in reverse.
An experiment in clamping half of the pixels (the right part of the image) and
running the Markov chain on the other half is shown in Fig. 20.12.
Figure 20.12: Illustration of clamping the right half of the image and running the Markov
Chain by resampling only the left half at each step. These samples come from a GSN
trained to reconstruct MNIST digits at each time step using the walkback procedure.
20.11.3 Walk-Back Training Procedure
The walk-back training procedure was proposed by Bengio et al. (2013c) as a way
to accelerate the convergence of generative training of denoising autoencoders.
Instead of performing a one-step encode-decode reconstruction, this procedure
consists in alternating multiple stochastic encode-decode steps (as in the generative
Markov chain), initialized at a training example (just like with the contrastive
divergence algorithm, described in Sec. 18.2), and penalizing the last probabilistic
reconstructions (or all of the reconstructions along the way).
Training with k steps is equivalent (in the sense of achieving the same stationary
distribution) to training with one step, but practically has the advantage that
spurious modes farther from the data can be removed more efficiently.
20.12 Generative Stochastic Networks
Generative stochastic networks, or GSNs (Bengio et al., 2014), are generalizations of
denoising autoencoders that include latent variables h in the generative Markov
chain, in addition to the visible variables (usually denoted x).
A GSN is parametrized by two conditional probability distributions which
specify one step of the Markov chain:
1. p(x^{(k)} | h^{(k)}) tells how to generate the next visible variable given the current
latent state. Such a “reconstruction distribution” is also found in denoising
autoencoders, RBMs, DBNs and DBMs.

2. p(h^{(k)} | h^{(k−1)}, x^{(k−1)}) tells how to update the latent state variable, given
the previous latent state and visible variable. (One full step combining both
conditionals is sketched below.)
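A minimal sketch of running this chain, assuming samplers for the two conditionals
are supplied by a trained model (the interface is our own):

```python
def gsn_chain(x0, sample_h, sample_x, n_steps, rng):
    # sample_h(h_prev, x_prev, rng) draws h^(k) ~ p(h^(k) | h^(k-1), x^(k-1));
    # sample_x(h, rng) draws x^(k) ~ p(x^(k) | h^(k)).
    # sample_h is assumed to handle h_prev=None at the first step.
    h, x = None, x0
    samples = []
    for _ in range(n_steps):
        h = sample_h(h, x, rng)
        x = sample_x(h, rng)
        samples.append(x)
    return samples
```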
Denoising autoencoders and GSNs differ from classical probabilistic models
(directed or undirected) in that they parametrize the generative process itself rather
than the mathematical specification of the joint distribution of visible and latent
variables. Instead, the latter is defined implicitly, if it exists, as the stationary
distribution of the generative Markov chain. The conditions for existence of the
stationary distribution are mild and are the same conditions required by standard
MCMC methods (see Sec. 17.3). These conditions are necessary to guarantee
that the chain mixes, but they can be violated by some choices of the transition
distributions (for example, if they were deterministic).
One could imagine different training criteria for GSNs. The one proposed and
evaluated by Bengio et al. (2014) is simply reconstruction log-probability on the
visible units, just like for denoising autoencoders. This is achieved by clamping
x^{(0)} = x to the observed example and maximizing the probability of generating
x at some subsequent time steps, i.e., maximizing log p(x^{(k)} = x | h^{(k)}), where
h^{(k)} is sampled from the chain, given x^{(0)} = x. In order to estimate the gradient of
log p(x^{(k)} = x | h^{(k)}) with respect to the other pieces of the model, Bengio et al.
(2014) use the reparametrization trick, introduced in Sec. 20.9.
The walk-back training protocol (described in Sec. 20.11.3) was used (Bengio
et al., 2014) to improve training convergence of GSNs.
20.12.1 Discriminant GSNs
The original formulation of GSNs (Bengio et al., 2014) was meant for unsupervised
learning and implicitly modeling p(x) for observed data x, but it is possible to
modify the framework to optimize p(y | x).
For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only
back-propagating the reconstruction log-probability over the output variables, keep-
ing the input variables fixed. They applied this successfully to model sequences
(protein secondary structure) and introduced a (one-dimensional) convolutional
structure in the transition operator of the Markov chain. It is important to re-
member that, for each step of the Markov chain, one generates a new sequence for
each layer, and that sequence is the input for computing other layer values (say
the one below and the one above) at the next time step.
Hence the Markov chain is really over the output variable (and associated
higher-level hidden layers), and the input sequence only serves to condition that
chain, with back-propagation allowing one to learn how the input sequence can
condition the output distribution implicitly represented by the Markov chain. It is
therefore a case of using the GSN in the context of structured outputs, where p(y | x)
does not have a simple parametric form but instead the components of y are
statistically dependent on each other, given x, in complicated ways.
Zöhrer and Pernkopf (2014) introduced a hybrid model that combines a super-
vised objective (as in the above work) and an unsupervised objective (as in the
original GSN work), by simply adding (with a different weight) the supervised and
unsupervised costs, i.e., the reconstruction log-probabilities of y and x respectively.
Such a hybrid criterion had previously been introduced for RBMs by Larochelle
and Bengio (2008). Zöhrer and Pernkopf (2014) showed improved classification
performance using this scheme.
20.13 Other Generation Schemes
The methods we have described so far use either MCMC sampling, ancestral
sampling, or some mixture of the two to generate samples. While these are the
most popular approaches to generative modeling, they are by no means the only
approaches.
Sohl-Dickstein et al. (2015) developed a diffusion inversion training scheme
for learning a generative model, based on non-equilibrium thermodynamics. The
approach is based on the idea that the probability distributions we wish to sample
from have structure. This structure can gradually be destroyed by a diffusion
process that incrementally changes the probability distribution to have more
entropy. To form a generative model, we can run the process in reverse, by training
a model that gradually restores the structure to an unstructured distribution. By
iteratively applying a process that brings a distribution closer to the target one, we
can gradually approach that target distribution. This approach resembles MCMC
methods in the sense that it involves many iterations to produce a sample. However,
the model is defined to be the probability distribution produced by the final step
of the chain. In this sense, there is no approximation induced by the iterative
procedure. The approach introduced by Sohl-Dickstein et al. (2015) is also very
close to the generative interpretation of the denoising autoencoder (Sec. 20.11.1).
As with the denoising autoencoder, the training objective trains a transition
operator that attempts to probabilistically undo the effect of adding some noise,
that is, to undo one step of the diffusion process. If we compare with the walk-back
training procedure (Sec. 20.11.3) for denoising autoencoders and GSNs, the main
difference is that instead of reconstructing only towards the observed training point
x, the objective function only tries to reconstruct towards the previous point in
the diffusion trajectory that started at x (which should be easier). This addresses
the following dilemma present with the ordinary reconstruction log-likelihood
objective of denoising autoencoders: with small levels of noise the learner only sees
configurations near the data points, while with large levels of noise it is asked to do
an almost impossible job (because the denoising distribution is going to be highly
complex and multi-modal). With the diffusion inversion objective, the learner can
learn more precisely the shape of the density around the data points as well as
remove spurious modes that could show up far from the data points.
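As a toy illustration of the forward half of this idea, the following numpy sketch
runs a Gaussian diffusion that incrementally destroys structure (the constant
variance schedule beta and the names are our own simplifications; a learned model
would then be trained to undo each step):

```python
import numpy as np

def diffuse(x, n_steps, beta, rng):
    # Each step mixes the current sample toward standard Gaussian noise,
    # gradually raising the entropy of the distribution of x.
    trajectory = [x]
    for _ in range(n_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        trajectory.append(x)
    # A reverse model is trained to reconstruct trajectory[t] from
    # trajectory[t + 1], one step at a time.
    return trajectory
```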
Another approach to sample generation is the approximate Bayesian computa-
tion (ABC) framework (Rubin et al., 1984). In this approach, samples are rejected
or modified in order to make the moments of selected functions of the samples
match those of the desired distribution. While this idea uses the moments of the
samples like in moment matching, it is different from moment matching because it
modifies the samples themselves, rather than training the model to automatically
emit samples with the correct moments. Bachman and Precup (2015) showed how
to use ideas from ABC in the context of deep learning, by using ABC to shape the
MCMC trajectories of GSNs.
We expect that many other possible approaches to generative modeling await
discovery.
20.14 Evaluating Generative Models
Researchers studying generative models often need to compare one generative
model to another, usually in order to demonstrate that a newly invented generative
model is better at capturing some distribution than the pre-existing models.
This can be a difficult and subtle task. In many cases, we can not actually
evaluate the log probability of the data under the model, but only an approximation.
In these cases, it is important to think and communicate clearly about exactly what
is being measured. For example, suppose we can evaluate a stochastic estimate of
the log-likelihood for model A, and a deterministic lower bound on the log-likelihood
for model B. If model A gets a higher score than model B, which is better? If we
care about determining which model has a better internal representation of the
distribution, we actually cannot tell, unless we have some way of determining how
loose the bound for model B is. However, if we care about how well we can use
the model in practice, for example to perform anomaly detection, then it is fair to
say that a model is preferable based on a criterion specific to the practical task of
interest, e.g., based on ranking test examples and ranking criteria such as precision
and recall.
Another subtlety of evaluating generative models is that the evaluation metrics
are often hard research problems in and of themselves. It can be very difficult
to establish that models are being compared fairly. For example, suppose we use
AIS to estimate log Z in order to compute log p̃(x) − log Z for a new model we
have just invented. A computationally economical implementation of AIS may fail
to find several modes of the model distribution and underestimate Z, which will
result in us overestimating log p(x). It can thus be difficult to tell whether a high
likelihood estimate is due to a good model or a bad AIS implementation.
Other fields of machine learning usually allow for some variation in the pre-
processing of the data. For example, when comparing the accuracy of object
recognition algorithms, it is usually acceptable to preprocess the input images
slightly differently for each algorithm based on what kind of input requirements
it has. Generative modeling is different because changes in preprocessing, even
very small and subtle ones, are completely unacceptable. Any change to the input
data changes the distribution to be captured and fundamentally alters the task.
For example, multiplying the input by 0.1 will artificially increase likelihood by a
factor of 10.
Issues with preprocessing commonly arise when benchmarking generative models
on the MNIST dataset, one of the more popular generative modeling benchmarks.
MNIST consists of grayscale images. Some models treat MNIST images as points
in a real vector space, while others treat them as binary. Yet others treat the
grayscale values as probabilities for binary samples. It is essential to compare
real-valued models only to other real-valued models and binary-valued models only
to other binary-valued models. Otherwise the likelihoods measured are not on the
same space. For binary-valued models, the log-likelihood can be at most zero, while
for real-valued models it can be arbitrarily high, since it is the measurement of a
density. Among binary models, it is important to compare models using exactly
the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1
by thresholding at 0.5, or by drawing a random sample whose probability of being
1 is given by the gray pixel intensity. If we use the random binarization, we might
binarize the whole dataset once, or we might draw a different random example for
each step of training and then draw multiple samples for evaluation. Each of these
three schemes yields wildly different likelihood numbers, and when comparing
different models it is important that both models use the same binarization scheme
for training and for evaluation. In fact, researchers who apply a single random
binarization step share a file containing the results of the random binarization, so
that there is no difference in results based on different outcomes of the binarization
step.
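The three schemes are easy to state in code; a minimal numpy sketch (the names
are our own):

```python
import numpy as np

def binarize(images, scheme):
    # images: array of grayscale intensities in [0, 1].
    if scheme == "threshold":
        # Deterministic: 1 where the intensity exceeds 0.5.
        return (images > 0.5).astype(np.float32)
    if scheme == "fixed_sample":
        # One fixed random draw: the same seed every call, so the whole
        # dataset is binarized once and the result can be shared.
        rng = np.random.default_rng(0)
        return (rng.random(images.shape) < images).astype(np.float32)
    if scheme == "resample":
        # A fresh draw on every call, e.g., once per training step.
        rng = np.random.default_rng()
        return (rng.random(images.shape) < images).astype(np.float32)
    raise ValueError(scheme)
```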
Because being able to generate realistic samples from the data distribution
is one of the goals of a generative model, practitioners often evaluate generative
models by visually inspecting the samples. In the best case, this is done not by the
researchers themselves, but by experimental subjects who do not know the source
of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor
probabilistic model to produce very good samples. A common practice to verify
if the model only copies some of the training examples is illustrated in Fig. 16.1.
The idea is to show for some of the generated samples their nearest neighbor in
the training set, according to Euclidean distance in the space of
x
. This test is
intended to detect the case where the model overfits the training set and just
reproduces training instances. It is even possible to simultaneously underfit and
overfit yet still produce samples that individually look good. Imagine a generative
model trained on images of dogs and cats that simply learns to reproduce the
training images of dogs. Such a model has clearly overfit, because it does not
produce images that were not in the training set, but it has also underfit, because
it assigns no probability to the training images of cats. Yet a human observer
would judge each individual image of a dog to be high quality. In this simple
example, it would be easy for a human observer who can inspect many samples to
determine that the cats are absent. In more realistic settings, a generative model
trained on data with tens of thousands of modes may ignore a small number of
modes, and a human observer would not easily be able to inspect or remember
enough images to detect the missing variation.
Since the visual quality of samples is not a reliable guide, we often also
evaluate the log-likelihood that the model assigns to the test data, when this is
computationally feasible. Unfortunately, in some cases the likelihood seems not
to measure any attribute of the model that we really care about. For example,
real-valued models of MNIST can obtain arbitrarily high likelihood by assigning
arbitrarily low variance to background pixels that never change. Models and
algorithms that detect these constant features can reap unlimited rewards, even
though this is not a very useful thing to do. The potential to achieve a cost
approaching negative infinity is present for any kind of maximum likelihood
problem with real values, but it is especially problematic for generative models of
MNIST because so many of the output values are trivial to predict. This strongly
suggests a need for developing other ways of evaluating generative models.
Theis et al. (2015) review many of the issues involved in evaluating generative
models, including many of the ideas described above. They highlight the fact
that there are many different uses of generative models and that the choice of
metric must match the intended use of the model. For example, some generative
models are better at assigning high probability to most realistic points while other
generative models are better at rarely assigning high probability to unrealistic
points. These differences can result from whether a generative model is designed
to minimize D_KL(p_data ‖ p_model) or D_KL(p_model ‖ p_data), as illustrated in Fig. 3.6.
Unfortunately, even when we restrict the use of each metric to the task it is most
suited for, all of the metrics currently in use continue to have serious weaknesses.
One of the most important research topics in generative modeling is therefore not
just how to improve generative models, but how to design new techniques to
measure our progress.
20.15 Conclusion
Training generative models with hidden units is a powerful way to make models
understand the world represented in the given training data. By learning a model
p_model(x) and a representation p_model(h | x), a generative model can provide
answers to many inference problems about the relationships between input variables
in x and can provide many different ways of representing x by taking expectations
of h at different layers of the hierarchy. Generative models hold the promise to
provide AI systems with a framework for all of the many different intuitive concepts
they need to understand, and the ability to reason about these concepts in the
face of uncertainty. We hope that our readers will find new ways to make these
approaches more powerful and continue the journey to understanding the principles
that underlie learning and intelligence.