CHAPTER 19. APPROXIMATE INFERENCE
this expense by learning to perform approximate inference. Specifically, we can
think of the optimization process as a function $f$ that maps an input $v$ to an
approximate distribution $q^* = \arg\max_q \mathcal{L}(v, q)$. Once we think of the multi-step
iterative optimization process as just being a function, we can approximate it with
a neural network that implements an approximation $\hat{f}(v; \theta)$.
19.5.1 Wake-Sleep
One of the main difficulties with training a model to infer $h$ from $v$ is that we
do not have a supervised training set with which to train the model. Given a $v$,
we do not know the appropriate $h$. The mapping from $v$ to $h$ depends on the
choice of model family, and evolves throughout the learning process as $\theta$ changes.
The wake-sleep algorithm (Hinton et al., 1995b; Frey et al., 1996) resolves this
problem by drawing samples of both $h$ and $v$ from the model distribution. For
example, in a directed model, this can be done cheaply by performing ancestral
sampling beginning at $h$ and ending at $v$. The inference network can then be
trained to perform the reverse mapping: predicting which $h$ caused the present
$v$. The main drawback of this approach is that we will only be able to train the
inference network on values of $v$ that have high probability under the model. Early
in learning, the model distribution will not resemble the data distribution, so the
inference network will not have an opportunity to learn on samples that resemble
data.
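The sleep phase just described can be sketched in a few lines. The following is a toy illustration, not the original algorithm's full detail: it assumes a fixed directed model with a factorial Bernoulli prior over binary $h$ and a sigmoid-Bernoulli decoder $p(v \mid h)$ (the generative weights `W_gen` are random placeholders), and it trains the inference network $q(h \mid v)$ by gradient ascent on $\log q(h \mid v)$ using the sampled $(h, v)$ pairs as supervised data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_hidden, n_visible = 4, 6  # illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical directed model p(h, v): factorial Bernoulli prior on h
# and a sigmoid-Bernoulli decoder p(v | h) with fixed random weights.
prior = np.full(n_hidden, 0.5)
W_gen = rng.normal(scale=1.0, size=(n_visible, n_hidden))

# Inference network q(h | v), trained during the sleep phase.
W_inf = np.zeros((n_hidden, n_visible))
b_inf = np.zeros(n_hidden)

lr = 0.1
for step in range(200):
    # "Sleep": ancestral sampling from the model, beginning at h
    # and ending at v.
    h = (rng.random(n_hidden) < prior).astype(float)
    v = (rng.random(n_visible) < sigmoid(W_gen @ h)).astype(float)

    # Train q to predict the h that generated v: gradient of the
    # Bernoulli log-likelihood log q(h | v) with respect to the logits.
    q = sigmoid(W_inf @ v + b_inf)
    grad = h - q
    W_inf += lr * np.outer(grad, v)
    b_inf += lr * grad
```

Because every training pair is drawn from the model itself, this loop exhibits exactly the drawback noted above: early in learning, the sampled $v$ need not resemble real data.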
In Sec. 18.2 we saw that one possible explanation for the role of dream sleep in
human beings and animals is that dreams could provide the negative phase samples
that Monte Carlo training algorithms use to approximate the negative gradient of
the log partition function of undirected models. Another possible explanation for
biological dreaming is that it provides samples from $p(h, v)$ which can be used
to train an inference network to predict $h$ given $v$. In some senses, this explanation
is more satisfying than the partition function explanation. Monte Carlo algorithms
generally do not perform well if they are run using only the positive phase of the
gradient for several steps and then using only the negative phase of the gradient for
several steps. Human beings and animals are usually awake for several consecutive
hours and then asleep for several consecutive hours. It is not readily apparent how this
schedule could support Monte Carlo training of an undirected model. Learning
algorithms based on maximizing $\mathcal{L}$ can, however, be run with prolonged periods of
improving $q$ and prolonged periods of improving $\theta$. If the role of biological dreaming
is to train networks for predicting $q$, then this explains how animals are able to
remain awake for several hours (the longer they are awake, the greater the gap
between $\mathcal{L}$ and $\log p(v)$, but $\mathcal{L}$ will remain a lower bound) and to remain asleep