stacked to initialize deeper models like DBNs or DBMs. However, CD does not
provide much help for training deeper models directly. This is because it is difficult
to obtain samples of the hidden units given samples of the visible units. Since the
hidden units are not included in the data, initializing from training points cannot
solve the problem. Even if we initialize the visible units from the data, we will still
need to burn in a Markov chain sampling from the distribution over the hidden
units conditioned on those visible samples.
The CD algorithm can be thought of as penalizing the model for having a
Markov chain that changes the input rapidly when the input comes from the data.
This means training with CD somewhat resembles autoencoder training. Even
though CD is more biased than some of the other training methods, it can be
useful for pretraining shallow models that will later be stacked. This is because
the earliest models in the stack are encouraged to copy more information up to
their latent variables, thereby making it available to the later models. This should
be thought of more as an often-exploitable side effect of CD training than as a
principled design advantage.
Sutskever and Tieleman (2010) showed that the CD update direction is not the
gradient of any function. This allows for situations where CD could cycle forever,
but in practice this is not a serious problem.
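For concreteness, the following is a minimal sketch of a CD-k update for a binary RBM, in which the negative-phase Gibbs chain is reinitialized at the data at every gradient step. The function and variable names (cd_k_update, W, b, c) are illustrative only and do not correspond to any particular implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v_data, k=1, lr=1e-2, rng=np.random):
    """One CD-k update for a binary RBM (illustrative sketch).
    W: weights, b: visible biases, c: hidden biases, v_data: a minibatch."""
    # Positive phase: hidden-unit probabilities given the data.
    h_data = sigmoid(v_data @ W + c)

    # Negative phase: k steps of block Gibbs sampling, initialized at the data.
    v = v_data.copy()
    for _ in range(k):
        h = (rng.random(h_data.shape) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)
    h_model = sigmoid(v @ W + c)

    # CD update direction: positive statistics minus negative statistics.
    n = v_data.shape[0]
    W += lr * (v_data.T @ h_data - v.T @ h_model) / n
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (h_data - h_model).mean(axis=0)
    return W, b, c
```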
A different strategy that resolves many of the problems with CD is to initialize
the Markov chains at each gradient step with their states from the previous gradient
step. This approach was first discovered under the name stochastic maximum
likelihood (SML) in the applied mathematics and statistics community (Younes,
1998) and later independently rediscovered under the name persistent contrastive
divergence (PCD, or PCD-k to indicate the use of k Gibbs steps per update) in
the deep learning community (Tieleman, 2008). See Algorithm 18.3. The basic
idea of this approach is that, so long as the steps taken by the stochastic gradient
algorithm are small, the model from the previous step will be similar to the
model from the current step. It follows that the samples from the previous model’s
distribution will be very close to being fair samples from the current model’s
distribution, so a Markov chain initialized with these samples will not require much
time to mix.
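A minimal sketch of the corresponding SML/PCD update, under the same illustrative binary-RBM assumptions as the CD sketch above, might look as follows; the only change is that the negative-phase chain state is stored in a persistent buffer and carried over between gradient steps rather than being reset to the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PersistentChains:
    """Negative-phase Gibbs chains that persist across gradient steps
    (hypothetical sketch for a binary RBM)."""
    def __init__(self, n_chains, n_visible, rng=np.random):
        self.v = (rng.random((n_chains, n_visible)) < 0.5).astype(float)
        self.rng = rng

def pcd_k_update(W, b, c, v_data, chains, k=1, lr=1e-2):
    rng = chains.rng
    # Positive phase uses the data, exactly as in CD.
    h_data = sigmoid(v_data @ W + c)

    # Negative phase: continue the persistent chains from their previous
    # states instead of restarting them at the data.
    v = chains.v
    for _ in range(k):
        h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)
    chains.v = v  # carry the chain state over to the next gradient step
    h_model = sigmoid(v @ W + c)

    W += lr * (v_data.T @ h_data / v_data.shape[0] - v.T @ h_model / v.shape[0])
    b += lr * (v_data.mean(axis=0) - v.mean(axis=0))
    c += lr * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return W, b, c
```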
Because each Markov chain is continually updated throughout the learning
process, rather than restarted at each gradient step, the chains are free to wander
far enough to find all of the model’s modes. SML is thus considerably more
resistant to forming models with spurious modes than CD is. Moreover, because
it is possible to store the state of all of the sampled variables, whether visible or
latent, SML provides an initialization point for both the hidden and visible units.