in a real vector space, while others treat them as binary. Yet others treat the
grayscale values as probabilities for binary samples. It is essential to compare
real-valued models only to other real-valued models and binary-valued models only
to other binary-valued models. Otherwise the likelihoods measured are not on the
same space. For binary-valued models, the log-likelihood can be at most zero, while
for real-valued models it can be arbitrarily high, since it is the measurement of a
density. Among binary models, it is important to compare models using exactly
the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1
by thresholding at 0.5, or by drawing a random sample whose probability of being
1 is given by the gray pixel intensity. If we use the random binarization, we might
binarize the whole dataset once, or we might draw a different random binarization
for each step of training and then draw multiple binarized samples for evaluation.
Each of these
three schemes yields wildly different likelihood numbers, and when comparing
different models it is important that both models use the same binarization scheme
for training and for evaluation. In fact, researchers who apply a single random
binarization step share a file containing the results of the random binarization, so
that there is no difference in results based on different outcomes of the binarization
step.
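As a concrete illustration, here is a minimal sketch of the three schemes, assuming
grayscale images stored as a NumPy array with values in [0, 1]; the array shapes and
variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gray = rng.random((60000, 784))   # stand-in for grayscale images in [0, 1]

# Scheme 1: deterministic binarization by thresholding at 0.5.
binary_thresholded = (gray >= 0.5).astype(np.float32)

# Scheme 2: a single random binarization, drawn once and then reused for
# both training and evaluation (in practice, shared as a file so that all
# researchers evaluate on exactly the same binary values).
binary_fixed_sample = (rng.random(gray.shape) < gray).astype(np.float32)

# Scheme 3: a fresh random binarization at every training step, with
# multiple binarized samples drawn again at evaluation time.
def sample_binary(batch, rng):
    """Set each pixel to 1 with probability equal to its gray intensity."""
    return (rng.random(batch.shape) < batch).astype(np.float32)
```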
Because being able to generate realistic samples from the data distribution
is one of the goals of a generative model, practitioners often evaluate generative
models by visually inspecting the samples. In the best case, this is done not by the
researchers themselves, but by experimental subjects who do not know the source
of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor
probabilistic model to produce very good samples. A common practice to verify
if the model only copies some of the training examples is illustrated in Fig. 16.1.
The idea is to show for some of the generated samples their nearest neighbor in
the training set, according to Euclidean distance in the space of x. This test is
intended to detect the case where the model overfits the training set and just
reproduces training instances.
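A minimal sketch of this check, assuming the generated samples and training images
are flattened into NumPy arrays of pixel values (the names and shapes here are only
illustrative), might look like the following.

```python
import numpy as np

def nearest_training_neighbors(samples, train_set):
    """For each generated sample, find its nearest neighbor in the training
    set according to Euclidean distance in pixel space (the space of x).

    samples:   array of shape (n_samples, n_pixels)
    train_set: array of shape (n_train, n_pixels)
    Returns the index of the nearest training example for each sample.
    """
    # Squared Euclidean distances between every sample and every training image.
    dists = (np.sum(samples ** 2, axis=1, keepdims=True)
             - 2.0 * samples @ train_set.T
             + np.sum(train_set ** 2, axis=1))
    return np.argmin(dists, axis=1)

# The generated samples can then be displayed side by side with
# train_set[nearest_training_neighbors(samples, train_set)] to check
# whether the model is merely copying training instances.
```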
It is even possible to simultaneously underfit and overfit yet still produce
samples that individually look good. Imagine a generative
model trained on images of dogs and cats that simply learns to reproduce the
training images of dogs. Such a model has clearly overfit, because it does not
produce images that were not in the training set, but it has also underfit, because
it assigns no probability to the training images of cats. Yet a human observer
would judge each individual image of a dog to be high quality. In this simple
example, it would be easy for a human observer who can inspect many samples to
determine that the cats are absent. In more realistic settings, a generative model
trained on data with tens of thousands of modes may ignore a small number of
modes, and a human observer would not easily be able to inspect or remember