measured after the fact but is often difficult to predict ahead of time. When we
perform unsupervised and supervised learning simultaneously, instead of using the
pretraining strategy, there is a single hyperparameter, usually a coefficient attached
to the unsupervised cost, that determines how strongly the unsupervised objective
will regularize the supervised model. One can always predictably obtain less
regularization by decreasing this coefficient. In the case of unsupervised pretraining,
there is not a way of flexibly adapting the strength of the regularization—either
the supervised model is initialized to pretrained parameters, or it is not.
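To make the contrast concrete, here is a minimal sketch of simultaneous supervised and unsupervised training, written in PyTorch with illustrative layer sizes and names (encoder, decoder, classifier, lam) that do not come from the text. The single coefficient lam on the reconstruction cost plays the role described above: decreasing it predictably weakens the unsupervised regularization.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared encoder with two heads: a decoder for the unsupervised
# reconstruction cost and a classifier for the supervised cost.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Linear(256, 784)     # unsupervised head
classifier = nn.Linear(256, 10)   # supervised head

params = (list(encoder.parameters()) + list(decoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.SGD(params, lr=0.1)

lam = 0.1  # coefficient on the unsupervised cost; smaller means less regularization

def joint_step(x, y):
    h = encoder(x)
    supervised_cost = F.cross_entropy(classifier(h), y)
    unsupervised_cost = F.mse_loss(decoder(h), x)
    loss = supervised_cost + lam * unsupervised_cost  # single combined objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()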
Another disadvantage of having two separate training phases is that each phase
has its own hyperparameters. The performance of the second phase usually cannot
be predicted during the first phase, so there is a long delay between proposing
hyperparameters for the first phase and being able to update them using feedback
from the second phase. The most principled approach is to use validation set error
in the supervised phase in order to select the hyperparameters of the pretraining
phase, as discussed in Larochelle et al. (2009). In practice, some hyperparameters,
like the number of pretraining iterations, are more conveniently set during the
pretraining phase, using early stopping on the unsupervised objective, which is
not ideal but computationally much cheaper than using the supervised objective.
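As a sketch of this cheaper heuristic, the function below (written in PyTorch style; unsupervised_epoch and unsupervised_loss are hypothetical helpers, not from the text) chooses the number of pretraining iterations by early stopping on the unsupervised validation objective, leaving the remaining pretraining hyperparameters to be selected later using supervised validation error.

import copy

def pretrain_with_early_stopping(model, unlabeled_train, unlabeled_valid,
                                 max_epochs=100, patience=5):
    # Phase 1 only: stop when the unsupervised validation loss has not
    # improved for `patience` consecutive epochs.
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        unsupervised_epoch(model, unlabeled_train)              # hypothetical helper
        valid_loss = unsupervised_loss(model, unlabeled_valid)  # hypothetical helper
        if valid_loss < best_loss:
            best_loss = valid_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)  # keep the best pretrained parameters
    return model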
Today, unsupervised pretraining has been largely abandoned, except in the
field of natural language processing, where the natural representation of words as
one-hot vectors conveys no similarity information and where very large unlabeled
sets are available. In that case, the advantage of pretraining is that one can pretrain
once on a huge unlabeled set (for example with a corpus containing billions of
words), learn a good representation (typically of words, but also of sentences), and
then use this representation or fine-tune it for a supervised task for which the
training set contains substantially fewer examples. This approach was pioneered
by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al.
(2011a) and remains in common use today.
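A minimal sketch of this reuse pattern follows (PyTorch; the vocabulary size, embedding dimension, and the random tensor standing in for representations learned on a large unlabeled corpus are placeholders, not values from the text). The pretrained word representation initializes the embedding layer of a small supervised model, which can either freeze it or fine-tune it on the smaller labeled set.

import torch
import torch.nn as nn

# Placeholder for word vectors learned once on a huge unlabeled corpus.
pretrained_vectors = torch.randn(50000, 300)

class TextClassifier(nn.Module):
    def __init__(self, pretrained, num_classes, freeze=False):
        super().__init__()
        # Initialize from the pretrained representation; freeze=True uses it
        # as a fixed feature extractor, freeze=False fine-tunes it.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=freeze)
        self.out = nn.Linear(pretrained.size(1), num_classes)

    def forward(self, token_ids):
        h = self.embed(token_ids).mean(dim=1)  # average word vectors over the sequence
        return self.out(h)

model = TextClassifier(pretrained_vectors, num_classes=2)  # fine-tuned on the small labeled set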
Deep learning techniques based on supervised learning, regularized with dropout
or batch normalization, are able to achieve human-level performance on many
tasks, but only with extremely large labeled datasets. These same techniques
outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10
and MNIST, which have roughly 5,000 labeled examples per class. On extremely
small datasets, such as the alternative splicing dataset, Bayesian methods outper-
form methods based on unsupervised pretraining (Srivastava, 2013). For these
reasons, the popularity of unsupervised pretraining has declined. Nevertheless,
unsupervised pretraining remains an important milestone in the history of deep
learning research and continues to influence contemporary approaches. The idea of