in the ensemble, without needing any weight scaling.
So far we have described dropout purely as a means of performing efficient,
approximate bagging. However, there is another view of dropout that goes further
than this. Dropout trains not just a bagged ensemble of models, but an ensemble
of models that share hidden units. This means each hidden unit must be able to
perform well regardless of which other hidden units are in the model. Hidden units
must be prepared to be swapped and interchanged between models. Hinton et al.
(2012c) were inspired by an idea from biology: sexual reproduction, which involves
swapping genes between two different organisms, creates evolutionary pressure for
genes to become not merely good, but readily able to be swapped between different
organisms. Such genes and such features are very robust to changes in their
environment because they cannot incorrectly adapt to the unusual features
of any one organism or model. Dropout thus regularizes each hidden unit to be
not merely a good feature but a feature that is good in many contexts. Warde-
Farley et al. (2014) compared dropout training to training of large ensembles and
concluded that dropout offers additional improvements to generalization error
beyond those obtained by ensembles of independent models.
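To make the idea of an ensemble with shared hidden units concrete, here is a minimal NumPy sketch. Every name and size in it (the weight matrix W, the function sample_subnetwork_hidden, the keep probability of 0.5) is a hypothetical illustration, not code from the paper: each random binary mask carves a different "ensemble member" out of one shared set of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared set of weights; each "ensemble member" is defined only by
# which hidden units its binary mask keeps. Sizes are hypothetical.
W = rng.standard_normal((4, 8))   # 4 inputs -> 8 shared hidden units
x = rng.standard_normal(4)

def sample_subnetwork_hidden(x, W, keep_prob=0.5):
    """Hidden activations of one randomly sampled ensemble member."""
    h = np.maximum(0.0, x @ W)                # shared ReLU features
    mask = rng.random(h.shape) < keep_prob    # sample which units survive
    return h * mask                           # drop the rest

# Two draws give two different ensemble members built from the *same*
# parameters, so every hidden unit must work in many different contexts.
h_a = sample_subnetwork_hidden(x, W)
h_b = sample_subnetwork_hidden(x, W)
```

Because the masks change on every draw while W does not, no hidden unit can rely on the permanent presence of any other unit, which is the sense in which features are forced to be "readily swapped" between models.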
It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit $h_i$ that detects a face by finding the nose,
then dropping $h_i$ corresponds to erasing the information that there is a nose in
the image. The model must learn another $h_i$, either that redundantly encodes the
presence of a nose, or that detects the face by another feature, such as the mouth.
Traditional noise injection techniques that add unstructured noise at the input are
not able to randomly erase the information about a nose from an image of a face
unless the magnitude of the noise is so great that nearly all of the information in
the image is removed. Destroying extracted features rather than original values
allows the destruction process to make use of all of the knowledge about the input
distribution that the model has acquired so far.
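The contrast between destroying an extracted feature and perturbing raw inputs can be sketched in a few lines of NumPy. The arrays below are hypothetical stand-ins (a random x for an image, a random W for learned detectors), chosen only to show the structural difference between the two kinds of corruption:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)           # stand-in for an input image
W = rng.standard_normal((64, 16))     # stand-in for learned feature detectors

h = np.maximum(0.0, x @ W)            # extracted features; imagine h[i] is a nose detector

# Dropout on the hidden layer: zeroing h[i] deletes an entire learned
# concept ("there is a nose") from the representation in one step.
mask = rng.random(h.shape) < 0.5
h_dropped = h * mask

# Unstructured additive noise on the raw input perturbs every feature a
# little; to fully erase one concept it would have to be so large that
# it destroyed nearly all of the information in x.
x_noisy = x + 0.1 * rng.standard_normal(x.shape)
h_noisy = np.maximum(0.0, x_noisy @ W)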
Another important aspect of dropout is that the noise is multiplicative. If the
noise were additive with fixed scale, then a rectified linear hidden unit $h_i$ with
added noise $\epsilon$ could simply learn to have $h_i$ become very large in order to make
the added noise $\epsilon$ insignificant by comparison. Multiplicative noise does not allow
such a pathological solution to the noise robustness problem.
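A short numerical sketch, with an invented activation vector h and noise scale, illustrates why scaling up defeats additive noise but not multiplicative noise:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = rng.standard_normal()                  # additive noise with fixed scale
h = np.array([1.0, 10.0, 100.0, 1000.0])     # a unit making itself ever larger

# Additive noise: the relative corruption |eps| / h shrinks as h grows,
# so the unit can escape the noise simply by becoming very large.
relative_additive = np.abs(eps) / h

# Multiplicative dropout noise: the mask removes the same *fraction* of
# the signal no matter how large h is, so scaling up buys nothing.
mask = rng.random(h.shape) < 0.5
relative_multiplicative = np.abs(h * mask - h) / h   # equals |mask - 1|

print(relative_additive)        # decreases toward 0 as h grows
print(relative_multiplicative)  # 0 or 1 regardless of the size of h
```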
Another deep learning algorithm, batch normalization, reparametrizes the
model in a way that introduces both additive and multiplicative noise on the