this approach is that in feedforward networks, activations and gradients can grow
or shrink on each step of forward or back-propagation, following a random walk
behavior. This is because feedforward networks use a different weight matrix at
each layer. If this random walk is tuned to preserve norms, then feedforward
networks can mostly avoid the vanishing and exploding gradients problem that
arises when the same weight matrix is used at each step, described in Sec. 8.2.5.
Unfortunately, these optimal criteria for initial weights often do not lead to
optimal performance. This may be for three different reasons. First, we may
be using the wrong criteria—it may not actually be beneficial to preserve the
norm of a signal throughout the entire network. Second, the properties imposed
at initialization may not persist once learning begins. Third, the
criteria might succeed at improving the speed of optimization but inadvertently
increase generalization error. In practice, we usually need to treat the scale of the
weights as a hyperparameter whose optimal value lies somewhere roughly near but
not exactly equal to the theoretical predictions.
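As a minimal sketch of this practice (the function name, defaults, and candidate
gains below are illustrative assumptions, not values from the text), one can draw
weights with standard deviation gain/√m and treat the gain as the per-layer scale
hyperparameter to be searched near its theoretical value of 1:

    import numpy as np

    def scaled_gaussian_init(m, n, gain=1.0, seed=0):
        """Draw an n x m weight matrix with standard deviation gain / sqrt(m).

        gain = 1.0 corresponds to the theoretical norm-preserving scale; in
        practice gain is treated as a per-layer hyperparameter searched near,
        but not necessarily equal to, 1.
        """
        rng = np.random.default_rng(seed)
        return rng.normal(loc=0.0, scale=gain / np.sqrt(m), size=(n, m))

    # A coarse search over the scale hyperparameter around the theoretical value.
    for gain in (0.5, 1.0, 2.0):
        W = scaled_gaussian_init(m=784, n=500, gain=gain)
        print(f"gain={gain}: empirical weight std = {W.std():.4f}")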
One drawback to scaling rules that set all of the initial weights to have the same
standard deviation, such as 1/√m, is that every individual weight becomes extremely
small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to
have exactly k non-zero weights. The idea is to keep the total amount of input to
the unit independent of the number of inputs m without making the magnitude
of individual weight elements shrink with m. Sparse initialization helps to achieve
more diversity among the units at initialization time. However, it also imposes
a very strong prior on the weights that are chosen to have large Gaussian values.
Because it takes a long time for gradient descent to shrink “incorrect” large values,
this initialization scheme can cause problems for units such as maxout units that
have several filters that must be carefully coordinated with each other.
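A minimal sketch of sparse initialization follows, assuming Gaussian non-zero
values; the function name, the choice of k, and the unit standard deviation are
illustrative assumptions rather than values prescribed by Martens (2010):

    import numpy as np

    def sparse_init(m, n, k=15, std=1.0, seed=0):
        """Each of the n units receives exactly k non-zero incoming weights,
        drawn from a Gaussian whose scale does not depend on the fan-in m."""
        rng = np.random.default_rng(seed)
        W = np.zeros((n, m))
        for i in range(n):
            idx = rng.choice(m, size=k, replace=False)  # k distinct inputs per unit
            W[i, idx] = rng.normal(0.0, std, size=k)
        return W

    W = sparse_init(m=10000, n=100)
    print((W != 0).sum(axis=1))      # exactly k non-zeros per unit, regardless of m
    print(np.abs(W[W != 0]).mean())  # weight magnitudes do not shrink as m grows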
When computational resources allow it, it is usually a good idea to treat the
initial scale of the weights for each layer as a hyperparameter, and to choose these
scales using a hyperparameter search algorithm described in Sec. 11.4.2, such
as random search. The choice of whether to use dense or sparse initialization
can also be made a hyperparameter. Alternatively, one can manually search for
the best initial scales. A good rule of thumb for choosing the initial scales is to
look at the range or standard deviation of activations or gradients on a single
minibatch of data. If the weights are too small, the range of activations across the
minibatch will shrink as the activations propagate forward through the network.
By repeatedly identifying the first layer with unacceptably small activations and
increasing its weights, it is possible to eventually obtain a network with reasonable
initial activations throughout.
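A minimal sketch of this manual procedure is given below, assuming tanh hidden
units; the acceptability threshold and the rescaling factor are illustrative
choices rather than values given in the text:

    import numpy as np

    def activation_stds(weights, x):
        """Forward-propagate a minibatch x and record the standard deviation
        of the activations at each layer (tanh units assumed for illustration)."""
        stds = []
        h = x
        for W in weights:
            h = np.tanh(h @ W.T)
            stds.append(h.std())
        return stds

    def rescale_small_layers(weights, x, threshold=0.1, factor=1.5, max_iters=100):
        """Repeatedly find the first layer whose activation spread falls below
        `threshold` and multiply its weights by `factor`."""
        for _ in range(max_iters):
            stds = activation_stds(weights, x)
            small = [i for i, s in enumerate(stds) if s < threshold]
            if not small:
                break
            weights[small[0]] *= factor
        return weights

    # Example: three layers initialized too small; the heuristic scales them up.
    rng = np.random.default_rng(0)
    sizes = [(200, 100), (200, 200), (200, 200)]   # (n_out, n_in) per layer
    weights = [0.001 * rng.standard_normal(s) for s in sizes]
    x = rng.standard_normal((64, 100))             # one minibatch of data
    print(activation_stds(weights, x))             # very small spread initially
    weights = rescale_small_layers(weights, x)
    print(activation_stds(weights, x))             # reasonable spread after rescaling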