in the ensemble, without needing any weight scaling.
So far we have described dropout purely as a means of performing efficient,
approximate bagging. However, there is another view of dropout that goes further
than this. Dropout trains not just a bagged ensemble of models, but an ensemble
of models that share hidden units. This means each hidden unit must be able to
perform well regardless of which other hidden units are in the model. Hidden units
must be prepared to be swapped and interchanged between models. Hinton et al.
(2012c) were inspired by an idea from biology: sexual reproduction, which involves
swapping genes between two different organisms, creates evolutionary pressure for
genes to become not merely good, but readily able to be swapped between different
organisms. Such genes and such features are very robust to changes in their
environment because they cannot incorrectly adapt to the unusual features
of any one organism or model. Dropout thus regularizes each hidden unit to be
not merely a good feature but a feature that is good in many contexts. Warde-
Farley et al. (2014) compared dropout training to training of large ensembles and
concluded that dropout offers additional improvements to generalization error
beyond those obtained by ensembles of independent models.
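To make the idea of an ensemble with shared hidden units concrete, here is a minimal NumPy sketch. Every name and size in it (the weight matrix W, the function sample_subnetwork_hidden, the keep probability of 0.5) is a hypothetical illustration, not code from the paper: each random binary mask carves a different "ensemble member" out of one shared set of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared set of weights; each "ensemble member" is defined only by
# which hidden units its binary mask keeps. Sizes are hypothetical.
W = rng.standard_normal((4, 8))   # 4 inputs -> 8 shared hidden units
x = rng.standard_normal(4)

def sample_subnetwork_hidden(x, W, keep_prob=0.5):
    """Hidden activations of one randomly sampled ensemble member."""
    h = np.maximum(0.0, x @ W)                # shared ReLU features
    mask = rng.random(h.shape) < keep_prob    # sample which units survive
    return h * mask                           # drop the rest

# Two draws give two different ensemble members built from the *same*
# parameters, so every hidden unit must work in many different contexts.
h_a = sample_subnetwork_hidden(x, W)
h_b = sample_subnetwork_hidden(x, W)
```

Because the masks change on every draw while W does not, no hidden unit can rely on the permanent presence of any other unit, which is the sense in which features are forced to be "readily swapped" between models.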
It is important to understand that a large portion of the power of dropout
arises from the fact that the masking noise is applied to the hidden units. This
can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For
example, if the model learns a hidden unit $h_i$ that detects a face by finding the nose,
then dropping $h_i$ corresponds to erasing the information that there is a nose in
the image. The model must learn another $h_i$, either that redundantly encodes the
presence of a nose, or that detects the face by another feature, such as the mouth.
Traditional noise injection techniques that add unstructured noise at the input are
not able to randomly erase the information about a nose from an image of a face
unless the magnitude of the noise is so great that nearly all of the information in
the image is removed. Destroying extracted features rather than original values
allows the destruction process to make use of all of the knowledge about the input
distribution that the model has acquired so far.
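The contrast between destroying an extracted feature and perturbing raw inputs can be sketched in a few lines of NumPy. The arrays below are hypothetical stand-ins (a random x for an image, a random W for learned detectors), chosen only to show the structural difference between the two kinds of corruption:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(64)           # stand-in for an input image
W = rng.standard_normal((64, 16))     # stand-in for learned feature detectors

h = np.maximum(0.0, x @ W)            # extracted features; imagine h[i] is a nose detector

# Dropout on the hidden layer: zeroing h[i] deletes an entire learned
# concept ("there is a nose") from the representation in one step.
mask = rng.random(h.shape) < 0.5
h_dropped = h * mask

# Unstructured additive noise on the raw input perturbs every feature a
# little; to fully erase one concept it would have to be so large that
# it destroyed nearly all of the information in x.
x_noisy = x + 0.1 * rng.standard_normal(x.shape)
h_noisy = np.maximum(0.0, x_noisy @ W)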
Another important aspect of dropout is that the noise is multiplicative. If the
noise were additive with fixed scale, then a rectified linear hidden unit $h_i$ with
added noise $\epsilon$ could simply learn to have $h_i$ become very large in order to make
the added noise $\epsilon$ insignificant by comparison. Multiplicative noise does not allow
such a pathological solution to the noise robustness problem.
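A short numerical sketch, with an invented activation vector h and noise scale, illustrates why scaling up defeats additive noise but not multiplicative noise:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = rng.standard_normal()                  # additive noise with fixed scale
h = np.array([1.0, 10.0, 100.0, 1000.0])     # a unit making itself ever larger

# Additive noise: the relative corruption |eps| / h shrinks as h grows,
# so the unit can escape the noise simply by becoming very large.
relative_additive = np.abs(eps) / h

# Multiplicative dropout noise: the mask removes the same *fraction* of
# the signal no matter how large h is, so scaling up buys nothing.
mask = rng.random(h.shape) < 0.5
relative_multiplicative = np.abs(h * mask - h) / h   # equals |mask - 1|

print(relative_additive)        # decreases toward 0 as h grows
print(relative_multiplicative)  # 0 or 1 regardless of the size of h
```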
Another deep learning algorithm, batch normalization, reparametrizes the
model in a way that introduces both additive and multiplicative noise on the