Chapter 6

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multi-

layer perceptrons (MLPs), are the quintessential deep learning models. The goal

of a feedforward network is to approximate some function

∗

. For example, for

a classiﬁer,

∗

(

) maps an input

to a category

. A feedforward network

deﬁnes a mapping

(

;

) and learns the value of the parameters

that result

in the best function approximation.

These models are called feedforward because information ﬂows through the

function being evaluated from

, through the intermediate computations used to

deﬁne

, and ﬁnally to the output

. There are no feedback connections in which

outputs of the model are fed back into itself. When feedforward neural networks

are extended to include feedback connections, they are called recurrent neural

networks, presented in Chapter 10.

Feedforward networks are of extreme importance to machine learning practi-

tioners. They form the basis of many important commercial applications. For

example, the convolutional networks used for object recognition from photos are a

specialized kind of feedforward network. Feedforward networks are a conceptual

stepping stone on the path to recurrent networks, which power many natural

language applications.

Feedforward neural networks are called networks because they are typically rep-

resented by composing together many diﬀerent functions. The model is associated

with a directed acyclic graph describing how the functions are composed together.

For example, we might have three functions

(1)

(2)

, and

(3)

connected in a

chain, to form

(

) =

(3)

(

(2)

(

(1)

(

))). These chain structures are the most

commonly used structures of neural networks. In this case,

(1)

is called the ﬁrst

layer of the network,

(2)

is called the second layer, and so on. The overall length

167

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

of the chain gives the depth of the model. It is from this terminology that the

name “deep learning” arises. The ﬁnal layer of a feedforward network is called the

output layer. During neural network training, we drive

(

) to match

∗

(

). The

training data provides us with noisy, approximate examples of

∗

(

) evaluated at

diﬀerent training points. Each example

is accompanied by a label

y ≈ f

∗

(

The training examples specify directly what the output layer must do at each point

; it must produce a value that is close to

. The behavior of the other layers is

not directly speciﬁed by the training data. The learning algorithm must decide

how to use those layers to produce the desired output, but the training data does

not say what each individual layer should do. Instead, the learning algorithm must

decide how to use these layers to best implement an approximation of

∗

. Because

the training data does not show the desired output for each of these layers, these

layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by

neuroscience. Each hidden layer of the network is typically vector-valued. The

dimensionality of these hidden layers determines the width of the model. Each

element of the vector may be interpreted as playing a role analogous to a neuron.

Rather than thinking of the layer as representing a single vector-to-vector function,

we can also think of the layer as consisting of many units that act in parallel,

each representing a vector-to-scalar function. Each unit resembles a neuron in

the sense that it receives input from many other units and computes its own

activation value. The idea of using many layers of vector-valued representation

is drawn from neuroscience. The choice of the functions

(i)

(

) used to compute

these representations is also loosely guided by neuroscientiﬁc observations about

the functions that biological neurons compute. However, modern neural network

research is guided by many mathematical and engineering disciplines, and the

goal of neural networks is not to perfectly model the brain. It is best to think of

feedforward networks as function approximation machines that are designed to

achieve statistical generalization, occasionally drawing some insights from what we

know about the brain, rather than as models of brain function.

One way to understand feedforward networks is to begin with linear models

and consider how to overcome their limitations. Linear models, such as logistic

regression and linear regression, are appealing because they may be ﬁt eﬃciently

and reliably, either in closed form or with convex optimization. Linear models also

have the obvious defect that the model capacity is limited to linear functions, so

the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of

, we can apply

the linear model not to

itself but to a transformed input

(

), where

is a

168

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

nonlinear transformation. Equivalently, we can apply the kernel trick described in

Sec. 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying

the

mapping. We can think of

as providing a set of features describing

, or

as providing a new representation for x.

The question is then how to choose the mapping φ.

One option is to use a very generic

, such as the inﬁnite-dimensional

that

is implicitly used by kernel machines based on the RBF kernel. If

(

) is

of high enough dimension, we can always have enough capacity to ﬁt the

training set, but generalization to the test set often remains poor. Very

generic feature mappings are usually based only on the principle of local

smoothness and do not encode enough prior information to solve advanced

problems.

Another option is to manually engineer

. Until the advent of deep learning,

this was the dominant approach. This approach requires decades of human

eﬀort for each separate task, with practitioners specializing in diﬀerent

domains such as speech recognition or computer vision, and with little

transfer between domains.

The strategy of deep learning is to learn

. In this approach, we have a model

(

;

θ, w

) =

(

;

)

. We now have parameters

that we use to learn

from a broad class of functions, and parameters

that map from

(

) to

the desired output. This is an example of a deep feedforward network, with

deﬁning a hidden layer. This approach is the only one of the three that

gives up on the convexity of the training problem, but the beneﬁts outweigh

the harms. In this approach, we parametrize the representation as

(

;

)

and use the optimization algorithm to ﬁnd the

that corresponds to a good

representation. If we wish, this approach can capture the beneﬁt of the ﬁrst

approach by being highly generic—we do so by using a very broad family

(

;

). This approach can also capture the beneﬁt of the second approach.

Human practitioners can encode their knowledge to help generalization by

designing families

(

;

) that they expect will perform well. The advantage

is that the human designer only needs to ﬁnd the right general function

family rather than ﬁnding precisely the right function.

This general principle of improving models by learning features extends beyond

the feedforward networks described in this chapter. It is a recurring theme of deep

learning that applies to all of the kinds of models described throughout this book.

Feedforward networks are the application of this principle to learning deterministic

169

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

mappings from

that lack feedback connections. Other models presented

later will apply these principles to learning stochastic mappings, learning functions

with feedback, and learning probability distributions over a single vector.

We begin this chapter with a simple example of a feedforward network. Next,

we address each of the design decisions needed to deploy a feedforward network.

First, training a feedforward network requires making many of the same design

decisions as are necessary for a linear model: choosing the optimizer, the cost

function, and the form of the output units. We review these basics of gradient-based

learning, then proceed to confront some of the design decisions that are unique

to feedforward networks. Feedforward networks have introduced the concept of a

hidden layer, and this requires us to choose the activation functions that will be

used to compute the hidden layer values. We must also design the architecture of

the network, including how many layers the network should contain, how these

networks should be connected to each other, and how many units should be in

each layer. Learning in deep neural networks requires computing the gradients of

complicated functions. We present the back-propagation algorithm and its modern

generalizations, which can be used to eﬃciently compute these gradients. Finally,

we close with some historical perspective.

6.1 Example: Learning XOR

To make the idea of a feedforward network more concrete, we begin with an

example of a fully functioning feedforward network on a very simple task: learning

the XOR function.

The XOR function (“exclusive or”) is an operation on two binary values,

and

. When exactly one of these binary values is equal to 1, the XOR function

returns 1. Otherwise, it returns 0. The XOR function provides the target function

∗

(

) that we want to learn. Our model provides a function

(

;

) and

our learning algorithm will adapt the parameters

to make

as similar as possible

to f

∗

In this simple example, we will not be concerned with statistical generalization.

We want our network to perform correctly on the four points

{

, [0

, and [1

}

. We will train the network on all four of these points. The

only challenge is to ﬁt the training set.

We can treat this problem as a regression problem and use a mean squared error

loss function. We choose this loss function to simplify the math for this example

as much as possible. We will see later that there are other, more appropriate

170

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

approaches for modeling binary data.

Evaluated on our whole training set, the MSE loss function is

J(θ) =

x∈X

∗

(x) −f(x; θ))

. (6.1)

Now we must choose the form of our model,

(

;

). Suppose that we choose

a linear model, with θ consisting of w and b. Our model is deﬁned to be

f(x; w, b) = x

w + b. (6.2)

We can minimize

(

) in closed form with respect to

and

using the normal

equations.

After solving the normal equations, we obtain

and

. The linear

model simply outputs 0

5 everywhere. Why does this happen? Fig. 6.1 shows how

a linear model is not able to represent the XOR function. One way to solve this

problem is to use a model that learns a diﬀerent feature space in which a linear

model is able to represent the solution.

Speciﬁcally, we will introduce a very simple feedforward network with one

hidden layer containing two hidden units. See Fig. 6.2 for an illustration of

this model. This feedforward network has a vector of hidden units

that are

computed by a function

(1)

(

;

W , c

). The values of these hidden units are then

used as the input for a second layer. The second layer is the output layer of the

network. The output layer is still just a linear regression model, but now it is

applied to

rather than to

. The network now contains two functions chained

together:

(1)

(

;

W , c

) and

(2)

(

;

w, b

), with the complete model being

f(x; W , c, w, b) = f

(2)

(1)

(x)).

What function should

(1)

compute? Linear models have served us well so far,

and it may be tempting to make

(1)

be linear as well. Unfortunately, if

(1)

were

linear, then the feedforward network as a whole would remain a linear function of

its input. Ignoring the intercept terms for the moment, suppose

(1)

(

) =

and

(2)

(

) =

. Then

(

) =

. We could represent this function as

f(x) = x

where w

= W w.

Clearly, we must use a nonlinear function to describe the features. Most neural

networks do so using an aﬃne transformation controlled by learned parameters,

followed by a ﬁxed, nonlinear function called an activation function. We use that

strategy here, by deﬁning

(

)

where

provides the weights of a

linear transformation and

the biases. Previously, to describe a linear regression

model, we used a vector of weights and a scalar bias parameter to describe an

171

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

0 1

Original x Space

0 1 2

Learned h Space

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers

printed on the plot indicate the value that the learned function must output at each point.

(Left) A linear model applied directly to the original input cannot implement the XOR

function. When

= 0, the model’s output must increase as

increases. When

= 1,

the model’s output must decrease as

increases. A linear model must apply a ﬁxed

coeﬃcient

. The linear model therefore cannot use the value of

to change

the coeﬃcient on

and cannot solve this problem. (Right) In the transformed space

represented by the features extracted by a neural network, a linear model can now solve

the problem. In our example solution, the two points that must have output 1 have been

collapsed into a single point in feature space. In other words, the nonlinear features have

mapped both

= [1

and

= [0

to a single point in feature space,

= [1

The linear model can now describe the function as increasing in

and decreasing in

In this example, the motivation for learning the feature space is only to make the model

capacity greater so that it can ﬁt the training set. In more realistic applications, learned

representations can also help the model to generalize.

172

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Figure 6.2: An example of a feedforward network, drawn in two diﬀerent styles. Speciﬁcally,

this is the feedforward network we use to solve the XOR example. It has a single hidden

layer containing two units. (Left) In this style, we draw every unit as a node in the

graph. This style is very explicit and unambiguous but for networks larger than this

example it can consume too much space. (Right) In this style, we draw a node in the

graph for each entire vector representing a layer’s activations. This style is much more

compact. Sometimes we annotate the edges in this graph with the name of the parameters

that describe the relationship between two layers. Here, we indicate that a matrix

describes the mapping from

, and a vector

describes the mapping from

We typically omit the intercept parameters associated with each layer when labeling this

kind of drawing.

aﬃne transformation from an input vector to an output scalar. Now, we describe

an aﬃne transformation from a vector

to a vector

, so an entire vector of bias

parameters is needed. The activation function

is typically chosen to be a function

that is applied element-wise, with

(

:,i

). In modern neural networks,

the default recommendation is to use the rectiﬁed linear unit or ReLU (Jarrett

et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) deﬁned by the activation

function g(z) = max{0, z} depicted in Fig. 6.3.

We can now specify our complete network as

f(x; W , c, w, b) = w

max{0, W

x + c} + b. (6.3)

We can now specify a solution to the XOR problem. Let

W =



1 1



, (6.4)

c =



−1



, (6.5)

w =



−2



, (6.6)

173

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

g(z) = max{0, z}

The Rectiﬁed Linear Activation Function

Figure 6.3: The rectiﬁed linear activation function. This activation function is the default

activation function recommended for use with most feedforward neural networks. Applying

this function to the output of a linear transformation yields a nonlinear transformation.

However, the function remains very close to linear, in the sense that is a piecewise linear

function with two linear pieces. Because rectiﬁed linear units are nearly linear, they

preserve many of the properties that make linear models easy to optimize with gradient-

based methods. They also preserve many of the properties that make linear models

generalize well. A common principle throughout computer science is that we can build

complicated systems from minimal components. Much as a Turing machine’s memory

needs only to be able to store 0 or 1 states, we can build a universal function approximator

from rectiﬁed linear functions.

174

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

and b = 0.

We can now walk through the way that the model processes a batch of inputs.

Let

be the design matrix containing all four points in the binary input space,

with one example per row:

X =







0 0

0 1

1 0

1 1







. (6.7)

The ﬁrst step in the neural network is to multiply the input matrix by the ﬁrst

layer’s weight matrix:

XW =







0 0

1 1

2 2







. (6.8)

Next, we add the bias vector c, to obtain







0 −1

1 0

2 1







. (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along

this line, the output needs to begin at 0, then rise to 1, then drop back down to 0.

A linear model cannot implement such a function. To ﬁnish computing the value

of h for each example, we apply the rectiﬁed linear transformation:







0 0

1 0

2 1







. (6.10)

This transformation has changed the relationship between the examples. They no

longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a

linear model can solve the problem.

We ﬁnish by multiplying by the weight vector w:













. (6.11)

175

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

The neural network has obtained the correct answer for every example in the batch.

In this example, we simply speciﬁed the solution, then showed that it obtained

zero error. In a real situation, there might be billions of model parameters and

billions of training examples, so one cannot simply guess the solution as we did

here. Instead, a gradient-based optimization algorithm can ﬁnd parameters that

produce very little error. The solution we described to the XOR problem is at a

global minimum of the loss function, so gradient descent could converge to this

point. There are other equivalent solutions to the XOR problem that gradient

descent could also ﬁnd. The convergence point of gradient descent depends on the

initial values of the parameters. In practice, gradient descent would usually not

ﬁnd clean, easily understood, integer-valued solutions like the one we presented

here.

6.2 Gradient-Based Learning

Designing and training a neural network is not much diﬀerent from training any

other machine learning model with gradient descent. In Sec. 5.10, we described

how to build a machine learning algorithm by specifying an optimization procedure,

a cost function, and a model family.

The largest diﬀerence between the linear models we have seen so far and neural

networks is that the nonlinearity of a neural network causes most interesting loss

functions to become non-convex. This means that neural networks are usually

trained by using iterative, gradient-based optimizers that merely drive the cost

function to a very low value, rather than the linear equation solvers used to train

linear regression models or the convex optimization algorithms with global conver-

gence guarantees used to train logistic regression or SVMs. Convex optimization

converges starting from any initial parameters (in theory—in practice it is very

robust but can encounter numerical problems). Stochastic gradient descent applied

to non-convex loss functions has no such convergence guarantee, and is sensitive

to the values of the initial parameters. For feedforward neural networks, it is

important to initialize all weights to small random values. The biases may be

initialized to zero or to small positive values. The iterative gradient-based opti-

mization algorithms used to train feedforward networks and almost all other deep

models will be described in detail in Chapter 8, with parameter initialization in

particular discussed in Sec. 8.4. For the moment, it suﬃces to understand that

the training algorithm is almost always based on using the gradient to descend the

cost function in one way or another. The speciﬁc algorithms are improvements

and reﬁnements on the ideas of gradient descent, introduced in Sec. 4.3, and,

176

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

more speciﬁcally, are most often improvements of the stochastic gradient descent

algorithm, introduced in Sec. 5.9.

We can of course, train models such as linear regression and support vector

machines with gradient descent too, and in fact this is common when the training

set is extremely large. From this point of view, training a neural network is not

much diﬀerent from training any other model. Computing the gradient is slightly

more complicated for a neural network, but can still be done eﬃciently and exactly.

Sec. 6.5 will describe how to obtain the gradient using the back-propagation

algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we

must choose a cost function, and we must choose how to represent the output of

the model. We now revisit these design considerations with special emphasis on

the neural networks scenario.

6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the

cost function. Fortunately, the cost functions for neural networks are more or less

the same as those for other parametric models, such as linear models.

In most cases, our parametric model deﬁnes a distribution

(

y | x

;

) and

we simply use the principle of maximum likelihood. This means we use the

cross-entropy between the training data and the model’s predictions as the cost

function.

Sometimes, we take a simpler approach, where rather than predicting a complete

probability distribution over

, we merely predict some statistic of

conditioned

on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one

of the primary cost functions described here with a regularization term. We have

already seen some simple examples of regularization applied to linear models in Sec.

5.2.2. The weight decay approach used for linear models is also directly applicable

to deep neural networks and is among the most popular regularization strategies.

More advanced regularization strategies for neural networks will be described in

Chapter 7.

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means

that the cost function is simply the negative log-likelihood, equivalently described

177

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

as the cross-entropy between the training data and the model distribution. This

cost function is given by

J(θ) = −E

x,y∼ˆp

data

log p

model

(y | x). (6.12)

The speciﬁc form of the cost function changes from model to model, depending

on the speciﬁc form of

log p

model

. The expansion of the above equation typically

yields some terms that do not depend on the model parameters and may be

discarded. For example, as we saw in Sec. 5.5.1, if

model

(

y | x

) =

(

;

(

;

)

, I

then we recover the mean squared error cost,

J(θ) =

x,y∼ˆp

data

||y − f(x; θ)||

+ const, (6.13)

up to a scaling factor of

and a term that does not depend on θ. The discarded

constant is based on the variance of the Gaussian distribution, which in this case

we chose not to parametrize. Previously, we saw that the equivalence between

maximum likelihood estimation with an output distribution and minimization of

mean squared error holds for a linear model, but in fact, the equivalence holds

regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum

likelihood is that it removes the burden of designing cost functions for each model.

Specifying a model

(

y | x

) automatically determines a cost function

log p

(

y | x

One recurring theme throughout neural network design is that the gradient of

the cost function must be large and predictable enough to serve as a good guide

for the learning algorithm. Functions that saturate (become very ﬂat) undermine

this objective because they make the gradient become very small. In many cases

this happens because the activation functions used to produce the output of the

hidden units or the output units saturate. The negative log-likelihood helps to

avoid this problem for many models. Many output units involve an

exp

function

that can saturate when its argument is very negative. The

log

function in the

negative log-likelihood cost function undoes the

exp

of some output units. We will

discuss the interaction between the cost function and the choice of output unit in

Sec. 6.2.2.

One unusual property of the cross-entropy cost used to perform maximum

likelihood estimation is that it usually does not have a minimum value when applied

to the models commonly used in practice. For discrete output variables, most

models are parametrized in such a way that they cannot represent a probability

of zero or one, but can come arbitrarily close to doing so. Logistic regression

is an example of such a model. For real-valued output variables, if the model

178

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

can control the density of the output distribution (for example, by learning the

variance parameter of a Gaussian output distribution) then it becomes possible

to assign extremely high density to the correct training set outputs, resulting in

cross-entropy approaching negative inﬁnity. Regularization techniques described

in Chapter 7 provide several diﬀerent ways of modifying the learning problem so

that the model cannot reap unlimited reward in this way.

6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution

(

y | x

;

) we often want to learn

just one conditional statistic of y given x.

For example, we may have a predictor

(

;

) that we wish to predict the mean

of y.

If we use a suﬃciently powerful neural network, we can think of the neural

network as being able to represent any function

from a wide class of functions,

with this class being limited only by features such as continuity and boundedness

rather than by having a speciﬁc parametric form. From this point of view, we

can view the cost function as being a functional rather than just a function. A

functional is a mapping from functions to real numbers. We can thus think of

learning as choosing a function rather than merely choosing a set of parameters.

We can design our cost functional to have its minimum occur at some speciﬁc

function we desire. For example, we can design the cost functional to have its

minimum lie on the function that maps

to the expected value of

given

Solving an optimization problem with respect to a function requires a mathematical

tool called calculus of variations, described in Sec. 19.4.2. It is not necessary to

understand calculus of variations to understand the content of this chapter. At

the moment, it is only necessary to understand that calculus of variations may be

used to derive the following two results.

Our ﬁrst result derived using calculus of variations is that solving the optimiza-

tion problem

∗

= arg min

x,y∼p

data

||y − f(x)||

(6.14)

yields

∗

(x) = E

y∼p

data

(y|x)

[y], (6.15)

so long as this function lies within the class we optimize over. In other words, if we

could train on inﬁnitely many samples from the true data-generating distribution,

minimizing the mean squared error cost function gives a function that predicts the

mean of y for each value of x.

179

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Diﬀerent cost functions give diﬀerent statistics. A second result derived using

calculus of variations is that

∗

= arg min

x,y∼p

data

||y − f(x)||

(6.16)

yields a function that predicts the

median

value of

for each

, so long as such a

function may be described by the family of functions we optimize over. This cost

function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor

results when used with gradient-based optimization. Some output units that

saturate produce very small gradients when combined with these cost functions.

This is one reason that the cross-entropy cost function is more popular than mean

squared error or mean absolute error, even when it is not necessary to estimate an

entire distribution p(y | x).

6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most

of the time, we simply use the cross-entropy between the data distribution and the

model distribution. The choice of how to represent the output then determines

the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be

used as a hidden unit. Here, we focus on the use of these units as outputs of the

model, but in principle they can be used internally as well. We revisit these units

with additional detail about their use as hidden units in Sec. 6.3.

Throughout this section, we suppose that the feedforward network provides a

set of hidden features deﬁned by

(

;

). The role of the output layer is then

to provide some additional transformation from the features to complete the task

that the network must perform.

6.2.2.1 Linear Units for Gaussian Output Distributions

One simple kind of output unit is an output unit based on an aﬃne transformation

with no nonlinearity. These are often just called linear units.

Given features

, a layer of linear output units produces a vector

Linear output layers are often used to produce the mean of a conditional

Gaussian distribution:

p(y | x) = N(y;

y, I). (6.17)

180

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Maximizing the log-likelihood is then equivalent to minimizing the mean squared

error.

The maximum likelihood framework makes it straightforward to learn the

covariance of the Gaussian too, or to make the covariance of the Gaussian be a

function of the input. However, the covariance must be constrained to be a positive

deﬁnite matrix for all inputs. It is diﬃcult to satisfy such constraints with a linear

output layer, so typically other output units are used to parametrize the covariance.

Approaches to modeling the covariance are described shortly, in Sec. 6.2.2.4.

Because linear units do not saturate, they pose little diﬃculty for gradient-

based optimization algorithms and may be used with a wide variety of optimization

algorithms.

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable

. Classiﬁcation

problems with two classes can be cast in this form.

The maximum-likelihood approach is to deﬁne a Bernoulli distribution over

conditioned on x.

A Bernoulli distribution is deﬁned by just a single number. The neural net

needs to predict only

(

= 1

| x

). For this number to be a valid probability, it

must lie in the interval [0, 1].

Satisfying this constraint requires some careful design eﬀort. Suppose we were

to use a linear unit, and threshold its value to obtain a valid probability:

P (y = 1 | x) = max

0, min

1, w

h + b

. (6.18)

This would indeed deﬁne a valid conditional distribution, but we would not be able

to train it very eﬀectively with gradient descent. Any time that

strayed

outside the unit interval, the gradient of the output of the model with respect to

its parameters would be

. A gradient of

is typically problematic because the

learning algorithm no longer has a guide for how to improve the corresponding

parameters.

Instead, it is better to use a diﬀerent approach that ensures there is always a

strong gradient whenever the model has the wrong answer. This approach is based

on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is deﬁned by

ˆy = σ



h + b



(6.19)

181

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

where σ is the logistic sigmoid function described in Sec. 3.10.

We can think of the sigmoid output unit as having two components. First, it

uses a linear layer to compute

. Next, it uses the sigmoid activation

function to convert z into a probability.

We omit the dependence on

for the moment to discuss how to deﬁne a

probability distribution over

using the value

. The sigmoid can be motivated

by constructing an unnormalized probability distribution

(

), which does not

sum to 1. We can then divide by an appropriate constant to obtain a valid

probability distribution. If we begin with the assumption that the unnormalized log

probabilities are linear in

and

, we can exponentiate to obtain the unnormalized

probabilities. We then normalize to see that this yields a Bernoulli distribution

controlled by a sigmoidal transformation of z:

log

P (y) = yz (6.20)

P (y) = exp(yz) (6.21)

P (y) =

exp(yz)

exp(y

(6.22)

P (y) = σ ((2y − 1)z) . (6.23)

Probability distributions based on exponentiation and normalization are common

throughout the statistical modeling literature. The

variable deﬁning such a

distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use

with maximum likelihood learning. Because the cost function used with maximum

likelihood is

−log P

(

y | x

), the

log

in the cost function undoes the

exp

of the

sigmoid. Without this eﬀect, the saturation of the sigmoid could prevent gradient-

based learning from making good progress. The loss function for maximum

likelihood learning of a Bernoulli parametrized by a sigmoid is

J(θ) = −log P (y | x) (6.24)

= −log σ ((2y − 1)z) (6.25)

= ζ ((1 − 2y)z) . (6.26)

This derivation makes use of some properties from Sec. 3.10. By rewriting

the loss in terms of the softplus function, we can see that it saturates only when

−

)

is very negative. Saturation thus occurs only when the model already

has the right answer—when

= 1 and

is very positive, or

= 0 and

is very

negative. When

has the wrong sign, the argument to the softplus function,

182

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

−

)

, may be simpliﬁed to

|z|

. As

|z|

becomes large while

has the wrong sign,

the softplus function asymptotes toward simply returning its argument

|z|

. The

derivative with respect to

asymptotes to

sign

(

), so, in the limit of extremely

incorrect

, the softplus function does not shrink the gradient at all. This property

is very useful because it means that gradient-based learning can act to quickly

correct a mistaken z.

When we use other loss functions, such as mean squared error, the loss can

saturate anytime

(

) saturates. The sigmoid activation function saturates to 0

when

becomes very negative and saturates to 1 when

becomes very positive.

The gradient can shrink too small to be useful for learning whenever this happens,

whether the model has the correct answer or the incorrect answer. For this reason,

maximum likelihood is almost always the preferred approach to training sigmoid

output units.

Analytically, the logarithm of the sigmoid is always deﬁned and ﬁnite, because

the sigmoid returns values restricted to the open interval (0

1), rather than using

the entire closed interval of valid probabilities [0

1]. In software implementations,

to avoid numerical problems, it is best to write the negative log-likelihood as a

function of

, rather than as a function of

ˆy

(

). If the sigmoid function

underﬂows to zero, then taking the logarithm of ˆy yields negative inﬁnity.

6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable

with

possible values, we may use the softmax function. This can be seen as a

generalization of the sigmoid function which was used to represent a probability

distribution over a binary variable.

Softmax functions are most often used as the output of a classiﬁer, to represent

the probability distribution over

diﬀerent classes. More rarely, softmax functions

can be used inside the model itself, if we wish the model to choose between one of

n diﬀerent options for some internal variable.

In the case of binary variables, we wished to produce a single number

ˆy = P (y = 1 | x). (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the

logarithm of the number to be well-behaved for gradient-based optimization of

the log-likelihood, we chose to instead predict a number

log

(

= 1

| x

Exponentiating and normalizing gave us a Bernoulli distribution controlled by the

sigmoid function.

183

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

To generalize to the case of a discrete variable with

values, we now need

to produce a vector

, with

ˆy

(

i | x

). We require not only that each

element of

ˆy

be between 0 and 1, but also that the entire vector sums to 1 so that

it represents a valid probability distribution. The same approach that worked for

the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear

layer predicts unnormalized log probabilities:

z = W

h + b, (6.28)

where

log

(

i | x

)

The softmax function can then exponentiate and

normalize z to obtain the desired

y. Formally, the softmax function is given by

softmax(z)

exp(z

)

exp(z

)

. (6.29)

As with the logistic sigmoid, the use of the

exp

function works very well when

training the softmax to output a target value

using maximum log-likelihood. In

this case, we wish to maximize

log P

(

;

) =

log softmax

(

)

. Deﬁning the

softmax in terms of

exp

is natural because the

log

in the log-likelihood can undo

the exp of the softmax:

log softmax(z)

= z

− log

exp(z

). (6.30)

The ﬁrst term of Eq. 6.30 shows that the input

always has a direct con-

tribution to the cost function. Because this term cannot saturate, we know that

learning can proceed, even if the contribution of

to the second term of Eq. 6.30

becomes very small. When maximizing the log-likelihood, the ﬁrst term encourages

to be pushed up, while the second term encourages all of

to be pushed down.

To gain some intuition for the second term,

log

exp

(

), observe that this term

can be roughly approximated by

max

. This approximation is based on the idea

that

exp

(

) is insigniﬁcant for any

that is noticeably less than

max

. The

intuition we can gain from this approximation is that the negative log-likelihood

cost function always strongly penalizes the most active incorrect prediction. If the

correct answer already has the largest input to the softmax, then the

−z

term

and the

log

exp

(

)

≈ max

terms will roughly cancel. This example

will then contribute little to the overall training cost, which will be dominated by

other examples that are not yet correctly classiﬁed.

So far we have discussed only a single example. Overall, unregularized maximum

likelihood will drive the model to learn parameters that drive the softmax to predict

184

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

the fraction of counts of each outcome observed in the training set:

softmax(z(x; θ))

≈

j=1

(j)

=i,x

(j)

j=1

(j)

. (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen

so long as the model family is capable of representing the training distribution. In

practice, limited model capacity and imperfect optimization will mean that the

model is only able to approximate these fractions.

Many objective functions other than the log-likelihood do not work as well

with the softmax function. Speciﬁcally, objective functions that do not use a

log

undo the

exp

of the softmax fail to learn when the argument to the

exp

becomes

very negative, causing the gradient to vanish. In particular, squared error is a

poor loss function for softmax units, and can fail to train the model to change its

output, even when the model makes highly conﬁdent incorrect predictions (Bridle,

1990). To understand why these other loss functions can fail, we need to examine

the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has

a single output that saturates when its input is extremely negative or extremely

positive. In the case of the softmax, there are multiple output values. These

output values can saturate when the diﬀerences between input values become

extreme. When the softmax saturates, many cost functions based on the softmax

also saturate, unless they are able to invert the saturating activating function.

To see that the softmax function responds to the diﬀerence between its inputs,

observe that the softmax output is invariant to adding the same scalar to all of its

inputs:

softmax(z) = softmax(z + c). (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

softmax(z) = softmax(z − max

). (6.33)

The reformulated version allows us to evaluate softmax with only small numerical

errors even when

contains extremely large or extremely negative numbers. Ex-

amining the numerically stable variant, we see that the softmax function is driven

by the amount that its arguments deviate from max

An output

softmax

(

)

saturates to 1 when the corresponding input is maximal

(

max

) and

is much greater than all of the other inputs. The output

softmax

(

)

can also saturate to 0 when

is not maximal and the maximum is

much greater. This is a generalization of the way that sigmoid units saturate, and

185

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

can cause similar diﬃculties for learning if the loss function is not designed to

compensate for it.

The argument

to the softmax function can be produced in two diﬀerent ways.

The most common is simply to have an earlier layer of the neural network output

every element of

, as described above using the linear layer

. While

straightforward, this approach actually overparametrizes the distribution. The

constraint that the

outputs must sum to 1 means that only

n −

1 parameters are

necessary; the probability of the

-th value may be obtained by subtracting the

ﬁrst

n−

1 probabilities from 1. We can thus impose a requirement that one element

be ﬁxed. For example, we can require that

= 0. Indeed, this is exactly

what the sigmoid unit does. Deﬁning

(

= 1

| x

) =

(

) is equivalent to deﬁning

(

= 1

| x

) =

softmax

(

)

with a two-dimensional

and

= 0. Both the

n −

argument and the

argument approaches to the softmax can describe the same

set of probability distributions, but have diﬀerent learning dynamics. In practice,

there is rarely much diﬀerence between using the overparametrized version or the

restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientiﬁc point of view, it is interesting to think of the softmax as

a way to create a form of competition between the units that participate in it: the

softmax outputs always sum to 1 so an increase in the value of one unit necessarily

corresponds to a decrease in the value of others. This is analogous to the lateral

inhibition that is believed to exist between nearby neurons in the cortex. At the

extreme (when the diﬀerence between the maximal

and the others is large in

magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1

and the others are nearly 0).

The name “softmax” can be somewhat confusing. The function is more closely

related to the argmax function than the max function. The term “soft” derives

from the fact that the softmax function is continuous and diﬀerentiable. The

argmax function, with its result represented as a one-hot vector, is not continuous

or diﬀerentiable. The softmax function thus provides a “softened” version of the

argmax. The corresponding soft version of the maximum function is

softmax

(

)

It would perhaps be better to call the softmax function “softargmax,” but the

current name is an entrenched convention.

6.2.2.4 Other Output Types

The linear, sigmoid, and softmax output units described above are the most

common. Neural networks can generalize to almost any kind of output layer that

we wish. The principle of maximum likelihood provides a guide for how to design

186

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

a good cost function for nearly any kind of output layer.

In general, if we deﬁne a conditional distribution

(

y | x

;

), the principle of

maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function

(

;

The outputs of this function are not direct predictions of the value

. Instead,

(

;

) =

provides the parameters for a distribution over

. Our loss function

can then be interpreted as −log p(y; ω(x)).

For example, we may wish to learn the variance of a conditional Gaussian for

, given

. In the simple case, where the variance

is a constant, there is a

closed form expression because the maximum likelihood estimator of variance is

simply the empirical mean of the squared diﬀerence between observations

and

their expected value. A computationally more expensive approach that does not

require writing special-case code is to simply include the variance as one of the

properties of the distribution

(

y | x

) that is controlled by

(

;

). The

negative log-likelihood

−log p

(

;

(

)) will then provide a cost function with the

appropriate terms necessary to make our optimization procedure incrementally

learn the variance. In the simple case where the standard deviation does not depend

on the input, we can make a new parameter in the network that is copied directly

into

. This new parameter might be

itself or could be a parameter

representing

or it could be a parameter

representing

, depending on how we choose to

parametrize the distribution. We may wish our model to predict a diﬀerent amount

of variance in

for diﬀerent values of

. This is called a heteroscedastic model.

In the heteroscedastic case, we simply make the speciﬁcation of the variance be

one of the values output by

(

;

). A typical way to do this is to formulate the

Gaussian distribution using precision, rather than variance, as described in Eq.

3.22. In the multivariate case it is most common to use a diagonal precision matrix

diag(β). (6.34)

This formulation works well with gradient descent because the formula for the

log-likelihood of the Gaussian distribution parametrized by

involves only mul-

tiplication by

and addition of

log β

. The gradient of multiplication, addition,

and logarithm operations is well-behaved. By comparison, if we parametrized the

output in terms of variance, we would need to use division. The division function

becomes arbitrarily steep near zero. While large gradients can help learning,

arbitrarily large gradients usually result in instability. If we parametrized the

output in terms of standard deviation, the log-likelihood would still involve division,

and would also involve squaring. The gradient through the squaring operation

can vanish near zero, making it diﬃcult to learn parameters that are squared.

187

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Regardless of whether we use standard deviation, variance, or precision, we must

ensure that the covariance matrix of the Gaussian is positive deﬁnite. Because

the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of

the covariance matrix, this is equivalent to ensuring that the precision matrix is

positive deﬁnite. If we use a diagonal matrix, or a scalar times the diagonal matrix,

then the only condition we need to enforce on the output of the model is positivity.

If we suppose that

is the raw activation of the model used to determine the

diagonal precision, we can use the softplus function to obtain a positive precision

vector:

(

)

This same strategy applies equally if using variance or standard

deviation rather than precision or if using a scalar times identity rather than

diagonal matrix.

It is rare to learn a covariance or precision matrix with richer structure than

diagonal. If the covariance is full and conditional, then a parametrization must

be chosen that guarantees positive-deﬁniteness of the predicted covariance matrix.

This can be achieved by writing

Σ(x) = B(x)B

(x)

, where

is an unconstrained

square matrix. One practical issue if the matrix is full rank is that computing the

likelihood is expensive, with a

d × d

matrix requiring

(

) computation for the

determinant and inverse of

(

) (or equivalently, and more commonly done, its

eigendecomposition or that of B(x)).

We often want to perform multimodal regression, that is, to predict real values

that come from a conditional distribution

(

y | x

) that can have several diﬀerent

peaks in

space for the same value of

. In this case, a Gaussian mixture is

a natural representation for the output (Jacobs et al., 1991; Bishop, 1994).

Neural networks with Gaussian mixtures as their output are often called mixture

density networks. A Gaussian mixture output with

components is deﬁned by

the conditional probability distribution

p(y | x) =

i=1

p(c = i | x)N(y; µ

(i)

(x), Σ

(i)

(x)). (6.35)

The neural network must have three outputs: a vector deﬁning

(

i | x

), a

matrix providing

(i)

(

) for all

, and a tensor providing

(i)

(

) for all

. These

outputs must satisfy diﬀerent constraints:

Mixture components

(

i | x

): these form a multinoulli distribution

over the

diﬀerent components associated with latent variable

, and can

We consider

to be latent because we do not observe it in the data: given input

and target

, it is not possible to know with certainty which Gaussian component was responsible for

, but

we can imagine that

was generated by picking one of them, and make that unobserved choice a

random variable.

188

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

typically be obtained by a softmax over an

-dimensional vector, to guarantee

that these outputs are positive and sum to 1.

Means

(i)

(

): these indicate the center or mean associated with the

-th

Gaussian component, and are unconstrained (typically with no nonlinearity

at all for these output units). If

is a

-vector, then the network must output

n × d

matrix containing all

of these

-dimensional vectors. Learning

these means with maximum likelihood is slightly more complicated than

learning the means of a distribution with only one output mode. We only

want to update the mean for the component that actually produced the

observation. In practice, we do not know which component produced each

observation. The expression for the negative log-likelihood naturally weights

each example’s contribution to the loss for each component by the probability

that the component produced the example.

Covariances

(i)

(

): these specify the covariance matrix for each component

. As when learning a single Gaussian component, we typically use a diagonal

matrix to avoid needing to compute determinants. As with learning the means

of the mixture, maximum likelihood is complicated by needing to assign

partial responsibility for each point to each mixture component. Gradient

descent will automatically follow the correct process if given the correct

speciﬁcation of the negative log-likelihood under the mixture model.

It has been reported that gradient-based optimization of conditional Gaussian

mixtures (on the output of neural networks) can be unreliable, in part because one

gets divisions (by the variance) which can be numerically unstable (when some

variance gets to be small for a particular example, yielding very large gradients).

One solution is to clip gradients (see Sec. 10.11.1) while another is to scale the

gradients heuristically (Murray and Larochelle, 2014).

Gaussian mixture outputs are particularly eﬀective in generative models of

speech (Schuster, 1999) or movements of physical objects (Graves, 2013). The

mixture density strategy gives a way for the network to represent multiple output

modes and to control the variance of its output, which is crucial for obtaining

a high degree of quality in these real-valued domains. An example of a mixture

density network is shown in Fig. 6.4.

In general, we may wish to continue to model larger vectors

containing more

variables, and to impose richer and richer structures on these output variables. For

example, we may wish for our neural network to output a sequence of characters

that forms a sentence. In these cases, we may continue to use the principle

of maximum likelihood applied to our model

(

;

(

)), but the model we use

189

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Figure 6.4: Samples drawn from a neural network with a mixture density output layer.

The input

is sampled from a uniform distribution and the output

is sampled from

model

(

y | x

). The neural network is able to learn nonlinear mappings from the input to

the parameters of the output distribution. These parameters include the probabilities

governing which of three mixture components will generate the output as well as the

parameters for each mixture component. Each mixture component is Gaussian with

predicted mean and variance. All of these aspects of the output distribution are able to

vary with respect to the input x, and to do so in nonlinear ways.

to describe

becomes complex enough to be beyond the scope of this chapter.

Chapter 10 describes how to use recurrent neural networks to deﬁne such models

over sequences, and Part III describes advanced techniques for modeling arbitrary

probability distributions.

6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that

are common to most parametric machine learning models trained with gradient-

based optimization. Now we turn to an issue that is unique to feedforward neural

networks: how to choose the type of hidden unit to use in the hidden layers of the

model.

The design of hidden units is an extremely active area of research and does not

yet have many deﬁnitive guiding theoretical principles.

Rectiﬁed linear units are an excellent default choice of hidden unit. Many other

types of hidden units are available. It can be diﬃcult to determine when to use

which kind (though rectiﬁed linear units are usually an acceptable choice). We

190

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

describe here some of the basic intuitions motivating each type of hidden units.

These intuitions can be used to suggest when to try out each of these units. It is

usually impossible to predict in advance which will work best. The design process

consists of trial and error, intuiting that a kind of hidden unit may work well,

and then training a network with that kind of hidden unit and evaluating its

performance on a validation set.

Some of the hidden units included in this list are not actually diﬀerentiable at

all input points. For example, the rectiﬁed linear function

(

) =

max{

, z}

is not

diﬀerentiable at

= 0. This may seem like it invalidates

for use with a gradient-

based learning algorithm. In practice, gradient descent still performs well enough

for these models to be used for machine learning tasks. This is in part because

neural network training algorithms do not usually arrive at a local minimum of

the cost function, but instead merely reduce its value signiﬁcantly, as shown in

Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not

expect training to actually reach a point where the gradient is

, it is acceptable

for the minima of the cost function to correspond to points with undeﬁned gradient.

Hidden units that are not diﬀerentiable are usually non-diﬀerentiable at only a

small number of points. In general, a function

(

) has a left derivative deﬁned

by the slope of the function immediately to the left of

and a right derivative

deﬁned by the slope of the function immediately to the right of

. A function

is diﬀerentiable at

only if both the left derivative and the right derivative are

deﬁned and equal to each other. The functions used in the context of neural

networks usually have deﬁned left derivatives and deﬁned right derivatives. In the

case of

(

) =

max{

, z}

, the left derivative at

= 0 is 0 and the right derivative

is 1. Software implementations of neural network training usually return one of

the one-sided derivatives rather than reporting that the derivative is undeﬁned or

raising an error. This may be heuristically justiﬁed by observing that gradient-

based optimization on a digital computer is subject to numerical error anyway.

When a function is asked to evaluate

(0), it is very unlikely that the underlying

value truly was 0. Instead, it was likely to be some small value



that was rounded

to 0. In some contexts, more theoretically pleasing justiﬁcations are available, but

these usually do not apply to neural network training. The important point is that

in practice one can safely disregard the non-diﬀerentiability of the hidden unit

activation functions described below.

Unless indicated otherwise, most hidden units can be described as accepting

a vector of inputs

, computing an aﬃne transformation

, and

then applying an element-wise nonlinear function

(

). Most hidden units are

distinguished from each other only by the choice of the form of the activation

function g(z).

191

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

6.3.1 Rectiﬁed Linear Units and Their Generalizations

Rectiﬁed linear units use the activation function g(z) = max{0, z}.

Rectiﬁed linear units are easy to optimize because they are so similar to linear

units. The only diﬀerence between a linear unit and a rectiﬁed linear unit is

that a rectiﬁed linear unit outputs zero across half its domain. This makes the

derivatives through a rectiﬁed linear unit remain large whenever the unit is active.

The gradients are not only large but also consistent. The second derivative of the

rectifying operation is 0 almost everywhere, and the derivative of the rectifying

operation is 1 everywhere that the unit is active. This means that the gradient

direction is far more useful for learning than it would be with activation functions

that introduce second-order eﬀects.

Rectiﬁed linear units are typically used on top of an aﬃne transformation:

h = g(W

x + b). (6.36)

When initializing the parameters of the aﬃne transformation, it can be a good

practice to set all elements of

to a small, positive value, such as 0

1. This makes

it very likely that the rectiﬁed linear units will be initially active for most inputs

in the training set and allow the derivatives to pass through.

Several generalizations of rectiﬁed linear units exist. Most of these general-

izations perform comparably to rectiﬁed linear units and occasionally perform

better.

One drawback to rectiﬁed linear units is that they cannot learn via gradient-

based methods on examples for which their activation is zero. A variety of

generalizations of rectiﬁed linear units guarantee that they receive gradient every-

where.

Three generalizations of rectiﬁed linear units are based on using a non-zero

slope

when

(

z, α

)

max

, z

) +

min

, z

). Absolute value

rectiﬁcation ﬁxes

−

1 to obtain

(

) =

|z|

. It is used for object recognition

from images (Jarrett et al., 2009), where it makes sense to seek features that are

invariant under a polarity reversal of the input illumination. Other generalizations

of rectiﬁed linear units are more broadly applicable. A leaky ReLU (Maas et al.,

2013) ﬁxes

to a small value like 0.01 while a parametric ReLU or PReLU treats

as a learnable parameter (He et al., 2015).

Maxout units (Goodfellow et al., 2013a) generalize rectiﬁed linear units further.

Instead of applying an element-wise function

(

), maxout units divide

into

groups of

values. Each maxout unit then outputs the maximum element of one

192

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

of these groups:

g(z)

= max

j∈G

(i)

(6.37)

where

(i)

is the indices of the inputs for group

{

(

i −

+ 1

, . . . , ik}

. This

provides a way of learning a piecewise linear function that responds to multiple

directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to

pieces.

Maxout units can thus be seen as

learning the activation function

itself rather

than just the relationship between units. With large enough

, a maxout unit can

learn to approximate any convex function with arbitrary ﬁdelity. In particular,

a maxout layer with two pieces can learn to implement the same function of the

input

as a traditional layer using the rectiﬁed linear activation function, absolute

value rectiﬁcation function, or the leaky or parametric ReLU, or can learn to

implement a totally diﬀerent function altogether. The maxout layer will of course

be parametrized diﬀerently from any of these other layer types, so the learning

dynamics will be diﬀerent even in the cases where maxout learns to implement the

same function of x as one of the other layer types.

Each maxout unit is now parametrized by

weight vectors instead of just one,

so maxout units typically need more regularization than rectiﬁed linear units. They

can work well without regularization if the training set is large and the number of

pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other beneﬁts. In some cases, one can gain some sta-

tistical and computational advantages by requiring fewer parameters. Speciﬁcally,

if the features captured by

diﬀerent linear ﬁlters can be summarized without

losing information by taking the max over each group of

features, then the next

layer can get by with k times fewer weights.

Because each unit is driven by multiple ﬁlters, maxout units have some redun-

dancy that helps them to resist a phenomenon called catastrophic forgetting in

which neural networks forget how to perform tasks that they were trained on in

the past (Goodfellow et al., 2014a).

Rectiﬁed linear units and all of these generalizations of them are based on the

principle that models are easier to optimize if their behavior is closer to linear.

This same general principle of using linear behavior to obtain easier optimization

also applies in other contexts besides deep linear networks. Recurrent networks can

learn from sequences and produce a sequence of states and outputs. When training

them, one needs to propagate information through several time steps, which is much

easier when some linear computations (with some directional derivatives being of

magnitude near 1) are involved. One of the best-performing recurrent network

193

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

architectures, the LSTM, propagates information through time via summation—a

particular straightforward kind of such linear activation. This is discussed further

in Sec. 10.10.

6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectiﬁed linear units, most neural networks used the

logistic sigmoid activation function

g(z) = σ(z) (6.38)

or the hyperbolic tangent activation function

g(z) = tanh(z). (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) −1.

We have already seen sigmoid units as output units, used to predict the

probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal

units saturate across most of their domain—they saturate to a high value when

is very positive, saturate to a low value when

is very negative, and are only

strongly sensitive to their input when

is near 0. The widespread saturation of

sigmoidal units can make gradient-based learning very diﬃcult. For this reason,

their use as hidden units in feedforward networks is now discouraged. Their use

as output units is compatible with the use of gradient-based learning when an

appropriate cost function can undo the saturation of the sigmoid in the output

layer.

When a sigmoidal activation function must be used, the hyperbolic tangent

activation function typically performs better than the logistic sigmoid. It resembles

the identity function more closely, in the sense that

tanh

(0) = 0 while

(0) =

Because

tanh

is similar to identity near 0, training a deep neural network

ˆy

tanh

(

tanh

(

)) resembles training a linear model

ˆy

long as the activations of the network can be kept small. This makes training the

tanh network easier.

Sigmoidal activation functions are more common in settings other than feed-

forward networks. Recurrent networks, many probabilistic models, and some

autoencoders have additional requirements that rule out the use of piecewise

linear activation functions and make sigmoidal units more appealing despite the

drawbacks of saturation.

194

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

6.3.3 Other Hidden Units

Many other types of hidden units are possible, but are used less frequently.

In general, a wide variety of diﬀerentiable functions perform perfectly well.

Many unpublished activation functions perform just as well as the popular ones.

To provide a concrete example, the authors tested a feedforward network using

cos

(

W x

) on the MNIST dataset and obtained an error rate of less than

1%, which is competitive with results obtained using more conventional activation

functions. During research and development of new techniques, it is common

to test many diﬀerent activation functions and ﬁnd that several variations on

standard practice perform comparably. This means that usually new hidden unit

types are published only if they are clearly demonstrated to provide a signiﬁcant

improvement. New hidden unit types that perform roughly comparably to known

types are so common as to be uninteresting.

It would be impractical to list all of the hidden unit types that have appeared

in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation

(

) at all. One can also think of

this as using the identity function as the activation function. We have already

seen that a linear unit can be useful as the output of a neural network. It may

also be used as a hidden unit. If every layer of the neural network consists of only

linear transformations, then the network as a whole will be linear. However, it

is acceptable for some layers of the neural network to be purely linear. Consider

a neural network layer with

inputs and

outputs,

(

). We may

replace this with two layers, with one layer using weight matrix

and the other

using weight matrix

. If the ﬁrst layer has no activation function, then we have

essentially factored the weight matrix of the original layer based on

. The

factored approach is to compute

(

). If

produces

outputs,

then

and

together contain only (

)

parameters, while

contains

parameters. For small

, this can be a considerable saving in parameters. It

comes at the cost of constraining the linear transformation to be low-rank, but

these low-rank relationships are often suﬃcient. Linear hidden units thus oﬀer an

eﬀective way of reducing the number of parameters in a network.

Softmax units are another kind of unit that is usually used as an output (as

described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax

units naturally represent a probability distribution over a discrete variable with

possible values, so they may be used as a kind of switch. These kinds of hidden

units are usually only used in more advanced architectures that explicitly learn to

manipulate memory, described in Sec. 10.12.

195

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

A few other reasonably common hidden unit types include:

•

Radial basis function or RBF unit:

exp



−

||W

:,i

− x||



. This

function becomes more active as

approaches a template

:,i

. Because it

saturates to 0 for most x, it can be diﬃcult to optimize.

•

Softplus:

(

) =

(

) =

log

(1 +

). This is a smooth version of the rectiﬁer,

introduced by Dugas et al. (2001) for function approximation and by Nair

and Hinton (2010) for the conditional distributions of undirected probabilistic

models. Glorot et al. (2011a) compared the softplus and rectiﬁer and found

better results with the latter. The use of the softplus is generally discouraged.

The softplus demonstrates that the performance of hidden unit types can

be very counterintuitive—one might expect it to have an advantage over

the rectiﬁer due to being diﬀerentiable everywhere or due to saturating less

completely, but empirically it does not.

•

Hard

tanh

: this is shaped similarly to the

tanh

and the rectiﬁer but unlike

the latter, it is bounded,

(

) =

max

(

−

, min

, a

)). It was introduced

by Collobert (2004).

Hidden unit design remains an active area of research and many useful hidden

unit types remain to be discovered.

6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture.

The word architecture refers to the overall structure of the network: how many

units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most

neural network architectures arrange these layers in a chain structure, with each

layer being a function of the layer that preceded it. In this structure, the ﬁrst layer

is given by

(1)

= g

(1)



(1)>

x + b

(1)



, (6.40)

the second layer is given by

(2)

= g

(2)



(2)>

(1)

+ b

(2)



, (6.41)

and so on.

196

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

In these chain-based architectures, the main architectural considerations are

to choose the depth of the network and the width of each layer. As we will see,

a network with even one hidden layer is suﬃcient to ﬁt the training set. Deeper

networks often are able to use far fewer units per layer and far fewer parameters

and often generalize to the test set, but are also often harder to optimize. The

ideal network architecture for a task must be found via experimentation guided by

monitoring the validation set error.

6.4.1 Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can

by deﬁnition represent only linear functions. It has the advantage of being easy to

train because many loss functions result in convex optimization problems when

applied to linear models. Unfortunately, we often want to learn nonlinear functions.

At ﬁrst glance, we might presume that learning a nonlinear function requires

designing a specialized model family for the kind of nonlinearity we want to learn.

Fortunately, feedforward networks with hidden layers provide a universal approxi-

mation framework. Speciﬁcally, the universal approximation theorem (Hornik et al.,

1989; Cybenko, 1989) states that a feedforward network with a linear output layer

and at least one hidden layer with any “squashing” activation function (such as

the logistic sigmoid activation function) can approximate any Borel measurable

function from one ﬁnite-dimensional space to another with any desired non-zero

amount of error, provided that the network is given enough hidden units. The

derivatives of the feedforward network can also approximate the derivatives of the

function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability

is beyond the scope of this book; for our purposes it suﬃces to say that any

continuous function on a closed and bounded subset of

is Borel measurable

and therefore may be approximated by a neural network. A neural network may

also approximate any function mapping from any ﬁnite dimensional discrete space

to another. While the original theorems were ﬁrst stated in terms of units with

activation functions that saturate both for very negative and for very positive

arguments, universal approximation theorems have also been proven for a wider

class of activation functions, which includes the now commonly used rectiﬁed linear

unit (Leshno et al., 1993).

The universal approximation theorem means that regardless of what function

we are trying to learn, we know that a large MLP will be able to

represent

this

function. However, we are not guaranteed that the training algorithm will be able

learn

that function. Even if the MLP is able to represent the function, learning

can fail for two diﬀerent reasons. First, the optimization algorithm used for training

197

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

may not be able to ﬁnd the value of the parameters that corresponds to the desired

function. Second, the training algorithm might choose the wrong function due to

overﬁtting. Recall from Sec. 5.2.1 that the “no free lunch” theorem shows that

there is no universally superior machine learning algorithm. Feedforward networks

provide a universal system for representing functions, in the sense that, given a

function, there exists a feedforward network that approximates the function. There

is no universal procedure for examining a training set of speciﬁc examples and

choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large

enough to achieve any degree of accuracy we desire, but the theorem does not

say how large this network will be. Barron (1993) provides some bounds on the

size of a single-layer network needed to approximate a broad class of functions.

Unfortunately, in the worse case, an exponential number of hidden units (possibly

with one hidden unit corresponding to each input conﬁguration that needs to be

distinguished) may be required. This is easiest to see in the binary case: the

number of possible binary functions on vectors

v ∈ {

}

is 2

and selecting

one such function requires 2

bits, which will in general require

) degrees of

freedom.

In summary, a feedforward network with a single layer is suﬃcient to represent

any function, but the layer may be infeasibly large and may fail to learn and

generalize correctly. In many circumstances, using deeper models can reduce the

number of units required to represent the desired function and can reduce the

amount of generalization error.

There exist families of functions which can be approximated eﬃciently by an

architecture with depth greater than some value

, but which require a much larger

model if depth is restricted to be less than or equal to

. In many cases, the number

of hidden units required by the shallow model is exponential in

. Such results

were ﬁrst proven for models that do not resemble the continuous, diﬀerentiable

neural networks used for machine learning, but have since been extended to these

models. The ﬁrst results were for circuits of logic gates (Håstad, 1986). Later

work extended these results to linear threshold units with non-negative weights

(Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with

continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern

neural networks use rectiﬁed linear units. Leshno et al. (1993) demonstrated

that shallow networks with a broad family of non-polynomial activation functions,

including rectiﬁed linear units, have universal approximation properties, but these

results do not address the questions of depth or eﬃciency—they specify only that

a suﬃciently wide rectiﬁer network could represent any function. Pascanu et al.

198

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

(2013b) and Montufar et al. (2014) showed that functions representable with a

deep rectiﬁer net can require an exponential number of hidden units with a shallow

(one hidden layer) network. More precisely, they showed that piecewise linear

networks (which can be obtained from rectiﬁer nonlinearities or maxout units) can

represent functions with a number of regions that is exponential in the depth of the

network. Fig. 6.5 illustrates how a network with absolute value rectiﬁcation creates

mirror images of the function computed on top of some hidden unit, with respect

to the input of that hidden unit. Each hidden unit speciﬁes where to fold the

input space in order to create mirror responses (on both sides of the absolute value

nonlinearity). By composing these folding operations, we obtain an exponentially

large number of piecewise linear regions which can capture all kinds of regular

(e.g., repeating) patterns.

Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper

rectiﬁer networks formally shown by Pascanu et al. (2014a) and by Montufar et al. (2014).

(Left) An absolute value rectiﬁcation unit has the same output for every pair of mirror

points in its input. The mirror axis of symmetry is given by the hyperplane deﬁned by the

weights and bias of the unit. A function computed on top of that unit (the green decision

surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center)

The function can be obtained by folding the space around the axis of symmetry. (Right)

Another repeating pattern can be folded on top of the ﬁrst (by another downstream unit)

to obtain another symmetry (which is now repeated four times, with two hidden layers).

More precisely, the main theorem in Montufar et al. (2014) states that the

number of linear regions carved out by a deep rectiﬁer network with

inputs,

depth l, and n units per hidden layer, is





d(l−1)

, (6.42)

i.e., exponential in the depth l. In the case of maxout networks with k ﬁlters per

unit, the number of linear regions is



(l−1)+d



. (6.43)

199

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Of course, there is no guarantee that the kinds of functions we want to learn in

applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time

we choose a speciﬁc machine learning algorithm, we are implicitly stating some

set of prior beliefs we have about what kind of function the algorithm should

learn. Choosing a deep model encodes a very general belief that the function we

want to learn should involve composition of several simpler functions. This can be

interpreted from a representation learning point of view as saying that we believe

the learning problem consists of discovering a set of underlying factors of variation

that can in turn be described in terms of other, simpler underlying factors of

variation. Alternately, we can interpret the use of a deep architecture as expressing

a belief that the function we want to learn is a computer program consisting of

multiple steps, where each step makes use of the previous step’s output. These

intermediate outputs are not necessarily factors of variation, but can instead be

analogous to counters or pointers that the network uses to organize its internal

processing. Empirically, greater depth does seem to result in better generalization

for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009;

Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al.,

2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow

et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some

of these empirical results. This suggests that using deep architectures does indeed

express a useful prior over the space of functions the model learns.

6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the

main considerations being the depth of the network and the width of each layer.

In practice, neural networks show considerably more diversity.

Many neural network architectures have been developed for speciﬁc tasks.

Specialized architectures for computer vision called convolutional networks are

described in Chapter 9. Feedforward networks may also be generalized to the

recurrent neural networks for sequence processing, described in Chapter 10, which

have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the

most common practice. Many architectures build a main chain but then add extra

architectural features to it, such as skip connections going from layer

to layer

+ 2 or higher. These skip connections make it easier for the gradient to ﬂow from

output layers to layers nearer the input.

200

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

3 4 5 6 7 8 9 10 11

Number of hidden layers

92.0

92.5

93.0

93.5

94.0

94.5

95.0

95.5

96.0

96.5

Test accuracy (%)

Effect of Depth

Figure 6.6: Empirical results showing that deeper networks generalize better when used

to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow

et al. (2014d). The test set accuracy consistently increases with increasing depth. See

Fig. 6.7 for a control experiment demonstrating that other increases to the model size do

not yield the same eﬀect.

201

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

0.0 0.2 0.4 0.6 0.8 1.0

Number of parameters

×10

Test accuracy (%)

Eﬀect of Number of Parameters

3, convolutional

3, fully connected

11, convolutional

Figure 6.7: Deeper models tend to perform better. This is not merely because the model is

larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number

of parameters in layers of convolutional networks without increasing their depth is not

nearly as eﬀective at increasing test set performance. The legend indicates the depth of

network used to make each curve and whether the curve represents variation in the size of

the convolutional or the fully connected layers. We observe that shallow models in this

context overﬁt at around 20 million parameters while deep ones can beneﬁt from having

over 60 million. This suggests that using a deep model expresses a useful preference over

the space of functions the model can learn. Speciﬁcally, it expresses a belief that the

function should consist of many simpler functions composed together. This could result

either in learning a representation that is composed in turn of simpler representations (e.g.,

corners deﬁned in terms of edges) or in learning a program with sequentially dependent

steps (e.g., ﬁrst locate a set of objects, then segment them from each other, then recognize

them).

202

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Another key consideration of architecture design is exactly how to connect a

pair of layers to each other. In the default neural network layer described by a linear

transformation via a matrix

, every input unit is connected to every output

unit. Many specialized networks in the chapters ahead have fewer connections, so

that each unit in the input layer is connected to only a small subset of units in

the output layer. These strategies for reducing the number of connections reduce

the number of parameters and the amount of computation required to evaluate

the network, but are often highly problem-dependent. For example, convolutional

networks, described in Chapter 9, use specialized patterns of sparse connections

that are very eﬀective for computer vision problems. In this chapter, it is diﬃcult

to give much more speciﬁc advice concerning the architecture of a generic neural

network. Subsequent chapters develop the particular architectural strategies that

have been found to work well for diﬀerent application domains.

6.5 Back-Propagation and Other Diﬀerentiation Algo-

rithms

When we use a feedforward neural network to accept an input

and produce an

output

, information ﬂows forward through the network. The inputs

provide

the initial information that then propagates up to the hidden units at each layer

and ﬁnally produces

. This is called forward propagation. During training,

forward propagation can continue onward until it produces a scalar cost

(

). The

back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop,

allows the information from the cost to then ﬂow backwards through the network,

in order to compute the gradient.

Computing an analytical expression for the gradient is straightforward, but

numerically evaluating such an expression can be computationally expensive. The

back-propagation algorithm does so using a simple and inexpensive procedure.

The term back-propagation is often misunderstood as meaning the whole

learning algorithm for multi-layer neural networks. Actually, back-propagation

refers only to the method for computing the gradient, while another algorithm,

such as stochastic gradient descent, is used to perform learning using this gradient.

Furthermore, back-propagation is often misunderstood as being speciﬁc to multi-

layer neural networks, but in principle it can compute derivatives of any function

(for some functions, the correct response is to report that the derivative of the

function is undeﬁned). Speciﬁcally, we will describe how to compute the gradient

∇

(

x, y

) for an arbitrary function

, where

is a set of variables whose derivatives

are desired, and

is an additional set of variables that are inputs to the function

203

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

but whose derivatives are not required. In learning algorithms, the gradient we most

often require is the gradient of the cost function with respect to the parameters,

∇

(

). Many machine learning tasks involve computing other derivatives, either

as part of the learning process, or to analyze the learned model. The back-

propagation algorithm can be applied to these tasks as well, and is not restricted

to computing the gradient of the cost function with respect to the parameters. The

idea of computing derivatives by propagating information through a network is

very general, and can be used to compute values such as the Jacobian of a function

with multiple outputs. We restrict our description here to the most commonly

used case where f has a single output.

6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language.

To describe the back-propagation algorithm more precisely, it is helpful to have a

more precise computational graph language.

Many ways of formalizing computation as graphs are possible.

Here, we use each node in the graph to indicate a variable. The variable may

be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation.

An operation is a simple function of one or more variables. Our graph language

is accompanied by a set of allowable operations. Functions more complicated

than the operations in this set may be described by composing many operations

together.

Without loss of generality, we deﬁne an operation to return only a single

output variable. This does not lose generality because the output variable can have

multiple entries, such as a vector. Software implementations of back-propagation

usually support operations with multiple outputs, but we avoid this case in our

description because it introduces many extra details that are not important to

conceptual understanding.

If a variable

is computed by applying an operation to a variable

, then

we draw a directed edge from

. We sometimes annotate the output node

with the name of the operation applied, and other times omit this label when the

operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.

204

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

(a)

⇥

(b)

(1)

dot

(2)

ˆyˆy



(c)

(1)

matmul

(2)

relu

(d)

ˆyˆy

dot



(1)

sqr

(2)

sum

(3)

⇥

Figure 6.8: Examples of computational graphs. (a) The graph using the

operation to

compute

. (b) The graph for the logistic regression prediction

ˆy



w + b



Some of the intermediate expressions do not have names in the algebraic expression

but need names in the graph. We simply name the

-th such variable

(i)

. (c) The

computational graph for the expression

max{

, XW

, which computes a design

matrix of rectiﬁed linear unit activations

given a design matrix containing a minibatch

of inputs

. (d) Examples a–c applied at most one operation to each variable, but it

is possible to apply more than one operation. Here we show a computation graph that

applies more than one operation to the weights

of a linear regression model. The

weights are used to make the both the prediction

ˆy

and the weight decay penalty

205

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

6.5.2 Chain Rule of Calculus

The chain rule of calculus (not to be confused with the chain rule of probability) is

used to compute the derivatives of functions formed by composing other functions

whose derivatives are known. Back-propagation is an algorithm that computes the

chain rule, with a speciﬁc order of operations that is highly eﬃcient.

Let

be a real number, and let

and

both be functions mapping from a real

number to a real number. Suppose that

(

) and

(

)) =

(

). Then

the chain rule states that

. (6.44)

We can generalize this beyond the scalar case. Suppose that

x ∈ R

y ∈ R

maps from

, and

maps from

. If

(

) and

(

), then

∂z

∂x

∂z

∂y

∂x

. (6.45)

In vector notation, this may be equivalently written as

∇

z =



∂y

∂x



∇

z, (6.46)

where

∂y

∂x

is the n × m Jacobian matrix of g.

From this we see that the gradient of a variable

can be obtained by multiplying

a Jacobian matrix

∂y

∂x

by a gradient

∇

. The back-propagation algorithm consists

of performing such a Jacobian-gradient product for each operation in the graph.

Usually we do not apply the back-propagation algorithm merely to vectors,

but rather to tensors of arbitrary dimensionality. Conceptually, this is exactly the

same as back-propagation with vectors. The only diﬀerence is how the numbers

are arranged in a grid to form a tensor. We could imagine ﬂattening each tensor

into a vector before we run back-propagation, computing a vector-valued gradient,

and then reshaping the gradient back into a tensor. In this rearranged view,

back-propagation is still just multiplying Jacobians by gradients.

To denote the gradient of a value

with respect to a tensor

, we write

∇

just as if

were a vector. The indices into

now have multiple coordinates—for

example, a 3-D tensor is indexed by three coordinates. We can abstract this away

by using a single variable

to represent the complete tuple of indices. For all

possible index tuples

, (

∇

)

gives

∂z

∂X

. This is exactly the same as how for all

206

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

possible integer indices

into a vector, (

∇

)

gives

∂z

∂x

. Using this notation, we

can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

∇

z =

(∇

)

∂z

∂Y

. (6.47)

6.5.3 Recursively Applying the Chain Rule to Obtain Backprop

Using the chain rule, it is straightforward to write down an algebraic expression for

the gradient of a scalar with respect to any node in the computational graph that

produced that scalar. However, actually evaluating that expression in a computer

introduces some extra considerations.

Speciﬁcally, many subexpressions may be repeated several times within the

overall expression for the gradient. Any procedure that computes the gradient

will need to choose whether to store these subexpressions or to recompute them

several times. An example of how these repeated subexpressions arise is given in

Fig. 6.9. In some cases, computing the same subexpression twice would simply

be wasteful. For complicated graphs, there can be exponentially many of these

wasted computations, making a naive implementation of the chain rule infeasible.

In other cases, computing the same subexpression twice could be a valid way to

reduce memory consumption at the cost of higher runtime.

We ﬁrst begin by a version of the back-propagation algorithm that speciﬁes

the actual gradient computation directly (Algorithm 6.2 along with Algorithm 6.1

for the associated forward computation), in the order it will actually be done and

according to the recursive application of chain rule. One could either directly

perform these computations or view the description of the algorithm as a symbolic

speciﬁcation of the computational graph for computing the back-propagation. How-

ever, this formulation does not make explicit the manipulation and the construction

of the symbolic graph that performs the gradient computation. Such a formulation

is presented below in Sec. 6.5.6, with Algorithm 6.5, where we also generalize to

nodes that contain arbitrary tensors.

First consider a computational graph describing how to compute a single scalar

(n)

(say the loss on a training example). This scalar is the quantity whose

gradient we want to obtain, with respect to the

input nodes

(1)

)

. In

other words we wish to compute

∂u

(n)

∂u

(i)

for all

i ∈ {

, . . . , n

}

. In the application

of back-propagation to computing gradients for gradient descent over parameters,

(n)

will be the cost associated with an example or a minibatch, while

(1)

)

correspond to the parameters of the model.

207

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

We will assume that the nodes of the graph have been ordered in such a way

that we can compute their output one after the other, starting at

+1)

and

going up to

(n)

. As deﬁned in Algorithm 6.1, each node

(i)

is associated with an

operation f

(i)

and is computed by evaluating the function

(i)

= f(A

(i)

) (6.48)

where A

(i)

is the set of all nodes that are parents of u

(i)

Algorithm 6.1

A procedure that performs the computations mapping

inputs

(1)

)

to an output

(n)

. This deﬁnes a computational graph where each node

computes numerical value

(i)

by applying a function

(i)

to the set of arguments

(i)

that comprises the values of previous nodes

(j)

j < i

, with

j ∈ P a

(

(i)

). The

input to the computational graph is the vector

, and is set into the ﬁrst

nodes

(1)

)

. The output of the computational graph is read oﬀ the last (output)

node u

(n)

for i = 1, . . . , n

(i)

← x

end for

for i = n

+ 1, . . . , n do

(i)

← {u

(j)

| j ∈ P a(u

(i)

)}

(i)

← f

(i)

)

end for

return u

(n)

That algorithm speciﬁes the forward propagation computation, which we could

put in a graph

. In order to perform back-propagation, we can construct a

computational graph that depends on

and adds to it an extra set of nodes. These

form a subgraph

with one node per node of

. Computation in

proceeds in

exactly the reverse of the order of computation in

, and each node of

computes

the derivative

∂u

(n)

∂u

(i)

associated with the forward graph node

(i)

. This is done

using the chain rule with respect to scalar output u

(n)

∂u

(n)

∂u

(j)

i:j∈P a(u

(i)

)

∂u

(n)

∂u

(i)

∂u

(i)

∂u

(j)

(6.49)

as speciﬁed by Algorithm 6.2. The subgraph

contains exactly one edge for each

edge from node

(j)

to node

(i)

. The edge from

(j)

(i)

is associated with

the computation of

∂u

(i)

∂u

(j)

. In addition, a dot product is performed for each node,

between the gradient already computed with respect to nodes

(i)

that are children

208

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

(j)

and the vector containing the partial derivatives

∂u

(i)

∂u

(j)

for the same children

nodes

(i)

. To summarize, the amount of computation required for performing

the back-propagation scales linearly with the number of edges in

, where the

computation for each edge corresponds to computing a partial derivative (of one

node with respect to one of its parents) as well as performing one multiplication

and one addition. Below, we generalize this analysis to tensor-valued nodes, which

is just a way to group multiple scalar values in the same node and enable more

eﬃcient implementations.

Algorithm 6.2

Simpliﬁed version of the back-propagation algorithm for computing

the derivatives of

(n)

with respect to the variables in the graph. This example is

intended to further understanding by showing a simpliﬁed case where all variables

are scalars, and we wish to compute the derivatives with respect to

(1)

, . . . , u

)

This simpliﬁed version computes the derivatives of all nodes in the graph. The

computational cost of this algorithm is proportional to the number of edges in

the graph, assuming that the partial derivative associated with each edge requires

a constant time. This is of the same order as the number of computations for

the forward propagation. Each

∂u

(i)

∂u

(j)

is a function of the parents

(j)

(i)

, thus

linking the nodes of the forward graph to those added for the back-propagation

graph.

Run forward propagation (Algorithm 6.1 for this example) to obtain the activa-

tions of the network

Initialize

grad_table

, a data structure that will store the derivatives that have

been computed. The entry

grad_table

[

(i)

] will store the computed value of

∂u

(n)

∂u

(i)

grad_table[∂u

(n)

] ← 1

for j = n − 1 down to 1 do

The next line computes

∂u

(n)

∂u

(j)

i:j∈P a(u

(i)

)

∂u

(n)

∂u

(i)

∂u

(i)

∂u

(j)

using stored values:

grad_table[u

(j)

] ←

i:j∈P a(u

(i)

)

grad_table[u

(i)

]

∂u

(i)

∂u

(j)

end for

return {grad_table[u

(i)

] | i = 1, . . . , n

}

The back-propagation algorithm is designed to reduce the number of common

subexpressions without regard to memory. Speciﬁcally, it performs on the order

of one Jacobian product per node in the graph. This can be seen from the fact

in Algorithm 6.2 that backprop visits each edge from node

(j)

to node

(i)

the graph exactly once in order to obtain the associated partial derivative

∂u

(j)

∂u

(i)

Back-propagation thus avoids the exponential explosion in repeated subexpressions.

209

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

However, other algorithms may be able to avoid more subexpressions by performing

simpliﬁcations on the computational graph, or may be able to conserve memory by

recomputing rather than storing some subexpressions. We will revisit these ideas

after describing the back-propagation algorithm itself.

6.5.4 Back-Propagation Computation in Fully-Connected MLP

To clarify the above deﬁnition of the back-propagation computation, let us consider

the speciﬁc graph associated with a fully-connected multi-layer MLP.

Algorithm 6.3 ﬁrst shows the forward propagation, which maps parameters to

the supervised loss

(

y, y

) associated with a single (input,target) training example

(x, y), with

y the output of the neural network when x is provided in input.

Algorithm 6.4 then shows the corresponding computation to be done for

applying the back-propagation algorithm to this graph.

Algorithm 6.3 and Algorithm 6.4 are demonstrations that are chosen to be

simple and straightforward to understand. However, they are specialized to one

speciﬁc problem.

Modern software implementations are based on the generalized form of back-

propagation described in Sec. 6.5.6 below, which can accommodate any computa-

tional graph by explicitly manipulating a data structure for representing symbolic

computation.

6.5.5 Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or

variables that do not have speciﬁc values. These algebraic and graph-based

representations are called symbolic representations. When we actually use or

train a neural network, we must assign speciﬁc values to these symbols. We

replace a symbolic input to the network

with a speciﬁc numeric value, such as

[1.2, 3.765, −1.8]

Some approaches to back-propagation take a computational graph and a set

of numerical values for the inputs to the graph, then return a set of numerical

values describing the gradient at those input values. We call this approach “symbol-

to-number” diﬀerentiation. This is the approach used by libraries such as Torch

(Collobert et al., 2011b) and Caﬀe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes

to the graph that provide a symbolic description of the desired derivatives. This

210

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Figure 6.9: A computational graph that results in repeated subexpressions when computing

the gradient. Let

w ∈ R

be the input to the graph. We use the same function

R → R

as the operation that we apply at every step of a chain:

(

To compute

∂z

∂w

, we apply Eq. 6.44 and obtain:

∂z

∂w

(6.50)

∂z

∂y

∂x

∂w

(6.51)

(y)f

(x)f

(w) (6.52)

(f(f (w)))f

(f(w))f

(w) (6.53)

Eq. 6.52 suggests an implementation in which we compute the value of

(

) only once

and store it in the variable

. This is the approach taken by the back-propagation

algorithm. An alternative approach is suggested by Eq. 6.53, where the subexpression

(

) appears more than once. In the alternative approach,

(

) is recomputed each time

it is needed. When the memory required to store the value of these expressions is low,

the back-propagation approach of Eq. 6.52 is clearly preferable because of its reduced

runtime. However, Eq. 6.53 is also a valid implementation of the chain rule, and is useful

when memory is limited.

211

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Algorithm 6.3

Forward propagation through a typical deep neural network and

the computation of the cost function. The loss

(

y, y

) depends on the output

and on the target

(see Sec. 6.2.1.1 for examples of loss functions). To obtain the

total cost

, the loss may be added to a regularizer Ω(

), where

contains all the

parameters (weights and biases). Algorithm 6.4 shows how to compute gradients

with respect to parameters

and

. For simplicity, this demonstration uses

only a single input example

. Practical applications should use a minibatch. See

Sec. 6.5.7 for a more realistic demonstration.

Require: Network depth, l

Require: W

(i)

, i ∈ {1, . . . , l}, the weight matrices of the model

Require: b

(i)

, i ∈ {1, . . . , l}, the bias parameters of the model

Require: x, the input to process

Require: y, the target output

(0)

= x

for k = 1, . . . , l do

(k)

= b

(k)

+ W

(k)

(k−1)

(k)

= f(a

(k)

)

end for

y = h

(l)

J = L(

y, y) + λΩ(θ)

is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012)

and TensorFlow (Abadi et al., 2015). An example of how this approach works

is illustrated in Fig. 6.10. The primary advantage of this approach is that

the derivatives are described in the same language as the original expression.

Because the derivatives are just another computational graph, it is possible to run

back-propagation again, diﬀerentiating the derivatives in order to obtain higher

derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.

We will use the latter approach and describe the back-propagation algorithm in

terms of constructing a computational graph for the derivatives. Any subset of the

graph may then be evaluated using speciﬁc numerical values at a later time. This

allows us to avoid specifying exactly when each operation should be computed.

Instead, a generic graph evaluation engine can evaluate every node as soon as its

parents’ values are available.

The description of the symbol-to-symbol based approach subsumes the symbol-

to-number approach. The symbol-to-number approach can be understood as

performing exactly the same computations as are done in the graph built by the

symbol-to-symbol approach. The key diﬀerence is that the symbol-to-number

212

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Algorithm 6.4

Backward computation for the deep neural network of Algo-

rithm 6.3, which uses in addition to the input

a target

. This computation

yields the gradients on the activations

(k)

for each layer

, starting from the

output layer and going backwards to the ﬁrst hidden layer. From these gradients,

which can be interpreted as an indication of how each layer’s output should change

to reduce error, one can obtain the gradient on the parameters of each layer. The

gradients on weights and biases can be immediately used as part of a stochas-

tic gradient update (performing the update right after the gradients have been

computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:

g ← ∇

J = ∇

y, y)

for k = l, l − 1, . . . , 1 do

Convert the gradient on the layer’s output into a gradient into the pre-

nonlinearity activation (element-wise multiplication if f is element-wise):

g ← ∇

(k)

J = g  f

(k)

)

Compute gradients on weights and biases (including the regularization term,

where needed):

∇

(k)

J = g + λ∇

(k)

Ω(θ)

∇

(k)

J = g h

(k−1)>

+ λ∇

(k)

Ω(θ)

Propagate the gradients w.r.t. the next lower-level hidden layer’s activations:

g ← ∇

(k−1)

J = W

(k)>

end for

213

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

⇥

Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In

this approach, the back-propagation algorithm does not need to ever access any actual

speciﬁc numeric values. Instead, it adds nodes to a computational graph describing how

to compute these derivatives. A generic graph evaluation engine can later compute the

derivatives for any speciﬁc numeric values. (Left) In this example, we begin with a graph

representing

(

))). (Right) We run the back-propagation algorithm, instructing

it to construct the graph for the expression corresponding to

. In this example, we do

not explain how the back-propagation algorithm works. The purpose is only to illustrate

what the desired result is: a computational graph with a symbolic description of the

derivative.

214

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

approach does not expose the graph.

6.5.6 General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some

scalar

with respect to one of its ancestors

in the graph, we begin by observing

that the gradient with respect to

is given by

= 1. We can then compute

the gradient with respect to each parent of

in the graph by multiplying the

current gradient by the Jacobian of the operation that produced

. We continue

multiplying by Jacobians traveling backwards through the graph in this way until

we reach

. For any node that may be reached by going backwards from

through

two or more paths, we simply sum the gradients arriving from diﬀerent paths at

that node.

More formally, each node in the graph

corresponds to a variable. To achieve

maximum generality, we describe this variable as being a tensor

. Tensor can

in general have any number of dimensions, and subsume scalars, vectors, and

matrices.

We assume that each variable V is associated with the following subroutines:

• get_operation

(

): This returns the operation that computes

, repre-

sented by the edges coming into

in the computational graph. For example,

there may be a Python or C++ class representing the matrix multiplication

operation, and the

get_operation

function. Suppose we have a variable that

is created by matrix multiplication,

. Then

get_operation

(

)

returns a pointer to an instance of the corresponding C++ class.

• get_consumers

(

V, G

): This returns the list of variables that are children of

V in the computational graph G.

• get_inputs(V, G)

: This returns the list of variables that are parents of

in the computational graph G.

Each operation

is also associated with a

bprop

operation. This

bprop

operation can compute a Jacobian-vector product as described by Eq. 6.47. This

is how the back-propagation algorithm is able to achieve great generality. Each

operation is responsible for knowing how to back-propagate through the edges in

the graph that it participates in. For example, we might use a matrix multiplication

operation to create a variable

. Suppose that the gradient of a scalar

with

respect to

is given by

. The matrix multiplication operation is responsible for

deﬁning two back-propagation rules, one for each of its input arguments. If we call

215

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

the

bprop

method to request the gradient with respect to

given that the gradient

on the output is

, then the

bprop

method of the matrix multiplication operation

must state that the gradient with respect to

is given by

. Likewise, if we

call the

bprop

method to request the gradient with respect to

, then the matrix

operation is responsible for implementing the

bprop

method and specifying that

the desired gradient is given by

. The back-propagation algorithm itself does

not need to know any diﬀerentiation rules. It only needs to call each operation’s

bprop

rules with the right arguments. Formally,

op.bprop

(

inputs, X, G

) must

return

(∇

op.f(inputs)

) G

, (6.54)

which is just an implementation of the chain rule as expressed in Eq. 6.47.

Here,

inputs

is a list of inputs that are supplied to the operation,

op.f

is the

mathematical function that the operation implements,

is the input whose gradient

we wish to compute, and G is the gradient on the output of the operation.

The

op.bprop

method should always pretend that all of its inputs are distinct

from each other, even if they are not. For example, if the

mul

operator is passed

two copies of

to compute

, the

op.bprop

method should still return

as the

derivative with respect to both inputs. The back-propagation algorithm will later

add both of these arguments together to obtain 2

, which is the correct total

derivative on x.

Software implementations of back-propagation usually provide both the opera-

tions and their

bprop

methods, so that users of deep learning software libraries are

able to back-propagate through graphs built using common operations like matrix

multiplication, exponents, logarithms, and so on. Software engineers who build a

new implementation of back-propagation or advanced users who need to add their

own operation to an existing library must usually derive the

op.bprop

method for

any new operations manually.

The back-propagation algorithm is formally described in Algorithm 6.5.

In Sec. 6.5.2, we motivated back-propagation as a strategy for avoiding comput-

ing the same subexpression in the chain rule multiple times. The naive algorithm

could have exponential runtime due to these repeated subexpressions. Now that

we have speciﬁed the back-propagation algorithm, we can understand its com-

putational cost. If we assume that each operation evaluation has roughly the

same cost, then we may analyze the computational cost in terms of the number

of operations executed. Keep in mind here that we refer to an operation as the

fundamental unit of our computational graph, which might actually consist of very

many arithmetic operations (for example, we might have a graph that treats matrix

216

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Algorithm 6.5

The outermost skeleton of the back-propagation algorithm. This

portion does simple setup and cleanup work. Most of the important work happens

in the build_grad subroutine of Algorithm 6.6

Require: T, the target set of variables whose gradients must be computed.

Require: G, the computational graph

Require: z, the variable to be diﬀerentiated

Let

pruned to contain only nodes that are ancestors of

and descendents

of nodes in T.

Initialize grad_table, a data structure associating tensors to their gradients

grad_table[z] ← 1

for V in T do

build_grad(V, G, G

, grad_table)

end for

Return grad_table restricted to T

multiplication as a single operation). Computing a gradient in a graph with

nodes

will never execute more than

(

) operations or store the output of more than

(

) operations. Here we are counting operations in the computational graph, not

individual operations executed by the underlying hardware, so it is important to

remember that the runtime of each operation may be highly variable. For example,

multiplying two matrices that each contain millions of entries might correspond to

a single operation in the graph. We can see that computing the gradient requires as

most

(

) operations because the forward propagation stage will at worst execute

all

nodes in the original graph (depending on which values we want to compute,

we may not need to execute the entire graph). The back-propagation algorithm

adds one Jacobian-vector product, which should be expressed with

(1) nodes, per

edge in the original graph. Because the computational graph is a directed acyclic

graph it has at most

(

) edges. For the kinds of graphs that are commonly used

in practice, the situation is even better. Most neural network cost functions are

roughly chain-structured, causing back-propagation to have

(

) cost. This is far

better than the naive approach, which might need to execute exponentially many

nodes. This potentially exponential cost can be seen by expanding and rewriting

the recursive chain rule (Eq. 6.49) non-recursively:

∂u

(n)

∂u

(j)

path (u

(π

)

(π

)

,...,u

(π

)

from π

=j to π

k=2

∂u

(π

)

∂u

(π

k−1

)

. (6.55)

Since the number of paths from node

to node

can grow up to exponentially in the

217

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Algorithm 6.6

The inner loop subroutine

build_grad

(

V, G, G

, grad_table

) of

the back-propagation algorithm, called by the back-propagation algorithm deﬁned

in Algorithm 6.5.

Require: V, the variable whose gradient should be added to G and grad_table.

Require: G, the graph to modify.

Require: G

, the restriction of G to nodes that participate in the gradient.

Require: grad_table, a data structure mapping nodes to their gradients

if V is in grad_table then

Return grad_table[V]

end if

i ← 1

for C in get_consumers(V, G

) do

op ← get_operation(C)

D ← build_grad(C, G, G

, grad_table)

(i)

← op.bprop(get_inputs(C, G

), V, D)

i ← i + 1

end for

G ←

(i)

grad_table[V] = G

Insert G and the operations creating it into G

Return G

length of these paths, the number of terms in the above sum, which is the number

of such paths, can grow exponentially with the depth of the forward propagation

graph. This large cost would be incurred because the same computation for

∂u

(i)

∂u

(j)

would be redone many times. To avoid such recomputation, we can think

of back-propagation as a table-ﬁlling algorithm that takes advantage of storing

intermediate results

∂u

(n)

∂u

(i)

. Each node in the graph has a corresponding slot in a

table to store the gradient for that node. By ﬁlling in these table entries in order,

back-propagation avoids repeating many common subexpressions. This table-ﬁlling

strategy is sometimes called dynamic programming.

6.5.7 Example: Back-Propagation for MLP Training

As an example, we walk through the back-propagation algorithm as it is used to

train a multilayer perceptron.

Here we develop a very simple multilayer perception with a single hidden

layer. To train this model, we will use minibatch stochastic gradient descent.

218

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

The back-propagation algorithm is used to compute the gradient of the cost on a

single minibatch. Speciﬁcally, we use a minibatch of examples from the training

set formatted as a design matrix

and a vector of associated class labels

The network computes a layer of hidden features

max{

, XW

(1)

}

. To

simplify the presentation we do not use biases in this model. We assume that our

graph language includes a

relu

operation that can compute

max{

, Z}

element-

wise. The predictions of the unnormalized log probabilities over classes are then

given by

(2)

. We assume that our graph language includes a

cross_entropy

operation that computes the cross-entropy between the targets

and the probability

distribution deﬁned by these unnormalized log probabilities. The resulting cross-

entropy deﬁnes the cost

MLE

. Minimizing this cross-entropy performs maximum

likelihood estimation of the classiﬁer. However, to make this example more realistic,

we also include a regularization term. The total cost

J = J

MLE

+ λ





i,j



(1)

i,j



i,j



(2)

i,j







(6.56)

consists of the cross-entropy and a weight decay term with coeﬃcient

. The

computational graph is illustrated in Fig. 6.11.

The computational graph for the gradient of this example is large enough that

it would be tedious to draw or to read. This demonstrates one of the beneﬁts

of the back-propagation algorithm, which is that it can automatically generate

gradients that would be straightforward but tedious for a software engineer to

derive manually.

We can roughly trace out the behavior of the back-propagation algorithm

by looking at the forward propagation graph in Fig. 6.11. To train, we wish

to compute both

∇

(1)

and

∇

(2)

. There are two diﬀerent paths leading

backward from

to the weights: one through the cross-entropy cost, and one

through the weight decay cost. The weight decay cost is relatively simple; it will

always contribute 2λW

(i)

to the gradient on W

(i)

The other path through the cross-entropy cost is slightly more complicated.

Let

be the gradient on the unnormalized log probabilities

(2)

provided by

the

cross_entropy

operation. The back-propagation algorithm now needs to

explore two diﬀerent branches. On the shorter branch, it adds

to the

gradient on

(2)

, using the back-propagation rule for the second argument to

the matrix multiplication operation. The other branch corresponds to the longer

chain descending further along the network. First, the back-propagation algorithm

computes

∇

(2)>

using the back-propagation rule for the ﬁrst argument

to the matrix multiplication operation. Next, the

relu

operation uses its back-

219

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

(1)

matmul

relu

(3)

sqr

(4)

sum



(7)

(2)

matmul

MLE

cross_entropy

(5)

sqr

(6)

sum

(8)

⇥

Figure 6.11: The computational graph used to compute the cost used to train our example

of a single-layer MLP using the cross-entropy loss and weight decay.

propagation rule to zero out components of the gradient corresponding to entries

(1)

that were less than 0. Let the result be called

. The last step of the

back-propagation algorithm is to use the back-propagation rule for the second

argument of the matmul operation to add X

to the gradient on W

(1)

After these gradients have been computed, it is the responsibility of the gradient

descent algorithm, or another optimization algorithm, to use these gradients to

update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix

multiplication. During the forward propagation stage, we multiply by each weight

matrix, resulting in

(

) multiply-adds, where

is the number of weights. During

the backward propagation stage, we multiply by the transpose of each weight

matrix, which has the same computational cost. The main memory cost of the

algorithm is that we need to store the input to the nonlinearity of the hidden layer.

This value is stored from the time it is computed until the backward pass has

returned to the same point. The memory cost is thus

(

), where

is the

number of examples in the minibatch and n

is the number of hidden units.

220

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

6.5.8 Complications

Our description of the back-propagation algorithm here is simpler than the imple-

mentations actually used in practice.

As noted above, we have restricted the deﬁnition of an operation to be a

function that returns a single tensor. Most software implementations need to

support operations that can return more than one tensor. For example, if we wish

to compute both the maximum value in a tensor and the index of that value, it is

best to compute both in a single pass through memory, so it is most eﬃcient to

implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-

propagation. Back-propagation often involves summation of many tensors together.

In the naive approach, each of these tensors would be computed separately, then

all of them would be added in a second step. The naive approach has an overly

high memory bottleneck that can be avoided by maintaining a single buﬀer and

adding each value to that buﬀer as it is computed.

Real-world implementations of back-propagation also need to handle various

data types, such as 32-bit ﬂoating point, 64-bit ﬂoating point, and integer values.

The policy for handling each of these types takes special care to design.

Some operations have undeﬁned gradients, and it is important to track these

cases and determine whether the gradient requested by the user is undeﬁned.

Various other technicalities make real-world diﬀerentiation more complicated.

These technicalities are not insurmountable, and this chapter has described the key

intellectual tools needed to compute derivatives, but it is important to be aware

that many more subtleties exist.

6.5.9 Diﬀerentiation outside the Deep Learning Community

The deep learning community has been somewhat isolated from the broader

computer science community and has largely developed its own cultural attitudes

concerning how to perform diﬀerentiation. More generally, the ﬁeld of automatic

diﬀerentiation is concerned with how to compute derivatives algorithmically. The

back-propagation algorithm described here is only one approach to automatic

diﬀerentiation. It is a special case of a broader class of techniques called reverse

mode accumulation. Other approaches evaluate the subexpressions of the chain rule

in diﬀerent orders. In general, determining the order of evaluation that results in

the lowest computational cost is a diﬃcult problem. Finding the optimal sequence

of operations to compute the gradient is NP-complete (Naumann, 2008), in the

221

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

sense that it may require simplifying algebraic expressions into their least expensive

form.

For example, suppose we have variables

, p

, . . . , p

representing probabilities

and variables

, z

, . . . , z

representing unnormalized log probabilities. Suppose

we deﬁne

exp(z

)

exp(z

)

, (6.57)

where we build the softmax function out of exponentiation, summation and division

operations, and construct a cross-entropy loss

−

log q

. A human

mathematician can observe that the derivative of

with respect to

takes a very

simple form:

−p

. The back-propagation algorithm is not capable of simplifying

the gradient this way, and will instead explicitly propagate gradients through all of

the logarithm and exponentiation operations in the original graph. Some software

libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to

perform some kinds of algebraic substitution to improve over the graph proposed

by the pure back-propagation algorithm.

When the forward graph

has a single output node and each partial derivative

∂u

(i)

∂u

(j)

can be computed with a constant amount of computation, back-propagation

guarantees that the number of computations for the gradient computation is of

the same order as the number of computations for the forward computation: this

can be seen in Algorithm 6.2 because each local partial derivative

∂u

(i)

∂u

(j)

needs

to be computed only once along with an associated multiplication and addition

for the recursive chain-rule formulation (Eq. 6.49). The overall computation is

therefore

(# edges). However, it can potentially be reduced by simplifying the

computational graph constructed by back-propagation, and this is an NP-complete

task. Implementations such as Theano and TensorFlow use heuristics based on

matching known simpliﬁcation patterns in order to iteratively attempt to simplify

the graph. We deﬁned back-propagation only for the computation of a gradient of a

scalar output but back-propagation can be extended to compute a Jacobian (either

diﬀerent scalar nodes in the graph, or of a tensor-valued node containing

values). A naive implementation may then need

times more computation: for

each scalar internal node in the original forward graph, the naive implementation

computes

gradients instead of a single gradient. When the number of outputs

of the graph is larger than the number of inputs, it is sometimes preferable to

use another form of automatic diﬀerentiation called forward mode accumulation.

Forward mode computation has been proposed for obtaining real-time computation

of gradients in recurrent networks, for example (Williams and Zipser, 1989). This

also avoids the need to store the values and gradients for the whole graph, trading

oﬀ computational eﬃciency for memory. The relationship between forward mode

222

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

and backward mode is analogous to the relationship between left-multiplying versus

right-multiplying a sequence of matrices, such as

ABCD, (6.58)

where the matrices can be thought of as Jacobian matrices. For example, if

is a column vector while

has many rows, this corresponds to a graph with a

single output and many inputs, and starting the multiplications from the end

and going backwards only requires matrix-vector products. This corresponds to

the backward mode. Instead, starting to multiply from the left would involve a

series of matrix-matrix products, which makes the whole computation much more

expensive. However, if

has fewer rows than

has columns, it is cheaper to run

the multiplications left-to-right, corresponding to the forward mode.

In many communities outside of machine learning, it is more common to

implement diﬀerentiation software that acts directly on traditional programming

language code, such as Python or C code, and automatically generates programs

that diﬀerent functions written in these languages. In the deep learning community,

computational graphs are usually represented by explicit data structures created by

specialized libraries. The specialized approach has the drawback of requiring the

library developer to deﬁne the

bprop

methods for every operation and limiting the

user of the library to only those operations that have been deﬁned. However, the

specialized approach also has the beneﬁt of allowing customized back-propagation

rules to be developed for each operation, allowing the developer to improve speed

or stability in non-obvious ways that an automatic procedure would presumably

be unable to replicate.

Back-propagation is therefore not the only way or the optimal way of computing

the gradient, but it is a very practical method that continues to serve the deep

learning community very well. In the future, diﬀerentiation technology for deep

networks may improve as deep learning practitioners become more aware of advances

in the broader ﬁeld of automatic diﬀerentiation.

6.5.10 Higher-Order Derivatives

Some software frameworks support the use of higher-order derivatives. Among the

deep learning software frameworks, this includes at least Theano and TensorFlow.

These libraries use the same kind of data structure to describe the expressions for

derivatives as they use to describe the original function being diﬀerentiated. This

means that the symbolic diﬀerentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative

of a scalar function. Instead, we are usually interested in properties of the Hessian

223

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

matrix. If we have a function

→ R

, then the Hessian matrix is of size

n ×n

In typical deep learning applications,

will be the number of parameters in the

model, which could easily number in the billions. The entire Hessian matrix is

thus infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach

is to use Krylov methods. Krylov methods are a set of iterative techniques for

performing various operations like approximately inverting a matrix or ﬁnding

approximations to its eigenvectors or eigenvalues, without using any operation

other than matrix-vector products.

In order to use Krylov methods on the Hessian, we only need to be able to

compute the product between the Hessian matrix

and an arbitrary vector

. A

straightforward technique (Christianson, 1992) for doing so is to compute

Hv = ∇

(∇

f(x))

. (6.59)

Both of the gradient computations in this expression may be computed automati-

cally by the appropriate software library. Note that the outer gradient expression

takes the gradient of a function of the inner gradient expression.

is itself a vector produced by a computational graph, it is important to

specify that the automatic diﬀerentiation software should not diﬀerentiate through

the graph that produced v.

While computing the Hessian is usually not advisable, it is possible to do with

Hessian vector products. One simply computes

(i)

for all

= 1

, . . . , n,

where

(i)

is the one-hot vector with e

(i)

= 1 and all other entries equal to 0.

6.6 Historical Notes

Feedforward networks can be seen as eﬃcient nonlinear function approximators

based on using gradient descent to minimize the error in a function approximation.

From this point of view, the modern feedforward network is the culmination of

centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented

in the 17th century (Leibniz, 1676; L’Hôpital, 1696). Calculus and algebra have

long been used to solve optimization problems in closed form, but gradient descent

was not introduced as a technique for iteratively approximating the solution to

optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to

motivate machine learning models such as the perceptron. However, the earliest

224

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

models were based on linear models. Critics including Marvin Minsky pointed

out several of the ﬂaws of the linear model family, such as it inability to learn the

XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer per-

ceptron and a means of computing the gradient through such a model. Eﬃcient

applications of the chain rule based on dynamic programming began to appear in the

1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham,

1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity

analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques

to training artiﬁcial neural networks. The idea was ﬁnally developed in practice

after being independently rediscovered in diﬀerent ways (LeCun, 1985; Parker,

1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented

the results of some of the ﬁrst successful experiments with back-propagation in a

chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization

of back-propagation and initiated a very active period of research in multi-layer

neural networks. However, the ideas put forward by the authors of that book

and in particular by Rumelhart and Hinton go much beyond back-propagation.

They include crucial ideas about the possible computational implementation of

several central aspects of cognition and learning, which came under the name of

“connectionism” because of the importance given the connections between neurons

as the locus of learning and memory. In particular, these ideas include the notion

of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained pop-

ularity and reached a peak in the early 1990s. Afterwards, other machine learning

techniques became more popular until the modern deep learning renaissance that

began in 2006.

The core ideas behind modern feedforward networks have not changed sub-

stantially since the 1980s. The same back-propagation algorithm and the same

approaches to gradient descent are still in use. Most of the improvement in neural

network performance from 1986 to 2015 can be attributed to two factors. First,

larger datasets have reduced the degree to which statistical generalization is a

challenge for neural networks. Second, neural networks have become much larger,

due to more powerful computers, and better software infrastructure. However, a

small number of algorithmic changes have improved the performance of neural

networks noticeably.

One of these algorithmic changes was the replacement of mean squared error

with the cross-entropy family of loss functions. Mean squared error was popular in

the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the

225

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

principle of maximum likelihood as ideas spread between the statistics community

and the machine learning community. The use of cross-entropy losses greatly

improved the performance of models with sigmoid and softmax outputs, which

had previously suﬀered from saturation and slow learning when using the mean

squared error loss.

The other major algorithmic change that has greatly improved the performance

of feedforward networks was the replacement of sigmoid hidden units with piecewise

linear hidden units, such as rectiﬁed linear units. Rectiﬁcation using the

max{

, z}

function was introduced in early neural network models and dates back at least

as far as the Cognitron and Neocognitron (Fukushima, 1975, 1980). These early

models did not use rectiﬁed linear units, but instead applied rectiﬁcation to

nonlinear functions. Despite the early popularity of rectiﬁcation, rectiﬁcation was

largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better

when neural networks are very small. As of the early 2000s, rectiﬁed linear units

were avoided due to a somewhat superstitious belief that activation functions with

non-diﬀerentiable points must be avoided. This began to change in about 2009.

Jarrett et al. (2009) observed that “using a rectifying nonlinearity is the single most

important factor in improving the performance of a recognition system” among

several diﬀerent factors of neural network architecture design.

For small datasets, Jarrett et al. (2009) observed that using rectifying non-

linearities is even more important than learning the weights of the hidden layers.

Random weights are suﬃcient to propagate useful information through a rectiﬁed

linear network, allowing the classiﬁer layer at the top to learn how to map diﬀerent

feature vectors to class identities.

When more data is available, learning begins to extract enough useful knowledge

to exceed the performance of randomly chosen parameters. Glorot et al. (2011a)

showed that learning is far easier in deep rectiﬁed linear networks than in deep

networks that have curvature or two-sided saturation in their activation functions.

Rectiﬁed linear units are also of historical interest because they show that

neuroscience has continued to have an inﬂuence on the development of deep

learning algorithms. Glorot et al. (2011a) motivate rectiﬁed linear units from

biological considerations. The half-rectifying nonlinearity was intended to capture

these properties of biological neurons: 1) For some inputs, biological neurons are

completely inactive. 2) For some inputs, a biological neuron’s output is proportional

to its input. 3) Most of the time, biological neurons operate in the regime where

they are inactive (i.e., they should have sparse activations).

When the modern resurgence of deep learning began in 2006, feedforward

networks continued to have a bad reputation. From about 2006-2012, it was widely

226

CHAPTER 6. DEEP FEEDFORWARD NETWORKS

believed that feedforward networks would not perform well unless they were assisted

by other models, such as probabilistic models. Today, it is now known that with the

right resources and engineering practices, feedforward networks perform very well.

Today, gradient-based learning in feedforward networks is used as a tool to develop

probabilistic models, such as the variational autoencoder and generative adversarial

networks, described in Chapter 20. Rather than being viewed as an unreliable

technology that must be supported by other techniques, gradient-based learning in

feedforward networks has been viewed since 2012 as a powerful technology that

may be applied to many other machine learning tasks. In 2006, the community

used unsupervised learning to support supervised learning, and now, ironically, it

is more common to use supervised learning to support unsupervised learning.

Feedforward networks continue to have unfulﬁlled potential. In the future, we

expect they will be applied to many more tasks, and that advances in optimization

algorithms and model design will improve their performance even further. This

chapter has primarily described the neural network family of models. In the

subsequent chapters, we turn to how to use these models—how to regularize and

train them.

227