Chapter 11
Practical methodology
Successfully applying deep learning techniques requires more than just a good
knowledge of what algorithms exist and the principles that explain how they
work. A good machine learning practitioner also needs to know how to choose an
algorithm for a particular application and how to monitor and respond to feedback
obtained from experiments in order to improve a machine learning system. During
day to day development of machine learning systems, practitioners need to decide
whether to gather more data, increase or decrease model capacity, add or remove
regularizing features, improve the optimization of a model, improve approximate
inference in a model, or debug the software implementation of the model. All of
these operations are at the very least time-consuming to try out, so it is important
to be able to determine the right course of action rather than blindly guessing.
Most of this book is about different machine learning models, training algo-
rithms, and objective functions. This may give the impression that the most
important ingredient to being a machine learning expert is knowing a wide variety
of machine learning techniques and being good at different kinds of math. In prac-
tice, one can usually do much better with a correct application of a commonplace
algorithm than by sloppily applying an obscure algorithm. Correct application of
an algorithm depends on mastering some fairly simple methodology. Many of the
recommendations in this chapter are adapted from Ng (2015).
We recommend the following practical design process:
Determine your goals—what error metric to use, and your target value for
this error metric. These goals and error metrics should be driven by the
problem that the application is intended to solve.
Establish a working end-to-end pipeline as soon as possible, including the
424
CHAPTER 11. PRACTICAL METHODOLOGY
estimation of the appropriate performance metrics.
Instrument the system well to determine bottlenecks in performance. Diag-
nose which components are performing worse than expected and whether it
is due to overfitting, underfitting, or a defect in the data or software.
Repeatedly make incremental changes such as gathering new data, adjusting
hyperparameters, or changing algorithms, based on specific findings from
your instrumentation.
As a running example, we will use Street View address number transcription
system (Goodfellow et al., 2014d). The purpose of this application is to add
buildings to Google Maps. Street View cars photograph the buildings and record
the GPS coordinates associated with each photograph. A convolutional network
recognizes the address number in each photograph, allowing the Google Maps
database to add that address in the correct location. The story of how this
commercial application was developed gives an example of how to follow the design
methodology we advocate.
We now describe each of the steps in this process.
11.1 Performance Metrics
Determining your goals, in terms of which error metric to use, is a necessary first
step because your error metric will guide all of your future actions. You should
also have an idea of what level of performance you desire.
Keep in mind that for most applications, it is impossible to achieve absolute
zero error. The Bayes error defines the minimum error rate that you can hope to
achieve, even if you have infinite training data and can recover the true probability
distribution. This is because your input features may not contain complete
information about the output variable, or because the system might be intrinsically
stochastic. You will also be limited by having a finite amount of training data.
The amount of training data can be limited for a variety of reasons. When your
goal is to build the best possible real-world product or service, you can typically
collect more data but must determine the value of reducing error further and weigh
this against the cost of collecting more data. Data collection can require time,
money, or human suffering (for example, if your data collection process involves
performing invasive medical tests). When your goal is to answer a scientific question
about which algorithm performs better on a fixed benchmark, the benchmark
425
CHAPTER 11. PRACTICAL METHODOLOGY
specification usually determines the training set and you are not allowed to collect
more data.
How can one determine a reasonable level of performance to expect? Typically,
in the academic setting, we have some estimate of the error rate that is attainable
based on previously published benchmark results. In the real-word setting, we
have some idea of the error rate that is necessary for an application to be safe,
cost-effective, or appealing to consumers. Once you have determined your realistic
desired error rate, your design decisions will be guided by reaching this error rate.
Another important consideration besides the target value of the performance
metric is the choice of which metric to use. Several different performance metrics
may be used to measure the effectiveness of a complete application that includes
machine learning components. These performance metrics are usually different
from the cost function used to train the model. As described in Sec. 5.1.2, it is
common to measure the accuracy, or equivalently, the error rate, of a system.
However, many applications require more advanced metrics.
Sometimes it is much more costly to make one kind of a mistake than another.
For example, an e-mail spam detection system can make two kinds of mistakes:
incorrectly classifying a legitimate message as spam, and incorrectly allowing a
spam message to appear in the inbox. It is much worse to block a legitimate
message than to allow a questionable message to pass through. Rather than
measuring the error rate of a spam classifier, we may wish to measure some form
of total cost, where the cost of blocking legitimate messages is higher than the cost
of allowing spam messages.
Sometimes we wish to train a binary classifier that is intended to detect some
rare event. For example, we might design a medical test for a rare disease. Suppose
that only one in every million people has this disease. We can easily achieve
99.9999% accuracy on the detection task, by simply hard-coding the classifier
to always report that the disease is absent. Clearly, accuracy is a poor way to
characterize the performance of such a system. One way to solve this problem is to
instead measure precision and recall. Precision is the fraction of detections reported
by the model that were correct, while recall is the fraction of true events that
were detected. A detector that says no one has the disease would achieve perfect
precision, but zero recall. A detector that says everyone has the disease would
achieve perfect recall, but precision equal to the percentage of people who have
the disease (0.0001% in our example of a disease that only one people in a million
have). When using precision and recall, it is common to plot a PR curve, with
precision on the
y
-axis and recall on the
x
-axis. The classifier generates a score
that is higher if the event to be detected occurred. For example, a feedforward
426
CHAPTER 11. PRACTICAL METHODOLOGY
network designed to detect a disease outputs
ˆy
=
P
(
y
= 1
| x
), estimating the
probability that a person whose medical results are described by features
x
has
the disease. We choose to report a detection whenever this score exceeds some
threshold. By varying the threshold, we can trade precision for recall. In many
cases, we wish to summarize the performance of the classifier with a single number
rather than a curve. To do so, we can convert precision
p
and recall
r
into an
F-score given by
F =
2pr
p + r
. (11.1)
Another option is to report the total area lying beneath the PR curve.
In some applications, it is possible for the machine learning system to refuse to
make a decision. This is useful when the machine learning algorithm can estimate
how confident it should be about a decision, especially if a wrong decision can
be harmful and if a human operator is able to occasionally take over. The Street
View transcription system provides an example of this situation. The task is to
transcribe the address number from a photograph in order to associate the location
where the photo was taken with the correct address in a map. Because the value
of the map degrades considerably if the map is inaccurate, it is important to add
an address only if the transcription is correct. If the machine learning system
thinks that it is less likely than a human being to obtain the correct transcription,
then the best course of action is to allow a human to transcribe the photo instead.
Of course, the machine learning system is only useful if it is able to dramatically
reduce the amount of photos that the human operators must process. A natural
performance metric to use in this situation is coverage. Coverage is the fraction of
examples for which the machine learning system is able to produce a response. It
is possible to trade coverage for accuracy. One can always obtain 100% accuracy
by refusing to process any example, but this reduces the coverage to 0%. For the
Street View task, the goal for the project was to reach human-level transcription
accuracy while maintaining 95% coverage. Human-level performance on this task
is 98% accuracy.
Many other metrics are possible. We can for example, measure click-through
rates, collect user satisfaction surveys, and so on. Many specialized application
areas have application-specific criteria as well.
What is important is to determine which performance metric to improve ahead
of time, then concentrate on improving this metric. Without clearly defined goals,
it can be difficult to tell whether changes to a machine learning system make
progress or not.
427
CHAPTER 11. PRACTICAL METHODOLOGY
11.2 Default Baseline Models
After choosing performance metrics and goals, the next step in any practical
application is to establish a reasonable end-to-end system as soon as possible. In
this section, we provide recommendations for which algorithms to use as the first
baseline approach in various situations. Keep in mind that deep learning research
progresses quickly, so better default algorithms are likely to become available soon
after this writing.
Depending on the complexity of your problem, you may even want to begin
without using deep learning. If your problem has a chance of being solved by
just choosing a few linear weights correctly, you may want to begin with a simple
statistical model like logistic regression.
If you know that your problem falls into an “AI-complete” category like object
recognition, speech recognition, machine translation, and so on, then you are likely
to do well by beginning with an appropriate deep learning model.
First, choose the general category of model based on the structure of your
data. If you want to perform supervised learning with fixed-size vectors as input,
use a feedforward network with fully connected layers. If the input has known
topological structure (for example, if the input is an image), use a convolutional
network. In these cases, you should begin by using some kind of piecewise linear
unit (ReLUs or their generalizations like Leaky ReLUs, PreLus and maxout). If
your input or output is a sequence, use a gated recurrent net (LSTM or GRU).
A reasonable choice of optimization algorithm is SGD with momentum with a
decaying learning rate (popular decay schemes that perform better or worse on
different problems include decaying linearly until reaching a fixed minimum learning
rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10
each time validation error plateaus). Another very reasonable alternative is Adam.
Batch normalization can have a dramatic effect on optimization performance,
especially for convolutional networks and networks with sigmoidal nonlinearities.
While it is reasonable to omit batch normalization from the very first baseline, it
should be introduced quickly if optimization appears to be problematic.
Unless your training set contains tens of millions of examples or more, you
should include some mild forms of regularization from the start. Early stopping
should be used almost universally. Dropout is an excellent regularizer that is easy
to implement and compatible with many models and training algorithms. Batch
normalization also sometimes reduces generalization error and allows dropout to
be omitted, due to the noise in the estimate of the statistics used to normalize
each variable.
428
CHAPTER 11. PRACTICAL METHODOLOGY
If your task is similar to another task that has been studied extensively, you
will probably do well by first copying the model and algorithm that is already
known to perform best on the previously studied task. You may even want to copy
a trained model from that task. For example, it is common to use the features
from a convolutional network trained on ImageNet to solve other computer vision
tasks (Girshick et al., 2015).
A common question is whether to begin by using unsupervised learning, de-
scribed further in Part III. This is somewhat domain specific. Some domains, such
as natural language processing, are known to benefit tremendously from unsuper-
vised learning techniques such as learning unsupervised word embeddings. In other
domains, such as computer vision, current unsupervised learning techniques do
not bring a benefit, except in the semi-supervised setting, when the number of
labeled examples is very small (Kingma et al., 2014; Rasmus et al., 2015). If your
application is in a context where unsupervised learning is known to be important,
then include it in your first end-to-end baseline. Otherwise, only use unsupervised
learning in your first attempt if the task you want to solve is unsupervised. You
can always try adding unsupervised learning later if you observe that your initial
baseline overfits.
11.3 Determining Whether to Gather More Data
After the first end-to-end system is established, it is time to measure the perfor-
mance of the algorithm and determine how to improve it. Many machine learning
novices are tempted to make improvements by trying out many different algorithms.
However, it is often much better to gather more data than to improve the learning
algorithm.
How does one decide whether to gather more data? First, determine whether
the performance on the training set is acceptable. If performance on the training
set is poor, the learning algorithm is not using the training data that is already
available, so there is no reason to gather more data. Instead, try increasing the
size of the model by adding more layers or adding more hidden units to each layer.
Also, try improving the learning algorithm, for example by tuning the learning rate
hyperparameter. If large models and carefully tuned optimization algorithms do
not work well, then the problem might be the
quality
of the training data. The
data may be too noisy or may not include the right inputs needed to predict the
desired outputs. This suggests starting over, collecting cleaner data or collecting a
richer set of features.
If the performance on the training set is acceptable, then measure the per-
429
CHAPTER 11. PRACTICAL METHODOLOGY
formance on a test set. If the performance on the test set is also acceptable,
then there is nothing left to be done. If test set performance is much worse than
training set performance, then gathering more data is one of the most effective
solutions. The key considerations are the cost and feasibility of gathering more
data, the cost and feasibility of reducing the test error by other means, and the
amount of data that is expected to be necessary to improve test set performance
significantly. At large internet companies with millions or billions of users, it is
feasible to gather large datasets, and the expense of doing so can be considerably
less than the other alternatives, so the answer is almost always to gather more
training data. For example, the development of large labeled datasets was one of
the most important factors in solving object recognition. In other contexts, such as
medical applications, it may be costly or infeasible to gather more data. A simple
alternative to gathering more data is to reduce the size of the model or improve
regularization, by adjusting hyperparameters such as weight decay coefficients,
or by adding regularization strategies such as dropout. If you find that the gap
between train and test performance is still unacceptable even after tuning the
regularization hyperparameters, then gathering more data is advisable.
When deciding whether to gather more data, it is also necessary to decide
how much to gather. It is helpful to plot curves showing the relationship between
training set size and generalization error, like in Fig. 5.4. By extrapolating such
curves, one can predict how much additional training data would be needed to
achieve a certain level of performance. Usually, adding a small fraction of the total
number of examples will not have a noticeable impact on generalization error. It is
therefore recommended to experiment with training set sizes on a logarithmic scale,
for example doubling the number of examples between consecutive experiments.
If gathering much more data is not feasible, the only other way to improve
generalization error is to improve the learning algorithm itself. This becomes the
domain of research and not the domain of advice for applied practitioners.
11.4 Selecting Hyperparameters
Most deep learning algorithms come with many hyperparameters that control many
aspects of the algorithm’s behavior. Some of these hyperparameters affect the time
and memory cost of running the algorithm. Some of these hyperparameters affect
the quality of the model recovered by the training process and its ability to infer
correct results when deployed on new inputs.
There are two basic approaches to choosing these hyperparameters: choosing
them manually and choosing them automatically. Choosing the hyperparameters
430
CHAPTER 11. PRACTICAL METHODOLOGY
manually requires understanding what the hyperparameters do and how machine
learning models achieve good generalization. Automatic hyperparameter selection
algorithms greatly reduce the need to understand these ideas, but they are often
much more computationally costly.
11.4.1 Manual Hyperparameter Tuning
To set hyperparameters manually, one must understand the relationship between
hyperparameters, training error, generalization error and computational resources
(memory and runtime). This means establishing a solid foundation on the fun-
damental ideas concerning the effective capacity of a learning algorithm from
Chapter 5.
The goal of manual hyperparameter search is usually to find the lowest general-
ization error subject to some runtime and memory budget. We do not discuss how
to determine the runtime and memory impact of various hyperparameters here
because this is highly platform-dependent.
The primary goal of manual hyperparameter search is to adjust the effective
capacity of the model to match the complexity of the task. Effective capacity
is constrained by three factors: the representational capacity of the model, the
ability of the learning algorithm to successfully minimize the cost function used to
train the model, and the degree to which the cost function and training procedure
regularize the model. A model with more layers and more hidden units per layer has
higher representational capacity—it is capable of representing more complicated
functions. It can not necessarily actually learn all of these functions though, if
the training algorithm cannot discover that certain functions do a good job of
minimizing the training cost, or if regularization terms such as weight decay forbid
some of these functions.
The generalization error typically follows a U-shaped curve when plotted as
a function of one of the hyperparameters, as in Fig. 5.3. At one extreme, the
hyperparameter value corresponds to low capacity, and generalization error is high
because training error is high. This is the underfitting regime. At the other extreme,
the hyperparameter value corresponds to high capacity, and the generalization
error is high because the gap between training and test error is high. Somewhere
in the middle lies the optimal model capacity, which achieves the lowest possible
generalization error, by adding a medium generalization gap to a medium amount
of training error.
For some hyperparameters, overfitting occurs when the value of the hyper-
parameter is large. The number of hidden units in a layer is one such example,
431
CHAPTER 11. PRACTICAL METHODOLOGY
because increasing the number of hidden units increases the capacity of the model.
For some hyperparameters, overfitting occurs when the value of the hyperparame-
ter is small. For example, the smallest allowable weight decay coefficient of zero
corresponds to the greatest effective capacity of the learning algorithm.
Not every hyperparameter will be able to explore the entire U-shaped curve.
Many hyperparameters are discrete, such as the number of units in a layer or the
number of linear pieces in a maxout unit, so it is only possible to visit a few points
along the curve. Some hyperparameters are binary. Usually these hyperparameters
are switches that specify whether or not to use some optional component of
the learning algorithm, such as a preprocessing step that normalizes the input
features by subtracting their mean and dividing by their standard deviation. These
hyperparameters can only explore two points on the curve. Other hyperparameters
have some minimum or maximum value that prevents them from exploring some
part of the curve. For example, the minimum weight decay coefficient is zero. This
means that if the model is underfitting when weight decay is zero, we can not enter
the overfitting region by modifying the weight decay coefficient. In other words,
some hyperparameters can only subtract capacity.
The learning rate is perhaps the most important hyperparameter. If you
have time to tune only one hyperparameter, tune the learning rate. It con-
trols the effective capacity of the model in a more complicated way than other
hyperparameters—the effective capacity of the model is highest when the learning
rate is
correct
for the optimization problem, not when the learning rate is espe-
cially large or especially small. The learning rate has a U-shaped curve for training
error, illustrated in Fig. 11.1. When the learning rate is too large, gradient descent
can inadvertently increase rather than decrease the training error. In the idealized
quadratic case, this occurs if the learning rate is at least twice as large as its
optimal value (LeCun et al., 1998a). When the learning rate is too small, training
is not only slower, but may become permanently stuck with a high training error.
This effect is poorly understood (it would not happen for a convex loss function).
Tuning the parameters other than the learning rate requires monitoring both
training and test error to diagnose whether your model is overfitting or underfitting,
then adjusting its capacity appropriately.
If your error on the training set is higher than your target error rate, you have
no choice but to increase capacity. If you are not using regularization and you are
confident that your optimization algorithm is performing correctly, then you must
add more layers to your network or add more hidden units. Unfortunately, this
increases the computational costs associated with the model.
If your error on the test set is higher than than your target error rate, you can
432
CHAPTER 11. PRACTICAL METHODOLOGY
10
2
10
1
10
0
Learning rate (logarithmic scale)
0
1
2
3
4
5
6
7
8
Training error
Figure 11.1: Typical relationship between the learning rate and the training error. Notice
the sharp rise in error when the learning is above an optimal value. This is for a fixed
training time, as a smaller learning rate may sometimes only slow down training by a
factor proportional to the learning rate reduction. Generalization error can follow this
curve or be complicated by regularization effects arising out of having a too large or
too small learning rates, since poor optimization can, to some degree, reduce or prevent
overfitting, and even points with equivalent training error can have different generalization
error.
now take two kinds of actions. The test error is the sum of the training error and
the gap between training and test error. The optimal test error is found by trading
off these quantities. Neural networks typically perform best when the training
error is very low (and thus, when capacity is high) and the test error is primarily
driven by the gap between train and test error. Your goal is to reduce this gap
without increasing training error faster than the gap decreases. To reduce the gap,
change regularization hyperparameters to reduce effective model capacity, such as
by adding dropout or weight decay. Usually the best performance comes from a
large model that is regularized well, for example by using dropout.
Most hyperparameters can be set by reasoning about whether they increase or
decrease model capacity. Some examples are included in Table 11.1.
While manually tuning hyperparameters, do not lose sight of your end goal:
good performance on the test set. Adding regularization is only one way to achieve
this goal. As long as you have low training error, you can always reduce general-
ization error by collecting more training data. The brute force way to practically
guarantee success is to continually increase model capacity and training set size
until the task is solved. This approach does of course increase the computational
cost of training and inference, so it is only feasible given appropriate resources. In
433
CHAPTER 11. PRACTICAL METHODOLOGY
Hyperparameter
Increases
capacity
when. . .
Reason Caveats
Number of hid-
den units
increased
Increasing the number of
hidden units increases the
representational capacity
of the model.
Increasing the number
of hidden units increases
both the time and memory
cost of essentially every op-
eration on the model.
Learning rate
tuned op-
timally
An improper learning rate,
whether too high or too
low, results in a model
with low effective capacity
due to optimization failure
Convolution ker-
nel width
increased
Increasing the kernel width
increases the number of pa-
rameters in the model
A wider kernel results in
a narrower output dimen-
sion, reducing model ca-
pacity unless you use im-
plicit zero padding to re-
duce this effect. Wider
kernels require more mem-
ory for parameter storage
and increase runtime, but
a narrower output reduces
memory cost.
Implicit zero
padding
increased
Adding implicit zeros be-
fore convolution keeps the
representation size large
Increased time and mem-
ory cost of most opera-
tions.
Weight decay co-
efficient
decreased
Decreasing the weight de-
cay coefficient frees the
model parameters to be-
come larger
Dropout rate decreased
Dropping units less often
gives the units more oppor-
tunities to “conspire” with
each other to fit the train-
ing set
Table 11.1: The effect of various hyperparameters on model capacity.
434
CHAPTER 11. PRACTICAL METHODOLOGY
principle, this approach could fail due to optimization difficulties, but for many
problems optimization does not seem to be a significant barrier, provided that the
model is chosen appropriately.
11.4.2 Automatic Hyperparameter Optimization Algorithms
The ideal learning algorithm just takes a dataset and outputs a function, without
requiring hand-tuning of hyperparameters. The popularity of several learning
algorithms such as logistic regression and SVMs stems in part from their ability to
perform well with only one or two tuned hyperparameters. Neural networks can
sometimes perform well with only a small number of tuned hyperparameters, but
often benefit significantly from tuning of forty or more hyperparameters. Manual
hyperparameter tuning can work very well when the user has a good starting point,
such as one determined by others having worked on the same type of application
and architecture, or when the user has months or years of experience in exploring
hyperparameter values for neural networks applied to similar tasks. However,
for many applications, these starting points are not available. In these cases,
automated algorithms can find useful values of the hyperparameters.
If we think about the way in which the user of a learning algorithm searches
for good values of the hyperparameters, we realize that an optimization is taking
place: we are trying to find a value of the hyperparameters that optimizes an
objective function, such as validation error, sometimes under constraints (such as a
budget for training time, memory or recognition time). It is therefore possible, in
principle, to develop hyperparameter optimization algorithms that wrap a learning
algorithm and choose its hyperparameters, thus hiding the hyperparameters of the
learning algorithm from the user. Unfortunately, hyperparameter optimization
algorithms often have their own hyperparameters, such as the range of values that
should be explored for each of the learning algorithm’s hyperparameters. However,
these secondary hyperparameters are usually easier to choose, in the sense that
acceptable performance may be achieved on a wide range of tasks using the same
secondary hyperparameters for all tasks.
11.4.3 Grid Search
When there are three or fewer hyperparameters, the common practice is to perform
grid search. For each hyperparameter, the user selects a small finite set of values to
explore. The grid search algorithm then trains a model for every joint specification
of hyperparameter values in the Cartesian product of the set of values for each
individual hyperparameter. The experiment that yields the best validation set
435
CHAPTER 11. PRACTICAL METHODOLOGY
Grid Layout Random Layout
Unimportant parameter
Important parameter
Unimportant parameter
Important parameter
Grid Layout Random Layout
Unimportant parameter
Important parameter
Unimportant parameter
Important parameter
Grid Random
Figure 11.2: Comparison of grid search and random search. For illustration purposes we
display two hyperparameters but we are typically interested in having many more. (Left)
To perform grid search, we provide a set of values for each hyperparameter. The search
algorithm runs training for every joint hyperparameter setting in the cross product of these
sets. (Right) To perform random search, we provide a probability distribution over joint
hyperparameter configurations. Usually most of these hyperparameters are independent
from each other. Common choices for the distribution over a single hyperparameter include
uniform and log-uniform (to sample from a log-uniform distribution, take the
exp
of a
sample from a uniform distribution). The search algorithm then randomly samples joint
hyperparameter configurations and runs training with each of them. Both grid search
and random search evaluate the validation set error and return the best configuration.
The figure illustrates the typical case where only some hyperparameters have a significant
influence on the result. In this illustration, only the hyperparameter on the horizontal axis
has a significant effect. Grid search wastes an amount of computation that is exponential
in the number of non-influential hyperparameters, while random search tests a unique
value of every influential hyperparameter on nearly every trial.
436
CHAPTER 11. PRACTICAL METHODOLOGY
error is then chosen as having found the best hyperparameters. See the left of
Fig. 11.2 for an illustration of a grid of hyperparameter values.
How should the lists of values to search over be chosen? In the case of numerical
(ordered) hyperparameters, the smallest and largest element of each list is chosen
conservatively, based on prior experience with similar experiments, to make sure
that the optimal value is very likely to be in the selected range. Typically, a
grid search involves picking values approximately on a
logarithmic scale
, e.g., a
learning rate taken within the set
{.
1
, .
01
,
10
3
,
10
4
,
10
5
}
, or a number of hidden
units taken with the set {50, 100, 200, 500, 1000, 2000}.
Grid search usually performs best when it is performed repeatedly. For example,
suppose that we ran a grid search over a hyperparameter
α
using values of
{−
1
,
0
,
1
}
.
If the best value found is 1, then we underestimated the range in which the best
α
lies and we should shift the grid and run another search with
α
in, for example,
{
1
,
2
,
3
}
. If we find that the best value of
α
is 0, then we may wish to refine our
estimate by zooming in and running a grid search over {−.1, 0, .1}.
The obvious problem with grid search is that its computational cost grows
exponentially with the number of hyperparameters. If there are
m
hyperparameters,
each taking at most
n
values, then the number of training and evaluation trials
required grows as
O
(
n
m
). The trials may be run in parallel and exploit loose
parallelism (with almost no need for communication between different machines
carrying out the search) Unfortunately, due to the exponential cost of grid search,
even parallelization may not provide a satisfactory size of search.
11.4.4 Random Search
Fortunately, there is an alternative to grid search that is as simple to program, more
convenient to use, and converges much faster to good values of the hyperparameters:
random search (Bergstra and Bengio, 2012).
A random search proceeds as follows. First we define a marginal distribution
for each hyperparameter, e.g., a Bernoulli or multinoulli for binary or discrete
hyperparameters, or a uniform distribution on a log-scale for positive real-valued
hyperparameters. For example,
log_learning_rate u(1, 5) (11.2)
learning_rate = 10
log_learning_rate
. (11.3)
where
u
(
a, b
) indicates a sample of the uniform distribution in the interval (
a, b
).
Similarly the
log_number_of_hidden_units
may be sampled from
u
(
log
(50)
,
log(2000)).
437
CHAPTER 11. PRACTICAL METHODOLOGY
Unlike in the case of a grid search, one
should not discretize
or bin the
values of the hyperparameters. This allows one to explore a larger set of values, and
does not incur additional computational cost. In fact, as illustrated in Fig. 11.2, a
random search can be exponentially more efficient than a grid search, when there
are several hyperparameters that do not strongly affect the performance measure.
This is studied at length in Bergstra and Bengio (2012), who found that random
search reduces the validation set error much faster than grid search, in terms of
the number of trials run by each method.
As with grid search, one may often want to run repeated versions of random
search, to refine the search based on the results of the first run.
The main reason why random search finds good solutions faster than grid search
is that the there are no wasted experimental runs, unlike in the case of grid search,
when two values of a hyperparameter (given values of the other hyperparameters)
would give the same result. In the case of grid search, the other hyperparameters
would have the same values for these two runs, whereas with random search, they
would usually have different values. Hence if the change between these two values
does not marginally make much difference in terms of validation set error, grid
search will unnecessarily repeat two equivalent experiments while random search
will still give two independent explorations of the other hyperparameters.
11.4.5 Model-Based Hyperparameter Optimization
The search for good hyperparameters can be cast as an optimization problem.
The decision variables are the hyperparameters. The cost to be optimized is the
validation set error that results from training using these hyperparameters. In
simplified settings where it is feasible to compute the gradient of some differentiable
error measure on the validation set with respect to the hyperparameters, we can
simply follow this gradient (Bengio et al., 1999; Bengio, 2000; Maclaurin et al.,
2015). Unfortunately, in most practical settings, this gradient is unavailable, either
due to its high computation and memory cost, or due to hyperparameters having
intrinsically non-differentiable interactions with the validation set error, as in the
case of discrete-valued hyperparameters.
To compensate for this lack of a gradient, we can build a model of the validation
set error, then propose new hyperparameter guesses by performing optimization
within this model. Most model-based algorithms for hyperparameter search use a
Bayesian regression model to estimate both the expected value of the validation set
error for each hyperparameter and the uncertainty around this expectation. Opti-
mization thus involves a tradeoff between exploration (proposing hyperparameters
438
CHAPTER 11. PRACTICAL METHODOLOGY
for which there is high uncertainty, which may lead to a large improvement but may
also perform poorly) and exploitation (proposing hyperparameters which the model
is confident will perform as well as any hyperparameters it has seen so far—usually
hyperparameters that are very similar to ones it has seen before). Contemporary
approaches to hyperparameter optimization include Spearmint (Snoek et al., 2012),
TPE (Bergstra et al., 2011) and SMAC (Hutter et al., 2011).
Currently, we cannot unambiguously recommend Bayesian hyperparameter
optimization as an established tool for achieving better deep learning results or
for obtaining those results with less effort. Bayesian hyperparameter optimization
sometimes performs comparably to human experts, sometimes better, but fails
catastrophically on other problems. It may be worth trying to see if it works on
a particular problem but is not yet sufficiently mature or reliable. That being
said, hyperparameter optimization is an important field of research that, while
often driven primarily by the needs of deep learning, holds the potential to benefit
not only the entire field of machine learning but the discipline of engineering in
general.
One drawback common to most hyperparameter optimization algorithms with
more sophistication than random search is that they require for a training ex-
periment to run to completion before they are able to extract any information
from the experiment. This is much less efficient, in the sense of how much infor-
mation can be gleaned early in an experiment, than manual search by a human
practitioner, since one can usually tell early on if some set of hyperparameters is
completely pathological. Swersky et al. (2014) have introduced an early version
of an algorithm that maintains a set of multiple experiments. At various time
points, the hyperparameter optimization algorithm can choose to begin a new
experiment, to “freeze” a running experiment that is not promising, or to “thaw”
and resume an experiment that was earlier frozen but now appears promising given
more information.
11.5 Debugging Strategies
When a machine learning system performs poorly, it is usually difficult to tell
whether the poor performance is intrinsic to the algorithm itself or whether there
is a bug in the implementation of the algorithm. Machine learning systems are
difficult to debug for a variety of reasons.
In most cases, we do not know a priori what the intended behavior of the
algorithm is. In fact, the entire point of using machine learning is that it will
discover useful behavior that we were not able to specify ourselves. If we train a
439
CHAPTER 11. PRACTICAL METHODOLOGY
neural network on a new classification task and it achieves 5% test error, we have
no straightforward way of knowing if this is the expected behavior or sub-optimal
behavior.
A further difficulty is that most machine learning models have multiple parts
that are each adaptive. If one part is broken, the other parts can adapt and still
achieve roughly acceptable performance. For example, suppose that we are training
a neural net with several layers parametrized by weights
W
and biases
b
. Suppose
further that we have manually implemented the gradient descent rule for each
parameter separately, and we made an error in the update for the biases:
b b α (11.4)
where
α
is the learning rate. This erroneous update does not use the gradient at
all. It causes the biases to constantly become negative throughout learning, which
is clearly not a correct implementation of any reasonable learning algorithm. The
bug may not be apparent just from examining the output of the model though.
Depending on the distribution of the input, the weights may be able to adapt to
compensate for the negative biases.
Most debugging strategies for neural nets are designed to get around one or
both of these two difficulties. Either we design a case that is so simple that the
correct behavior actually can be predicted, or we design a test that exercises one
part of the neural net implementation in isolation.
Some important debugging tests include:
Visualize the model in action : When training a model to detect objects in
images, view some images with the detections proposed by the model displayed
superimposed on the image. When training a generative model of speech, listen to
some of the speech samples it produces. This may seem obvious, but it is easy to
fall into the practice of only looking at quantitative performance measurements
like accuracy or log-likelihood. Directly observing the machine learning model
performing its task will help you to determine whether the quantitative performance
numbers it achieves seem reasonable. Evaluation bugs can be some of the most
devastating bugs because they can mislead you into believing your system is
performing well when it is not.
Visualize the worst mistakes : Most models are able to output some sort of
confidence measure for the task they perform. For example, classifiers based on a
softmax output layer assign a probability to each class. The probability assigned
to the most likely class thus gives an estimate of the confidence the model has in
its classification decision. Typically, maximum likelihood training results in these
values being overestimates rather than accurate probabilities of correct prediction,
440
CHAPTER 11. PRACTICAL METHODOLOGY
but they are somewhat useful in the sense that examples that are actually less
likely to be correctly labeled receive smaller probabilities under the model. By
viewing the training set examples that are the hardest to model correctly, one can
often discover problems with the way the data has been preprocessed or labeled.
For example, the Street View transcription system originally had a problem where
the address number detection system would crop the image too tightly and omit
some of the digits. The transcription network then assigned very low probability
to the correct answer on these images. Sorting the images to identify the most
confident mistakes showed that there was a systematic problem with the cropping.
Modifying the detection system to crop much wider images resulted in much better
performance of the overall system, even though the transcription network needed
to be able to process greater variation in the position and scale of the address
numbers.
Reasoning about software using train and test error: It is often difficult to
determine whether the underlying software is correctly implemented. Some clues
can be obtained from the train and test error. If training error is low but test error
is high, then it is likely that that the training procedure works correctly, and the
model is overfitting for fundamental algorithmic reasons. An alternative possibility
is that the test error is measured incorrectly due to a problem with saving the
model after training then reloading it for test set evaluation, or if the test data
was prepared differently from the training data. If both train and test error are
high, then it is difficult to determine whether there is a software defect or whether
the model is underfitting due to fundamental algorithmic reasons. This scenario
requires further tests, described next.
Fit a tiny dataset: If you have high error on the training set, determine whether
it is due to genuine underfitting or due to a software defect. Usually even small
models can be guaranteed to be able fit a sufficiently small dataset. For example,
a classification dataset with only one example can be fit just by setting the biases
of the output layer correctly. Usually if you cannot train a classifier to correctly
label a single example, an autoencoder to successfully reproduce a single example
with high fidelity, or a generative model to consistently emit samples resembling a
single example, there is a software defect preventing successful optimization on the
training set. This test can be extended to a small dataset with few examples.
Compare back-propagated derivatives to numerical derivatives: If you are using
a software framework that requires you to implement your own gradient com-
putations, or if you are adding a new operation to a differentiation library and
must define its
bprop
method, then a common source of error is implementing this
gradient expression incorrectly. One way to verify that these derivatives are correct
441
CHAPTER 11. PRACTICAL METHODOLOGY
is to compare the derivatives computed by your implementation of automatic
differentiation to the derivatives computed by a finite differences. Because
f
0
(x) = lim
0
f(x + ) f(x)
, (11.5)
we can approximate the derivative by using a small, finite :
f
0
(x)
f(x + ) f(x)
. (11.6)
We can improve the accuracy of the approximation by using the centered difference:
f
0
(x)
f(x +
1
2
) f(x
1
2
)
. (11.7)
The perturbation size
must chosen to be large enough to ensure that the pertur-
bation is not rounded down too much by finite-precision numerical computations.
Usually, we will want to test the gradient or Jacobian of a vector-valued function
g
:
R
m
R
n
. Unfortunately, finite differencing only allows us to take a single
derivative at a time. We can either run finite differencing
mn
times to evaluate all
of the partial derivatives of
g
, or we can apply the test to a new function that uses
random projections at both the input and output of
g
. For example, we can apply
our test of the implementation of the derivatives to
f
(
x
) where
f
(
x
) =
u
T
g
(
vx
),
where
u
and
v
are randomly chosen vectors. Computing
f
0
(
x
) correctly requires
being able to back-propagate through
g
correctly, yet is efficient to do with finite
differences because
f
has only a single input and a single output. It is usually
a good idea to repeat this test for more than one value of
u
and
v
to reduce
the chance that the test overlooks mistakes that are orthogonal to the random
projection.
If one has access to numerical computation on complex numbers, then there is
a very efficient way to numerically estimate the gradient by using complex numbers
as input to the function (Squire and Trapp, 1998). The method is based on the
observation that
f(x + i) = f(x) + if
0
(x) + O(
2
) (11.8)
real(f(x + i)) = f(x) + O(
2
), imag(
f(x + i)
) = f
0
(x) + O(
2
), (11.9)
where
i
=
1
. Unlike in the real-valued case above, there is no cancellation effect
due to taking the difference between the value of
f
at different points. This allows
the use of tiny values of
like
= 10
150
, which make the
O
(
2
) error insignificant
for all practical purposes.
442
CHAPTER 11. PRACTICAL METHODOLOGY
Monitor histograms of activations and gradient: It is often useful to visualize
statistics of neural network activations and gradients, collected over a large amount
of training iterations (maybe one epoch). The pre-activation value of hidden units
can tell us if the units saturate, or how often they do. For example, for rectifiers,
how often are they off? Are there units that are always off? For tanh units,
the average of the absolute value of the pre-activations tells us how saturated
the unit is. In a deep network where the propagated gradients quickly grow or
quickly vanish, optimization may be hampered. Finally, it is useful to compare the
magnitude of parameter gradients to the magnitude of the parameters themselves.
As suggested by Bottou (2015), we would like the magnitude of parameter updates
over a minibatch to represent something like 1% of the magnitude of the parameter,
not 50% or 0.001% (which would make the parameters move too slowly). It may
be that some groups of parameters are moving at a good pace while others are
stalled. When the data is sparse (like in natural language), some parameters may
be very rarely updated, and this should be kept in mind when monitoring their
evolution.
Finally, many deep learning algorithms provide some sort of guarantee about
the results produced at each step. For example, in Part III, we will see some
approximate inference algorithms that work by using algebraic solutions to op-
timization problems. Typically these can be debugged by testing each of their
guarantees. Some guarantees that some optimization algorithms offer include that
the objective function will never increase after one step of the algorithm, that
the gradient with respect to some subset of variables will be zero after each step
of the algorithm, and that the gradient with respect to all variables will be zero
at convergence. Usually due to rounding error, these conditions will not hold
exactly in a digital computer, so the debugging test should include some tolerance
parameter.
11.6 Example: Multi-Digit Number Recognition
To provide an end-to-end description of how to apply our design methodology
in practice, we present a brief account of the Street View transcription system,
from the point of view of designing the deep learning components. Obviously,
many other components of the complete system, such as the Street View cars, the
database infrastructure, and so on, were of paramount importance.
From the point of view of the machine learning task, the process began with
data collection. The cars collected the raw data and human operators provided
labels. The transcription task was preceded by a significant amount of dataset
443
CHAPTER 11. PRACTICAL METHODOLOGY
curation, including using other machine learning techniques to
detect
the house
numbers prior to transcribing them.
The transcription project began with a choice of performance metrics and
desired values for these metrics. An important general principle is to tailor the
choice of metric to the business goals for the project. Because maps are only useful
if they have high accuracy, it was important to set a high accuracy requirement
for this project. Specifically, the goal was to obtain human-level, 98% accuracy.
This level of accuracy may not always be feasible to obtain. In order to reach
this level of accuracy, the Street View transcription system sacrifices coverage.
Coverage thus became the main performance metric optimized during the project,
with accuracy held at 98%. As the convolutional network improved, it became
possible to reduce the confidence threshold below which the network refuses to
transcribe the input, eventually exceeding the goal of 95% coverage.
After choosing quantitative goals, the next step in our recommended methodol-
ogy is to rapidly establish a sensible baseline system. For vision tasks, this means a
convolutional network with rectified linear units. The transcription project began
with such a model. At the time, it was not common for a convolutional network
to output a sequence of predictions. In order to begin with the simplest possible
baseline, the first implementation of the output layer of the model consisted of
n
different softmax units to predict a sequence of
n
characters. These softmax units
were trained exactly the same as if the task were classification, with each softmax
unit trained independently.
Our recommended methodology is to iteratively refine the baseline and test
whether each change makes an improvement. The first change to the Street View
transcription system was motivated by a theoretical understanding of the coverage
metric and the structure of the data. Specifically, the network refuses to classify
an input
x
whenever the probability of the output sequence
p
(
y | x
)
< t
for
some threshold
t
. Initially, the definition of
p
(
y | x
) was ad-hoc, based on simply
multiplying all of the softmax outputs together. This motivated the development
of a specialized output layer and cost function that actually computed a principled
log-likelihood. This approach allowed the example rejection mechanism to function
much more effectively.
At this point, coverage was still below 90%, yet there were no obvious theoretical
problems with the approach. Our methodology therefore suggests to instrument
the train and test set performance in order to determine whether the problem
is underfitting or overfitting. In this case, train and test set error were nearly
identical. Indeed, the main reason this project proceeded so smoothly was the
availability of a dataset with tens of millions of labeled examples. Because train
444
CHAPTER 11. PRACTICAL METHODOLOGY
and test set error were so similar, this suggested that the problem was either due
to underfitting or due to a problem with the training data. One of the debugging
strategies we recommend is to visualize the model’s worst errors. In this case, that
meant visualizing the incorrect training set transcriptions that the model gave the
highest confidence. These proved to mostly consist of examples where the input
image had been cropped too tightly, with some of the digits of the address being
removed by the cropping operation. For example, a photo of an address “1849”
might be cropped too tightly, with only the “849” remaining visible. This problem
could have been resolved by spending weeks improving the accuracy of the address
number detection system responsible for determining the cropping regions. Instead,
the team took a much more practical decision, to simply expand the width of the
crop region to be systematically wider than the address number detection system
predicted. This single change added ten percentage points to the transcription
system’s coverage.
Finally, the last few percentage points of performance came from adjusting
hyperparameters. This mostly consisted of making the model larger while main-
taining some restrictions on its computational cost. Because train and test error
remained roughly equal, it was always clear that any performance deficits were due
to underfitting, as well as due to a few remaining problems with the dataset itself.
Overall, the transcription project was a great success, and allowed hundreds of
millions of addresses to be transcribed both faster and at lower cost than would
have been possible via human effort.
We hope that the design principles described in this chapter will lead to many
other similar successes.
445