5.7.3 Other Simple Supervised Learning Algorithms
We have already briefly encountered another non-probabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a non-parametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set. This works for essentially any kind of supervised learning where we can define an average over y values. In the case of classification, we can average over one-hot code vectors c with c_y = 1 and c_i = 0 for all other values of i. We can then interpret the average over these one-hot codes as giving a probability distribution over classes.
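The prediction rule described above is short enough to state directly in code. The following minimal sketch (the helper name knn_predict and the example data are illustrative, and NumPy is assumed for brevity) averages the y values of the k nearest training points: averaging scalar targets gives k-nearest neighbors regression, while averaging one-hot code vectors gives a distribution over classes.

import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the y values of the k training points closest to x.

    If y_train holds scalars, this is k-NN regression; if it holds
    one-hot code vectors, the average is a distribution over classes.
    """
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors (ties broken arbitrarily).
    nearest = np.argsort(dists)[:k]
    # Return the average of the corresponding targets.
    return y_train[nearest].mean(axis=0)

# Regression example: y is a scalar per training point.
X_train = np.random.randn(50, 2)
y_train = X_train[:, 0] + 0.1 * np.random.randn(50)
print(knn_predict(X_train, y_train, np.zeros(2), k=3))

# Classification example: y is a one-hot code vector c with c_y = 1.
labels = np.random.randint(0, 3, size=50)
one_hot = np.eye(3)[labels]
print(knn_predict(X_train, one_hot, np.zeros(2), k=5))  # estimated class probabilities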
As a non-parametric learning algorithm, k-nearest neighbors can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly. When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set.

One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ R^100 drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases. Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
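A quick simulation makes this failure concrete. The sketch below (a hypothetical setup with NumPy; the sample sizes, seed, and the helper name nearest_neighbor_regression are all illustrative choices) draws 100-dimensional isotropic Gaussian inputs with y = x_1 exactly, then runs 1-nearest-neighbor regression on a small training set. Because the distance is dominated by the 99 irrelevant features, the predictions are comparable to, or worse than, simply predicting the mean of y.

import numpy as np

rng = np.random.default_rng(0)

def nearest_neighbor_regression(X_train, y_train, X_test):
    """Predict each test point's y as the y of its single nearest training point."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

n_train, n_test, d = 100, 1000, 100
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train[:, 0]   # y = x_1 exactly
y_test = X_test[:, 0]

pred = nearest_neighbor_regression(X_train, y_train, X_test)
print("1-NN mean squared error:        ", np.mean((pred - y_test) ** 2))
print("Predict-the-mean (zero) MSE:    ", np.mean(y_test ** 2))
# The nearest neighbor is selected almost entirely by the 99 irrelevant
# features, so its x_1 value is nearly unrelated to the query's x_1 and the
# 1-NN error does not improve on the trivial baseline, even though y is a
# deterministic function of a single input feature.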