Chapter 15
Representation Learning
In this chapter, we first discuss what it means to learn representations and how
the notion of representation can be useful to design deep architectures. We discuss
how learning algorithms share statistical strength across different tasks, including
using information from unsupervised tasks to perform supervised tasks. Shared
representations are useful to handle multiple modalities or domains, or to transfer
learned knowledge to tasks for which few or no examples are given but a task
representation exists. Finally, we step back and argue about the reasons for the
success of representation learning, starting with the theoretical advantages of
distributed representations (Hinton et al., 1986) and deep representations and
ending with the more general idea of underlying assumptions about the data
generating process, in particular about underlying causes of the observed data.
Many information processing tasks can be very easy or very difficult depending
on how the information is represented. This is a general principle applicable to
daily life, computer science in general, and to machine learning. For example, it
is straightforward for a person to divide 210 by 6 using long division. The task
becomes considerably less straightforward if it is instead posed using the Roman
numeral representation of the numbers. Most modern people asked to divide CCX
by VI would begin by converting the numbers to the Arabic numeral representation,
permitting long division procedures that make use of the place value system. More
concretely, we can quantify the asymptotic runtime of various operations using
appropriate or inappropriate representations. For example, inserting a number
into the correct position in a sorted list of numbers is an O(n) operation if the
list is represented as a linked list, but only O(log n) if the list is represented as a
red-black tree.
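As a small, hedged illustration of this point (not from the text; the numbers are arbitrary), the Python sketch below inserts into a sorted sequence. With a list-backed representation the insert itself costs O(n), while a balanced search tree would bring the whole operation down to O(log n); the data is identical, only the representation changes.

```python
import bisect

# A sorted Python list: finding the insertion point is O(log n) via binary
# search, but the insert must shift later elements, so the operation is O(n),
# comparable to walking a linked list. A balanced search tree (e.g., a
# red-black tree, not in the standard library) would support the same insert
# in O(log n).
sorted_numbers = [2, 3, 5, 8, 13, 21]
bisect.insort(sorted_numbers, 7)   # O(log n) search + O(n) shift
print(sorted_numbers)              # [2, 3, 5, 7, 8, 13, 21]
```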
In the context of machine learning, what makes one representation better than
another? Generally speaking, a good representation is one that makes a subsequent
learning task easier. The choice of representation will usually depend on the choice
of the subsequent learning task.
We can think of feedforward networks trained by supervised learning as per-
forming a kind of representation learning. Specifically, the last layer of the network
is typically a linear classifier, such as a softmax regression classifier. The rest of
the network learns to provide a representation to this classifier. Training with a
supervised criterion naturally leads to the representation at every hidden layer (but
more so near the top hidden layer) taking on properties that make the classification
task easier. For example, classes that were not linearly separable in the input
features may become linearly separable in the last hidden layer. In principle, the
last layer could be another kind of model, such as a nearest neighbor classifier
(Salakhutdinov and Hinton, 2007a). The features in the penultimate layer should
learn different properties depending on the type of the last layer.
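To make this picture concrete, here is a minimal PyTorch sketch (layer sizes, names, and data are invented for illustration, not prescribed by the text): everything up to the penultimate layer can be read as the learned representation, and the final linear layer is the softmax regression classifier trained on top of it.

```python
import torch
import torch.nn as nn

# The "representation" part: all layers except the last.
representation = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
# The last layer: a linear classifier; the softmax is folded into the loss.
classifier = nn.Linear(128, 10)

x = torch.randn(32, 784)          # a batch of raw inputs
h = representation(x)             # features in the last hidden layer
logits = classifier(h)            # class scores computed linearly from h
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (32,)))
loss.backward()                   # supervised training shapes h as a side effect
```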
Supervised training of feedforward networks does not involve explicitly imposing
any condition on the learned intermediate features. Other kinds of representation
learning algorithms are often explicitly designed to shape the representation in
some particular way. For example, suppose we want to learn a representation that
makes density estimation easier. Distributions with more independences are easier
to model, so we could design an objective function that encourages the elements
of the representation vector h to be independent. Just like supervised networks,
unsupervised deep learning algorithms have a main training objective but also
learn a representation as a side effect. Regardless of how a representation was
obtained, it can be used for another task. Alternatively, multiple tasks (some
supervised, some unsupervised) can be learned together with some shared internal
representation.
Most representation learning problems face a tradeoff between preserving as
much information about the input as possible and attaining nice properties (such
as independence).
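One of many possible ways to encode this tradeoff is sketched below (an assumption-laden illustration, not a prescribed objective): a reconstruction term preserves information about the input, while a decorrelation penalty pushes the entries of h toward (linear) independence; the weight 0.1 is an arbitrary choice.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(100, 20), nn.Tanh())
decoder = nn.Linear(20, 100)

x = torch.randn(64, 100)
h = encoder(x)
x_hat = decoder(h)

# Term 1: preserve information about the input (reconstruction error).
reconstruction = ((x_hat - x) ** 2).mean()

# Term 2: encourage the entries of h to be (linearly) independent by
# penalizing off-diagonal entries of their covariance matrix.
h_centered = h - h.mean(dim=0, keepdim=True)
cov = h_centered.t() @ h_centered / h.shape[0]
off_diag = cov - torch.diag(torch.diagonal(cov))
independence_penalty = (off_diag ** 2).sum()

loss = reconstruction + 0.1 * independence_penalty  # 0.1: arbitrary tradeoff weight
loss.backward()
```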
Representation learning is particularly interesting because it provides one
way to perform unsupervised and semi-supervised learning. We often have very
large amounts of unlabeled training data and relatively little labeled training
data. Training with supervised learning techniques on the labeled subset often
results in severe overfitting. Semi-supervised learning offers the chance to resolve
this overfitting problem by also learning from the unlabeled data. Specifically,
we can learn good representations for the unlabeled data, and then use these
representations to solve the supervised learning task.
Humans and animals are able to learn from very few labeled examples. We do
not yet know how this is possible. Many factors could explain improved human
performance—for example, the brain may use very large ensembles of classifiers
or Bayesian inference techniques. One popular hypothesis is that the brain is
able to leverage unsupervised or semi-supervised learning. There are many ways
to leverage unlabeled data. In this chapter, we focus on the hypothesis that the
unlabeled data can be used to learn a good representation.
15.1 Greedy Layer-Wise Unsupervised Pretraining
Unsupervised learning played a key historical role in the revival of deep neural
networks, making it possible for the first time to train a deep supervised network without
requiring architectural specializations like convolution or recurrence. We call this
procedure unsupervised pretraining, or more precisely, greedy layer-wise unsuper-
vised pretraining. This procedure is a canonical example of how a representation
learned for one task (unsupervised learning, trying to capture the shape of the
input distribution) can sometimes be useful for another task (supervised learning
with the same input domain).
Greedy layer-wise unsupervised pretraining relies on a single-layer represen-
tation learning algorithm such as an RBM, a single-layer autoencoder, a sparse
coding model, or another model that learns latent representations. Each layer is
pretrained using unsupervised learning, taking the output of the previous layer
and producing as output a new representation of the data, whose distribution (or
its relation to other variables such as categories to predict) is hopefully simpler.
See Algorithm 15.1 for a formal description.
Greedy layer-wise training procedures based on unsupervised criteria have long
been used to sidestep the difficulty of jointly training the layers of a deep neural net
for a supervised task. This approach dates back at least as far as the Neocognitron
(Fukushima, 1975). The deep learning renaissance of 2006 began with the discovery
that this greedy learning procedure could be used to find a good initialization for
a joint learning procedure over all the layers, and that this approach could be used
to successfully train even fully connected architectures (Hinton et al., 2006; Hinton
and Salakhutdinov, 2006; Hinton, 2006; Bengio et al., 2007; Ranzato et al., 2007a).
Prior to this discovery, only convolutional deep networks or networks whose depth
resulted from recurrence were regarded as feasible to train. Today, we know
that greedy layer-wise pretraining is not required to train fully connected deep
architectures, but the unsupervised pretraining approach was the first method to
succeed.
Greedy layer-wise pretraining is called greedy because it is a greedy algorithm,
meaning that it optimizes each piece of the solution independently, one piece at a
time, rather than jointly optimizing all pieces. It is called layer-wise because these
independent pieces are the layers of the network. Specifically, greedy layer-wise
pretraining proceeds one layer at a time, training the k-th layer while keeping the
previous ones fixed. In particular, the lower layers (which are trained first) are not
adapted after the upper layers are introduced. It is called unsupervised because each
layer is trained with an unsupervised representation learning algorithm. However,
it is also called pretraining, because it is supposed to be only a first step before
a joint training algorithm is applied to fine-tune all the layers together. In the
context of a supervised learning task, it can be viewed as a regularizer (in some
experiments, pretraining decreases test error without decreasing training error)
and a form of parameter initialization.
It is common to use the word “pretraining” to refer not only to the pretraining
stage itself but to the entire two-phase protocol that combines the pretraining
phase and a supervised learning phase. The supervised learning phase may involve
training a simple classifier on top of the features learned in the pretraining phase,
or it may involve supervised fine-tuning of the entire network learned in the
pretraining phase. No matter what kind of unsupervised learning algorithm or
what model type is employed, in the vast majority of cases, the overall training
scheme is nearly the same. While the choice of unsupervised learning algorithm
will obviously impact the details, most applications of unsupervised pretraining
follow this basic protocol.
Greedy layer-wise unsupervised pretraining can also be used as initialization
for other unsupervised learning algorithms, such as deep autoencoders (Hinton
and Salakhutdinov, 2006) and probabilistic models with many layers of latent
variables. Such models include deep belief networks (Hinton et al., 2006) and deep
Boltzmann machines (Salakhutdinov and Hinton, 2009a). These deep generative
models will be described in Chapter 20.
As discussed in Sec. 8.7.4, it is also possible to have greedy layer-wise super-
vised pretraining. This builds on the premise that training a shallow network is
easier than training a deep one, which seems to have been validated in several
contexts (Erhan et al., 2010).
15.1.1 When and Why Does Unsupervised Pretraining Work?
On many tasks, greedy layer-wise unsupervised pretraining can yield substantial improvements in test error for classification tasks. This observation was responsible for the renewed interest in deep neural networks starting in 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a). On many other tasks, however, unsupervised pretraining either does not confer a benefit or even causes noticeable harm. Ma et al. (2015) studied the effect of pretraining on machine learning models for chemical activity prediction and found that, on average, pretraining was slightly harmful, but for many tasks was significantly helpful. Because unsupervised pretraining is sometimes helpful but often harmful, it is important to understand when and why it works in order to determine whether it is applicable to a particular task.

Algorithm 15.1 Greedy layer-wise unsupervised pretraining protocol.

Given the following: an unsupervised feature learning algorithm L, which takes a training set of examples and returns an encoder or feature function f. The raw input data is X, with one row per example, and f^(1)(X) is the output of the first stage encoder on X and the dataset used by the second level unsupervised feature learner. In the case where fine-tuning is performed, we use a learner T which takes an initial function f, input examples X (and, in the supervised fine-tuning case, associated targets Y), and returns a tuned function. The number of stages is m.

f ← Identity function
X̃ = X
for k = 1, . . . , m do
    f^(k) = L(X̃)
    f ← f^(k) ∘ f
    X̃ ← f^(k)(X̃)
end for
if fine-tuning then
    f ← T(f, X, Y)
end if
Return f
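The following Python/PyTorch sketch mirrors the control flow of Algorithm 15.1 under simplifying assumptions: the single-layer learner L is stood in for by a small autoencoder trainer (train_layer, a hypothetical helper defined here), all stages share one width, and the optional fine-tuning step with the learner T is omitted.

```python
import torch
import torch.nn as nn

def train_layer(data, width=64, steps=100):
    """Stand-in for the single-layer learner L: fits a small autoencoder to
    `data` and returns its encoder as the feature function f^(k)."""
    enc = nn.Sequential(nn.Linear(data.shape[1], width), nn.Sigmoid())
    dec = nn.Linear(width, data.shape[1])
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((dec(enc(data)) - data) ** 2).mean()
        loss.backward()
        opt.step()
    return enc

def greedy_layerwise_pretrain(X, num_stages=3):
    """Mirrors the loop in Algorithm 15.1, without fine-tuning: each stage is
    trained on the output of the previous stage and then frozen."""
    encoders = []
    X_tilde = X
    for k in range(num_stages):
        f_k = train_layer(X_tilde)                 # f^(k) = L(X_tilde)
        encoders.append(f_k)
        with torch.no_grad():
            X_tilde = f_k(X_tilde)                 # X_tilde <- f^(k)(X_tilde)
    return nn.Sequential(*encoders)                # f = f^(m) o ... o f^(1)

pretrained = greedy_layerwise_pretrain(torch.randn(256, 100))
```

A supervised fine-tuning step, if desired, would correspond to passing the returned encoder (plus a classifier head) to an ordinary supervised training loop, as in the fine-tuning branch of Algorithm 15.1.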
At the outset, it is important to clarify that most of this discussion is restricted
to greedy unsupervised pretraining in particular. There are other, completely
different paradigms for performing semi-supervised learning with neural networks,
such as virtual adversarial training described in Sec. 7.13. It is also possible to
train an autoencoder or generative model at the same time as the supervised model.
Examples of this single-stage approach include the discriminative RBM (Larochelle
and Bengio, 2008) and the ladder network (Rasmus et al., 2015), in which the total
objective is an explicit sum of the two terms (one using the labels and one only
using the input).
Unsupervised pretraining combines two different ideas. First, it makes use of
the idea that the choice of initial parameters for a deep neural network can have
a significant regularizing effect on the model (and, to a lesser extent, that it can
improve optimization). Second, it makes use of the more general idea that learning
about the input distribution can help to learn about the mapping from inputs to
outputs.
Both of these ideas involve many complicated interactions between several
parts of the machine learning algorithm that are not entirely understood.
The first idea, that the choice of initial parameters for a deep neural network
can have a strong regularizing effect on its performance, is the least well understood.
At the time that pretraining became popular, it was understood as initializing the
model in a location that would cause it to approach one local minimum rather than
another. Today, local minima are no longer considered to be a serious problem
for neural network optimization. We now know that our standard neural network
training procedures usually do not arrive at a critical point of any kind. It remains
possible that pretraining initializes the model in a location that would otherwise
be inaccessible—for example, a region that is surrounded by areas where the cost
function varies so much from one example to another that minibatches give only
a very noisy estimate of the gradient, or a region surrounded by areas where the
Hessian matrix is so poorly conditioned that gradient descent methods must use
very small steps. However, our ability to characterize exactly what aspects of the
pretrained parameters are retained during the supervised training stage is limited.
This is one reason that modern approaches typically use simultaneous unsupervised
learning and supervised learning rather than two sequential stages. One may
also avoid struggling with these complicated ideas about how optimization in the
supervised learning stage preserves information from the unsupervised learning
stage by simply freezing the parameters for the feature extractors and using
supervised learning only to add a classifier on top of the learned features.
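A minimal PyTorch sketch of that last option follows (the encoder here is a randomly initialized placeholder standing in for whatever the unsupervised phase produced; sizes are arbitrary): freezing the feature extractor guarantees that nothing learned in pretraining is lost during the supervised stage.

```python
import torch
import torch.nn as nn

# `pretrained_features` stands in for the encoder produced by pretraining.
pretrained_features = nn.Sequential(nn.Linear(784, 128), nn.ReLU())

# Freeze the pretrained parameters so supervised training cannot alter them.
for param in pretrained_features.parameters():
    param.requires_grad = False

# Supervised learning only adds and trains a classifier on top.
classifier = nn.Linear(128, 10)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x = torch.randn(32, 784)
with torch.no_grad():              # features are treated as fixed
    h = pretrained_features(x)
loss = nn.CrossEntropyLoss()(classifier(h), torch.randint(0, 10, (32,)))
loss.backward()
optimizer.step()
```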
The other idea, that a learning algorithm can use information learned in the
unsupervised phase to perform better in the supervised learning stage, is better
understood. The basic idea is that some features that are useful for the unsupervised
task may also be useful for the supervised learning task. For example, if we train
a generative model of images of cars and motorcycles, it will need to know about
wheels, and about how many wheels should be in an image. If we are fortunate,
the representation of the wheels will take on a form that is easy for the supervised
learner to access. This is not yet understood at a mathematical, theoretical level,
so it is not always possible to predict which tasks will benefit from unsupervised
learning in this way. Many aspects of this approach are highly dependent on
the specific models used. For example, if we wish to add a linear classifier on
top of pretrained features, the features must make the underlying classes linearly
separable. These properties often occur naturally but do not always do so. This
is another reason that simultaneous supervised and unsupervised learning can be
preferable—the constraints imposed by the output layer are naturally included
from the start.
From the point of view of unsupervised pretraining as learning a representation,
we can expect unsupervised pretraining to be more effective when the initial
representation is poor. One key example of this is the use of word embeddings.
Words represented by one-hot vectors are not very informative because every two
distinct one-hot vectors are the same distance from each other (squared L^2 distance
of 2). Learned word embeddings naturally encode similarity between words by their
distance from each other. Because of this, unsupervised pretraining is especially
useful when processing words. It is less useful when processing images, perhaps
because images already lie in a rich vector space where distances provide a low
quality similarity metric.
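The contrast can be checked numerically; in the NumPy sketch below the embedding values are invented for illustration (real embeddings would be learned), while the one-hot distances follow directly from the definition.

```python
import numpy as np

# One-hot codes: every pair of distinct words is equally far apart.
cat, dog, refrigerator = np.eye(5)[0], np.eye(5)[1], np.eye(5)[2]
print(np.sum((cat - dog) ** 2), np.sum((cat - refrigerator) ** 2))  # 2.0 2.0

# Learned embeddings (made-up 3-D values for illustration): related words
# end up closer together than unrelated ones.
emb = {"cat": np.array([0.8, 0.1, 0.3]),
       "dog": np.array([0.7, 0.2, 0.4]),
       "refrigerator": np.array([-0.5, 0.9, -0.6])}
d = lambda a, b: np.sum((emb[a] - emb[b]) ** 2)
print(d("cat", "dog"), d("cat", "refrigerator"))  # small vs. large
```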
From the point of view of unsupervised pretraining as a regularizer, we can
expect unsupervised pretraining to be most helpful when the number of labeled
examples is very small. Because the source of information added by unsupervised
pretraining is the unlabeled data, we may also expect unsupervised pretraining
to perform best when the number of unlabeled examples is very large. The
advantage of semi-supervised learning via unsupervised pretraining with many
unlabeled examples and few labeled examples was made particularly clear in
2011 with unsupervised pretraining winning two international transfer learning
competitions (Mesnil et al., 2011; Goodfellow et al., 2011), in settings where the
number of labeled examples in the target task was small (from a handful to dozens
of examples per class). These effects were also documented in carefully controlled
experiments by Paine et al. (2014).
Other factors are likely to be involved. For example, unsupervised pretraining
is likely to be most useful when the function to be learned is extremely complicated.
Unsupervised learning differs from regularizers like weight decay because it does not
bias the learner toward discovering a simple function but rather toward discovering
feature functions that are useful for the unsupervised learning task. If the true
underlying functions are complicated and shaped by regularities of the input
distribution, unsupervised learning can be a more appropriate regularizer.
These caveats aside, we now analyze some success cases where unsupervised pretraining is known to cause an improvement, and explain what is known about why this improvement occurs. Unsupervised pretraining has usually been used to improve classifiers, and is usually most interesting from the point of view of reducing test set error. However, unsupervised pretraining can help tasks other than classification, and can act to improve optimization rather than being merely a regularizer. For example, it can improve both train and test reconstruction error for deep autoencoders (Hinton and Salakhutdinov, 2006).

Figure 15.1: Visualization via nonlinear projection of the learning trajectories of different neural networks in function space (not parameter space, to avoid the issue of many-to-one mappings from parameter vectors to functions), with different random initializations and with or without unsupervised pretraining. Each point corresponds to a different neural network, at a particular time during its training process. This figure is adapted with permission from Erhan et al. (2010). A coordinate in function space is an infinite-dimensional vector associating every input x with an output y. Erhan et al. (2010) made a linear projection to high-dimensional space by concatenating the y for many specific x points. They then made a further nonlinear projection to 2-D by Isomap (Tenenbaum et al., 2000). Color indicates time. All networks are initialized near the center of the plot (corresponding to the region of functions that produce approximately uniform distributions over the class y for most inputs). Over time, learning moves the function outward, to points that make strong predictions. Training consistently terminates in one region when using pretraining and in another, non-overlapping region when not using pretraining. Isomap tries to preserve global relative distances (and hence volumes) so the small region corresponding to pretrained models may indicate that the pretraining-based estimator has reduced variance.
Erhan et al. (2010) performed many experiments to explain several successes of
unsupervised pretraining. Both improvements to training error and improvements
to test error may be explained in terms of unsupervised pretraining taking the
parameters into a region that would otherwise be inaccessible. Neural network
training is non-deterministic, and converges to a different function every time it
is run. Training may halt at a point where the gradient becomes small, a point
where early stopping ends training to prevent overfitting, or at a point where the
gradient is large but it is difficult to find a downhill step due to problems such as
stochasticity or poor conditioning of the Hessian. Neural networks that receive
unsupervised pretraining consistently halt in the same region of function space,
while neural networks without pretraining consistently halt in another region. See
Fig. 15.1 for a visualization of this phenomenon. The region where pretrained
networks arrive is smaller, suggesting that pretraining reduces the variance of the
estimation process, which can in turn reduce the risk of severe over-fitting. In
other words, unsupervised pretraining initializes neural network parameters into
a region that they do not escape, and the results following this initialization are
more consistent and less likely to be very bad than without this initialization.
Erhan et al. (2010) also provide some answers as to when pretraining works
best: the mean and variance of the test error were most reduced by pretraining for
deeper networks. Keep in mind that these experiments were performed before the
deeper networks. Keep in mind that these experiments were performed before the
invention and popularization of modern techniques for training very deep networks
(rectified linear units, dropout and batch normalization) so less is known about the
effect of unsupervised pretraining in conjunction with contemporary approaches.
An important question is how unsupervised pretraining can act as a regularizer.
One hypothesis is that pretraining encourages the learning algorithm to discover
features that relate to the underlying causes that generate the observed data.
This is an important idea motivating many other algorithms besides unsupervised
pretraining, and is described further in Sec. 15.3.
Compared to other ways of incorporating this belief by using unsupervised
learning, unsupervised pretraining has the disadvantage that it operates with
two separate training phases. One reason that these two training phases are
disadvantageous is that there is not a single hyperparameter that predictably
reduces or increases the strength of the regularization arising from the unsupervised
pretraining. Instead, there are very many hyperparameters, whose effect may be
measured after the fact but is often difficult to predict ahead of time. When we
perform unsupervised and supervised learning simultaneously, instead of using the
pretraining strategy, there is a single hyperparameter, usually a coefficient attached
to the unsupervised cost, that determines how strongly the unsupervised objective
will regularize the supervised model. One can always predictably obtain less
regularization by decreasing this coefficient. In the case of unsupervised pretraining,
there is not a way of flexibly adapting the strength of the regularization—either
the supervised model is initialized to pretrained parameters, or it is not.
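In code, this single knob typically appears as a coefficient on the unsupervised term of a joint objective; a minimal sketch follows (architecture, sizes, and the value 0.5 are arbitrary choices for illustration).

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Linear(128, 784)     # unsupervised head (reconstruction)
classifier = nn.Linear(128, 10)   # supervised head

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
h = encoder(x)

supervised_loss = nn.CrossEntropyLoss()(classifier(h), y)
unsupervised_loss = ((decoder(h) - x) ** 2).mean()

# A single coefficient controls how strongly the unsupervised objective
# regularizes the supervised model; shrinking it predictably weakens the
# regularization, an option unavailable with two-stage pretraining.
lam = 0.5                          # arbitrary value for illustration
total_loss = supervised_loss + lam * unsupervised_loss
total_loss.backward()
```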
Another disadvantage of having two separate training phases is that each phase
has its own hyperparameters. The performance of the second phase usually cannot
be predicted during the first phase, so there is a long delay between proposing
hyperparameters for the first phase and being able to update them using feedback
from the second phase. The most principled approach is to use validation set error
in the supervised phase in order to select the hyperparameters of the pretraining
phase, as discussed in Larochelle et al. (2009). In practice, some hyperparameters,
like the number of pretraining iterations, are more conveniently set during the
pretraining phase, using early stopping on the unsupervised objective, which is
not ideal but computationally much cheaper than using the supervised objective.
Today, unsupervised pretraining has been largely abandoned, except in the
field of natural language processing, where the natural representation of words as
one-hot vectors conveys no similarity information and where very large unlabeled
sets are available. In that case, the advantage of pretraining is that one can pretrain
once on a huge unlabeled set (for example with a corpus containing billions of
words), learn a good representation (typically of words, but also of sentences), and
then use this representation or fine-tune it for a supervised task for which the
training set contains substantially fewer examples. This approach was pioneered
by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al.
(2011a) and remains in common use today.
Deep learning techniques based on supervised learning, regularized with dropout
or batch normalization, are able to achieve human-level performance on very many
tasks, but only with extremely large labeled datasets. These same techniques
outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10
and MNIST, which have roughly 5,000 labeled examples per class. On extremely
small datasets, such as the alternative splicing dataset, Bayesian methods outper-
form methods based on unsupervised pretraining (Srivastava, 2013). For these
reasons, the popularity of unsupervised pretraining has declined. Nevertheless,
unsupervised pretraining remains an important milestone in the history of deep
learning research and continues to influence contemporary approaches. The idea of
pretraining has been generalized to supervised pretraining discussed in Sec. 8.7.4,
as a very common approach for transfer learning. Supervised pretraining for
transfer learning is popular (Oquab et al., 2014; Yosinski et al., 2014) for use with
convolutional networks pretrained on the ImageNet dataset. Practitioners publish
the parameters of these trained networks for this purpose, just like pretrained word
vectors are published for natural language tasks (Collobert et al., 2011a; Mikolov
et al., 2013a).
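A common pattern for using such published parameters, sketched here with torchvision's ResNet-18 as one example (the number of target classes and the decision to fine-tune everything are arbitrary assumptions):

```python
import torch.nn as nn
import torchvision.models as models

# Load a convolutional network whose published ImageNet parameters serve as
# the pretraining (newer torchvision versions use a `weights=` argument
# instead of `pretrained=True`).
model = models.resnet18(pretrained=True)

# Replace the final classification layer with one sized for the new task,
# then fine-tune everything (or freeze the convolutional trunk, as desired).
model.fc = nn.Linear(model.fc.in_features, 20)  # 20: target class count, arbitrary
```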
15.2 Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P_1) is exploited to improve generalization in another setting (say, distribution P_2). This generalizes the idea presented in the previous section, where we transferred representations between an unsupervised learning task and a supervised learning task.
In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P_1 are relevant to the variations that need to be captured for learning P_2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting. If there is significantly more data in the first setting (sampled from P_1), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from P_2. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc. In general, transfer learning, multi-task learning (Sec. 7.7), and domain adaptation can be achieved via representation learning when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting. This is illustrated in Fig. 7.2, with shared lower layers and task-dependent upper layers.
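A minimal sketch of that shared-lower-layers arrangement follows (layer sizes, task names, and data are invented for illustration; nothing here is prescribed by the text): one trunk feeds two task-specific output heads, so both tasks shape, and benefit from, the same representation.

```python
import torch
import torch.nn as nn

# Shared lower layers: features reused across tasks (cf. Fig. 7.2).
shared = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                       nn.Linear(128, 64), nn.ReLU())

# Task-dependent upper layers: one output head per task.
head_cats_vs_dogs = nn.Linear(64, 2)
head_ants_vs_wasps = nn.Linear(64, 2)

x = torch.randn(16, 256)
h = shared(x)                      # the same representation feeds both tasks
logits_task1 = head_cats_vs_dogs(h)
logits_task2 = head_ants_vs_wasps(h)
```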
However, sometimes, what is shared among the different tasks is not the
semantics of the input but the semantics of the output. For example, a speech
recognition system needs to produce valid sentences at the output layer, but
the earlier layers near the input may need to recognize very different versions of
the same phonemes or sub-phonemic vocalizations depending on which person
is speaking. In cases like these, it makes more sense to share the upper layers
(near the output) of the neural network, and have a task-specific preprocessing, as
illustrated in Fig. 15.2.
Figure 15.2: Example architecture for multi-task or transfer learning when the output variable y has the same semantics for all tasks while the input variable x has a different meaning (and possibly even a different dimension) for each task (or, for example, each user), called x^(1), x^(2) and x^(3) for three tasks. The lower levels (up to the selection switch) are task-specific, while the upper levels are shared. The lower levels learn to translate their task-specific input into a generic set of features.
In the related case of domain adaptation, the task (and the optimal input-to-
output mapping) remains the same between each setting, but the input distribution
is slightly different. For example, consider the task of sentiment analysis, which
consists of determining whether a comment expresses positive or negative sentiment.
Comments posted on the web come from many categories. A domain adaptation
scenario can arise when a sentiment predictor trained on customer reviews of
media content such as books, videos and music is later used to analyze comments
about consumer electronics such as televisions or smartphones. One can imagine
that there is an underlying function that tells whether any statement is positive,
neutral or negative, but of course the vocabulary and style may vary from one
domain to another, making it more difficult to generalize across domains. Simple
unsupervised pretraining (with denoising autoencoders) has been found to be very
successful for sentiment analysis with domain adaptation (Glorot et al., 2011b).
A related problem is that of concept drift, which we can view as a form of transfer
learning due to gradual changes in the data distribution over time. Both concept
drift and transfer learning can be viewed as particular forms of multi-task learning.
While the phrase “multi-task learning” typically refers to supervised learning tasks,
the more general notion of transfer learning is applicable to unsupervised learning
and reinforcement learning as well.
In all of these cases, the objective is to take advantage of data from the first
setting to extract information that may be useful when learning or even when
directly making predictions in the second setting. The core idea of representation
learning is that the same representation may be useful in both settings. Using the
same representation in both settings allows the representation to benefit from the
training data that is available for both tasks.
As mentioned before, unsupervised deep learning for transfer learning has found
success in some machine learning competitions (Mesnil et al., 2011; Goodfellow
et al., 2011). In the first of these competitions, the experimental setup is the
following. Each participant is first given a dataset from the first setting (from distribution P_1), illustrating examples of some set of categories. The participants must use this to learn a good feature space (mapping the raw input to some representation), such that when we apply this learned transformation to inputs from the transfer setting (distribution P_2), a linear classifier can be trained and generalize well from very few labeled examples. One of the most striking results found in this competition is that as an architecture makes use of deeper and deeper representations (learned in a purely unsupervised way from data collected in the first setting, P_1), the learning curve on the new categories of the second (transfer) setting P_2 becomes much better. For deep representations, fewer labeled examples of the transfer tasks are necessary to achieve the apparently asymptotic generalization performance.
Two extreme forms of transfer learning are one-shot learning and zero-shot
learning, sometimes also called zero-data learning. Only one labeled example of
the transfer task is given for one-shot learning, while no labeled examples are given
at all for the zero-shot learning task.
One-shot learning (Fei-Fei et al., 2006) is possible because the representation
learns to cleanly separate the underlying classes during the first stage. During the
transfer learning stage, only one labeled example is needed to infer the label of many
possible test examples that all cluster around the same point in representation
space. This works to the extent that the factors of variation corresponding to
these invariances have been cleanly separated from the other factors, in the learned
representation space, and we have somehow learned which factors do and do not
matter when discriminating objects of certain categories.
As an example of a zero-shot learning setting, consider the problem of having
a learner read a large collection of text and then solve object recognition problems.
It may be possible to recognize a specific object class even without having seen an
image of that object, if the text describes the object well enough. For example,
having read that a cat has four legs and pointy ears, the learner might be able to
guess that an image is a cat, without having seen a cat before.
Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Palatucci
et al., 2009; Socher et al., 2013b) are only possible because additional information
has been exploited during training. We can think of the zero-data learning scenario as including three random variables: the traditional inputs x, the traditional outputs or targets y, and an additional random variable describing the task, T. The model is trained to estimate the conditional distribution p(y | x, T) where T is a description of the task we wish the model to perform. In our example of recognizing cats after having read about cats, the output is a binary variable y with y = 1 indicating "yes" and y = 0 indicating "no." The task variable T then represents questions to be answered such as "Is there a cat in this image?" If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of unseen instances of T. In our example of recognizing cats without having seen an image of the cat, it is important that we have had unlabeled text data containing sentences such as "cats have four legs" or "cats have pointy ears."
Zero-shot learning requires T to be represented in a way that allows some sort of generalization. For example, T cannot be just a one-hot code indicating an object category. Socher et al. (2013b) provide instead a distributed representation of object categories by using a learned word embedding for the word associated with each category.
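A toy sketch of this idea follows; it is not the method of Socher et al. (2013b), only a simplified stand-in in which the class embeddings are hand-made numbers rather than learned vectors, and the image representation is a made-up vector assumed to have already been mapped into the word space.

```python
import numpy as np

# Made-up word embeddings for category names; in practice these would come
# from a learned language model, not hand-chosen values.
class_embeddings = {
    "cat":   np.array([0.9, 0.1, 0.2]),
    "dog":   np.array([0.8, 0.2, 0.3]),
    "truck": np.array([-0.7, 0.6, -0.4]),
}

def predict_class(image_repr):
    """Zero-shot prediction: compare the image representation to every
    category embedding and pick the best match, even for categories that
    had no labeled images."""
    scores = {name: float(image_repr @ emb)
              for name, emb in class_embeddings.items()}
    return max(scores, key=scores.get)

# `image_repr` would come from a learned image encoder followed by a learned
# map into the word space; a made-up vector stands in for it here.
print(predict_class(np.array([0.85, 0.15, 0.25])))   # -> "cat"
```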
A similar phenomenon happens in machine translation (Klementiev et al., 2012;
Mikolov et al., 2013b; Gouws et al., 2014): we have words in one language, and
the relationships between words can be learned from unilingual corpora; on the
other hand, we have translated sentences which relate words in one language with
words in the other. Even though we may not have labeled examples translating word A in language X to word B in language Y, we can generalize and guess a translation for word A because we have learned a distributed representation for words in language X, a distributed representation for words in language Y, and created a link (possibly two-way) relating the two spaces, via training examples consisting of matched pairs of sentences in both languages. This transfer will be most successful if all three ingredients (the two representations and the relations between them) are learned jointly.
Zero-shot learning is a particular form of transfer learning. The same principle explains how one can perform multi-modal learning, capturing a representation in one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs (x, y) consisting of one observation x in one modality and another observation y in the other modality (Srivastava and Salakhutdinov, 2012). By learning all three sets of parameters (from x to its representation, from y to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs. The procedure is illustrated in Fig. 15.3.

Figure 15.3: Transfer learning between two domains x and y enables zero-shot learning. Labeled or unlabeled examples of x allow one to learn a representation function f_x and similarly with examples of y to learn f_y. Each application of the f_x and f_y functions appears as an upward arrow, with the style of the arrows indicating which function is applied. Distance in h_x space provides a similarity metric between any pair of points in x space that may be more meaningful than distance in x space. Likewise, distance in h_y space provides a similarity metric between any pair of points in y space. Both of these similarity functions are indicated with dotted bidirectional arrows. Labeled examples (dashed horizontal lines) are pairs (x, y) which allow one to learn a one-way or two-way map (solid bidirectional arrow) between the representations f_x(x) and the representations f_y(y) and anchor these representations to each other. Zero-data learning is then enabled as follows. One can associate an image x_test to a word y_test, even if no image of that word was ever presented, simply because word representations f_y(y_test) and image representations f_x(x_test) can be related to each other via the maps between representation spaces. It works because, although that image and that word were never paired, their respective feature vectors f_x(x_test) and f_y(y_test) have been related to each other. Figure inspired by a suggestion from Hrant Khachatrian.
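The sketch below renders the three ingredients of Fig. 15.3 in PyTorch under strong simplifying assumptions (random data, arbitrary dimensions, a plain squared-error alignment loss standing in for the learned relationship between the two spaces); it is an illustration of the structure, not any particular published model.

```python
import torch
import torch.nn as nn

# Two encoders, one per modality (dimensions are arbitrary for illustration).
f_x = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 32))  # images
f_y = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 32))  # words

y_unlabeled = torch.randn(128, 300)     # plenty of unpaired y examples
x_paired = torch.randn(8, 784)          # a few matched (x, y) pairs
y_paired = torch.randn(8, 300)

# The paired examples anchor the two representation spaces to each other;
# unsupervised losses on unpaired data (omitted here) would shape each space
# on its own.
alignment_loss = ((f_x(x_paired) - f_y(y_paired)) ** 2).mean()
alignment_loss.backward()

# Zero-shot association: relate a test image to the nearest word embedding,
# even if that image/word pair never appeared in training.
with torch.no_grad():
    h_x_test = f_x(torch.randn(1, 784))
    h_y_candidates = f_y(y_unlabeled)
    nearest_word = torch.cdist(h_x_test, h_y_candidates).argmin()
```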
15.3 Semi-Supervised Disentangling of Causal Factors
An important question about representation learning is “what makes one repre-
sentation better than another?” One hypothesis is that an ideal representation
is one in which the features within the representation correspond to the under-
lying causes of the observed data, with separate features or directions in feature
space corresponding to different causes, so that the representation disentangles the
causes from one another. This hypothesis motivates approaches in which we first
seek a good representation for p(x). Such a representation may also be a good representation for computing p(y | x) if y is among the most salient causes of x. This idea has guided a large amount of deep learning research since at least the 1990s (Becker and Hinton, 1992; Hinton and Sejnowski, 1999).
For other arguments about when semi-supervised learning can outperform pure
supervised learning, we refer the reader to Sec. 1.2 of Chapelle et al. (2006).
In other approaches to representation learning, we have often been concerned
with a representation that is easy to model—for example, one whose entries are
sparse, or independent from each other. A representation that cleanly separates
the underlying causal factors may not necessarily be one that is easy to model.
However, a further part of the hypothesis motivating semi-supervised learning
via unsupervised representation learning is that for many AI tasks, these two
properties coincide: once we are able to obtain the underlying explanations for
what we observe, it generally becomes easy to isolate individual attributes from
the others. Specifically, if a representation h represents many of the underlying causes of the observed x, and the outputs y are among the most salient causes, then it is easy to predict y from h.
First, let us see how semi-supervised learning can fail because unsupervised
learning of p(x) is of no help to learn p(y | x). Consider for example the case where p(x) is uniformly distributed and we want to learn f(x) = E[y | x]. Clearly, observing a training set of x values alone gives us no information about p(y | x).
Figure 15.4: Example of a density over x that is a mixture over three components. The component identity is an underlying explanatory factor, y. Because the mixture components (e.g., natural object classes in image data) are statistically salient, just modeling p(x) in an unsupervised way with no labeled example already reveals the factor y.
Next, let us see a simple example of how semi-supervised learning can succeed. Consider the situation where x arises from a mixture, with one mixture component per value of y, as illustrated in Fig. 15.4. If the mixture components are well-separated, then modeling p(x) reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn p(y | x).
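This scenario is easy to reproduce with scikit-learn on synthetic one-dimensional data (the data, component locations, and class names below are all invented for the illustration): unsupervised fitting of p(x) finds the components, and one labeled point per class suffices to name them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled data drawn from three well-separated mixture components.
x_unlabeled = np.concatenate([rng.normal(-5, 0.5, 500),
                              rng.normal(0, 0.5, 500),
                              rng.normal(5, 0.5, 500)]).reshape(-1, 1)

# Unsupervised modeling of p(x) recovers where each component lies.
gmm = GaussianMixture(n_components=3, random_state=0).fit(x_unlabeled)

# One labeled example per class is enough to name the components.
labeled_x = np.array([[-5.0], [0.0], [5.0]])
labeled_y = np.array(["class1", "class2", "class3"])
component_to_label = dict(zip(gmm.predict(labeled_x), labeled_y))

# Predicting a new point: find its component, then look up the label.
x_new = np.array([[4.6]])
print(component_to_label[gmm.predict(x_new)[0]])   # expected: "class3"
```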
But more generally, what could make p(y | x) and p(x) be tied together?
If y is closely associated with one of the causal factors of x, then p(x) and p(y | x) will be strongly tied, and unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.

Consider the assumption that y is one of the causal factors of x, and let h represent all those factors. The true generative process can be conceived as structured according to this directed graphical model, with h as the parent of x:

p(h, x) = p(x | h) p(h).    (15.1)

As a consequence, the data has marginal probability

p(x) = E_h p(x | h).    (15.2)

From this straightforward observation, we conclude that the best possible model of x (from a generalization point of view) is the one that uncovers the above "true" structure, with h as a latent variable that explains the observed variations in x. The "ideal" representation learning discussed above should thus recover these latent factors. If y is one of these (or closely related to one of them), then it will be very easy to learn to predict y from such a representation. We also see that the conditional distribution of y given x is tied by Bayes rule to the components in the above equation:

p(y | x) = p(x | y) p(y) / p(x).    (15.3)

Thus the marginal p(x) is intimately tied to the conditional p(y | x) and knowledge of the structure of the former should be helpful to learn the latter. Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.
An important research problem regards the fact that most observations are
formed by an extremely large number of underlying causes. Suppose y = h_i, but the unsupervised learner does not know which h_i. The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors h_j and disentangles them from each other, thus making it easy to predict y from h, regardless of which h_i is associated with y.
In practice, the brute force solution is not feasible because it is not possible
to capture all or most of the factors of variation that influence an observation.
For example, in a visual scene, should the representation always encode all of
the smallest objects in the background? It is a well-documented psychological
phenomenon that human beings fail to perceive changes in their environment that
are not immediately relevant to the task they are performing—see, e.g., Simons
and Levin (1998). An important research frontier in semi-supervised learning is
determining what to encode in each situation. Currently, two of the main strategies
for dealing with a large number of underlying causes are to use a supervised learning
signal at the same time as the unsupervised learning signal so that the model will
choose to capture the most relevant factors of variation, or to use much larger
representations if using purely unsupervised learning.
An emerging strategy for unsupervised learning is to modify the definition of
which underlying causes are most salient. Historically, autoencoders and generative
models have been trained to optimize a fixed criterion, often similar to mean
squared error. These fixed criteria determine which causes are considered salient.
For example, mean squared error applied to the pixels of an image implicitly
specifies that an underlying cause is only salient if it significantly changes the
brightness of a large number of pixels. This can be problematic if the task we
wish to solve involves interacting with small objects. See Fig. 15.5 for an example of a robotics task in which an autoencoder has failed to learn to encode a small ping pong ball. This same robot is capable of successfully interacting with larger objects, such as baseballs, which are more salient according to mean squared error.

Figure 15.5: An autoencoder trained with mean squared error for a robotics task has failed to reconstruct a ping pong ball. (Left) Input image. (Right) Reconstruction. The existence of the ping pong ball and all of its spatial coordinates are important underlying causal factors that generate the image and are relevant to the robotics task. Unfortunately, the autoencoder has limited capacity, and the training with mean squared error did not identify the ping pong ball as being salient enough to encode. Images graciously provided by Chelsea Finn.
Other definitions of salience are possible. For example, if a group of pixels
follow a highly recognizable pattern, even if that pattern does not involve extreme
brightness or darkness, then that pattern could be considered extremely salient.
One way to implement such a definition of salience is to use a recently developed
approach called generative adversarial networks (Goodfellow et al., 2014c). In
this approach, a generative model is trained to fool a feedforward classifier. The
feedforward classifier attempts to recognize all samples from the generative model
as being fake, and all samples from the training set as being real. In this framework,
any structured pattern that the feedforward network can recognize is highly salient.
The generative adversarial network will be described in more detail in Sec. 20.10.4.
For the purposes of the present discussion, it is sufficient to understand that they
learn
how to determine what is salient. Lotter et al. (2015) showed that models
trained to generate images of human heads will often neglect to generate the ears
when trained with mean squared error, but will successfully generate the ears when
trained with the adversarial framework. Because the ears are not extremely bright
or dark compared to the surrounding skin, they are not especially salient according
to mean squared error loss, but their highly recognizable shape and consistent
position means that a feedforward network can easily learn to detect them, making them highly salient under the generative adversarial framework. See Fig. 15.6 for example images.

Figure 15.6: Predictive generative networks provide an example of the importance of learning which features are salient. In this example, the predictive generative network has been trained to predict the appearance of a 3-D model of a human head at a specific viewing angle. (Left) Ground truth. This is the correct image that the network should emit. (Center) Image produced by a predictive generative network trained with mean squared error alone. Because the ears do not cause an extreme difference in brightness compared to the neighboring skin, they were not sufficiently salient for the model to learn to represent them. (Right) Image produced by a model trained with a combination of mean squared error and adversarial loss. Using this learned cost function, the ears are salient because they follow a predictable pattern. Learning which underlying causes are important and relevant enough to model is an important active area of research. Figures graciously provided by Lotter et al. (2015).

Generative adversarial networks are only one step toward determining which factors should be represented. We expect that future research will discover better ways of determining which factors to represent, and develop mechanisms for representing different factors depending on the task.
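As a reference point for the adversarial criterion described above, here is a minimal PyTorch sketch of the two GAN losses; the tiny fully connected networks and data are placeholders, not the architecture used in any of the cited work.

```python
import torch
import torch.nn as nn

# Generator: maps noise to fake "images"; discriminator: real vs. fake.
generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))
discriminator = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 784)                 # stand-in for training images
fake = generator(torch.randn(32, 16))

# Discriminator objective: label real samples 1 and generated samples 0.
d_loss = bce(discriminator(real), torch.ones(32, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(32, 1))

# Generator objective: fool the discriminator. Any structured pattern the
# discriminator can pick up on (ears, ping pong balls) becomes "salient,"
# because the generator is penalized for getting it wrong, unlike with a
# fixed mean squared error criterion.
g_loss = bce(discriminator(fake), torch.ones(32, 1))
```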
A benefit of learning the underlying causal factors, as pointed out by Schölkopf
et al. (2012), is that if the true generative process has x as an effect and y as a cause, then modeling p(x | y) is robust to changes in p(y). If the cause-effect relationship was reversed, this would not be true, since by Bayes rule, p(x | y) would be sensitive to changes in p(y). Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant ("the laws of the universe are constant") while the marginal distribution over the underlying causes can change. Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover the causal factors h and p(x | h).
15.4 Distributed Representation
Distributed representations of concepts—representations composed of many ele-
ments that can be set separately from each other—are one of the most important
tools for representation learning. Distributed representations are powerful because
they can use n features with k values to describe k^n different concepts. As we
have seen throughout this book, both neural networks with multiple hidden units
and probabilistic models with multiple latent variables make use of the strategy of
distributed representation. We now introduce an additional observation. Many
deep learning algorithms are motivated by the assumption that the hidden units
can learn to represent the underlying causal factors that explain the data, as
discussed in Sec. 15.3. Distributed representations are natural for this approach,
because each direction in representation space can correspond to the value of a
different underlying configuration variable.
An example of a distributed representation is a vector of n binary features, which can take 2^n configurations, each potentially corresponding to a different
region in input space, as illustrated in Fig. 15.7. This can be compared with
a symbolic representation, where the input is associated with a single symbol or
category. If there are n symbols in the dictionary, one can imagine n feature detectors, each corresponding to the detection of the presence of the associated category. In that case only n different configurations of the representation space are possible, carving n different regions in input space, as illustrated in Fig. 15.8. Such a symbolic representation is also called a one-hot representation, since it can be captured by a binary vector with n bits that are mutually exclusive (only one
of them can be active). A symbolic representation is a specific example of the
broader class of non-distributed representations, which are representations that
may contain many entries but without significant meaningful separate control over
each entry.
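The counting argument is easy to verify directly; the short Python sketch below (with n = 5 chosen arbitrarily) enumerates the codes available under each representation.

```python
from itertools import product

n = 5
# A distributed code of n binary features can name 2**n = 32 distinct concepts.
distributed_codes = list(product([0, 1], repeat=n))

# A one-hot (symbolic) code with n entries can only name n = 5 concepts.
one_hot_codes = [tuple(int(i == j) for j in range(n)) for i in range(n)]

print(len(distributed_codes), len(one_hot_codes))   # 32 5
```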
Examples of learning algorithms based on non-distributed representations
include:
• Clustering methods, including the k-means algorithm: each input point is assigned to exactly one cluster.

• k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of k > 1, there are multiple values describing each input, but they can not be controlled separately from each other, so this does not qualify as a true distributed representation.

• Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given.

• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are now associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is represented with multiple values, but those values cannot readily be controlled separately from each other.

• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each "support vector" or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.

• Language or translation models based on n-grams. The set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being w_1 and w_2, for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible).

For some of these non-distributed algorithms, the output is not constant by parts but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear.

Figure 15.7: Illustration of how a learning algorithm based on a distributed representation breaks up the input space into regions. In this example, there are three binary features h_1, h_2, and h_3. Each feature is defined by thresholding the output of a learned, linear transformation. Each feature divides R^2 into two half-planes. Let h_i^+ be the set of input points for which h_i = 1 and h_i^- be the set of input points for which h_i = 0. In this illustration, each line represents the decision boundary for one h_i, with the corresponding arrow pointing to the h_i^+ side of the boundary. The representation as a whole takes on a unique value at each possible intersection of these half-planes. For example, the representation value [1, 1, 1]^T corresponds to the region h_1^+ ∩ h_2^+ ∩ h_3^+. Compare this to the non-distributed representations in Fig. 15.8. In the general case of d input dimensions, a distributed representation divides R^d by intersecting half-spaces rather than half-planes. The distributed representation with n features assigns unique codes to O(n^d) different regions, while the nearest neighbor algorithm with n examples assigns unique codes to only n regions. The distributed representation is thus able to distinguish exponentially many more regions than the non-distributed one. Keep in mind that not all h values are feasible (there is no h = 0 in this example) and that a linear classifier on top of the distributed representation is not able to assign different class identities to every neighboring region; even a deep linear-threshold network has a VC dimension of only O(w log w) where w is the number of weights (Sontag, 1998). The combination of a powerful representation layer and a weak classifier layer can be a strong regularizer; a classifier trying to learn the concept of "person" versus "not a person" does not need to assign a different class to an input represented as "woman with glasses" than it assigns to an input represented as "man without glasses." This capacity constraint encourages each classifier to focus on few h_i and encourages h to learn to represent the classes in a linearly separable way.
An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, “cat” and “dog” are as far from each other as any other two symbols. However, if one associates them with a meaningful distributed representation, then many of the things that can be said about cats can generalize to dogs and vice-versa. For example, our distributed representation may contain entries such as has_fur or number_of_legs that have the same value for the embedding of both “cat” and “dog.” Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words, as discussed in Sec. 12.4. Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.
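To make the role of shared attributes concrete, the following toy sketch (my own illustration; the attribute names and values are invented, not a learned embedding) compares similarity under a one-hot symbolic encoding with similarity under a small attribute-based distributed encoding:

```python
# A minimal sketch contrasting a symbolic one-hot encoding with a hand-made
# distributed representation built from shared attributes. The attributes
# (has_fur, number_of_legs, has_wheels) and their values are illustrative
# assumptions, not a real learned embedding.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Symbolic representation: every concept gets its own axis, so all distinct
# concepts are equally far apart and nothing generalizes between them.
symbols = {"cat": np.array([1., 0., 0.]),
           "dog": np.array([0., 1., 0.]),
           "car": np.array([0., 0., 1.])}

# Distributed representation: concepts share attribute dimensions
# [has_fur, number_of_legs (scaled), has_wheels].
distributed = {"cat": np.array([1., 0.4, 0.]),
               "dog": np.array([1., 0.4, 0.]),
               "car": np.array([0., 0.0, 1.])}

print(cosine(symbols["cat"], symbols["dog"]))          # 0.0: cat and dog unrelated
print(cosine(distributed["cat"], distributed["dog"]))  # ~1.0: cat and dog nearby
print(cosine(distributed["cat"], distributed["car"]))  # 0.0: cat and car far apart
```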
Figure 15.8: Illustration of how the nearest neighbor algorithm breaks up the input space into different regions. The nearest neighbor algorithm provides an example of a learning algorithm based on a non-distributed representation. Different non-distributed algorithms may have different geometry, but they typically break the input space into regions, with a separate set of parameters for each region. The advantage of a non-distributed approach is that, given enough parameters, it can fit the training set without solving a difficult optimization problem, because it is straightforward to choose a different output independently for each region. The disadvantage is that such non-distributed models generalize only locally via the smoothness prior, making it difficult to learn a complicated function with more peaks and troughs than the available number of examples. Contrast this with a distributed representation, Fig. 15.7.

When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm? Distributed representations can
have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters. Some traditional non-distributed learning algorithms generalize only due to the smoothness assumption, which states that if $u \approx v$, then the target function $f$ to be learned has the property that $f(u) \approx f(v)$, in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example $(x, y)$ for which we know that $f(x) \approx y$, then we choose an estimator $\hat{f}$ that approximately satisfies these constraints while changing as little as possible when we move to a nearby input $x + \epsilon$. This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions,¹ we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value. However, this does not allow us to generalize to new symbols for new regions.
If we are lucky, there may be some regularity in the target function, besides being
smooth. For example, a convolutional network with max-pooling can recognize an
object regardless of its location in the image, even though spatial translation of
the object may not correspond to smooth transformations in the input space.
Let us examine a special case of a distributed representation learning algorithm that extracts binary features by thresholding linear functions of the input. Each binary feature in this representation divides $\mathbb{R}^d$ into a pair of half-spaces, as illustrated in Fig. 15.7. The exponentially large number of intersections of $n$ of the corresponding half-spaces determines how many regions this distributed representation learner can distinguish. How many regions are generated by an arrangement of $n$ hyperplanes in $\mathbb{R}^d$? By applying a general result concerning the intersection of hyperplanes (Zaslavsky, 1975), one can show (Pascanu et al., 2014b) that the number of regions this binary feature representation can distinguish is

$$\sum_{j=0}^{d} \binom{n}{j} = O(n^d). \tag{15.4}$$

Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.
¹ Potentially, we may want to learn a function whose behavior is distinct in exponentially many regions: in a $d$-dimensional space with at least two different values to distinguish per dimension, we might want $f$ to differ in $2^d$ different regions, requiring $O(2^d)$ training examples.
This provides a geometric argument to explain the generalization power of distributed representations: with $O(nd)$ parameters (for $n$ linear-threshold features in $\mathbb{R}^d$) we can distinctly represent $O(n^d)$ regions in input space. If instead we made no assumption at all about the data, used a representation with one unique symbol for each region, and used separate parameters for each symbol to recognize its corresponding portion of $\mathbb{R}^d$, then specifying $O(n^d)$ regions would require $O(n^d)$ examples. More generally, the argument in favor of the distributed representation could be extended to the case where instead of using linear threshold units we use nonlinear, possibly continuous, feature extractors for each of the attributes in the distributed representation. The argument in this case is that if a parametric transformation with $k$ parameters can learn about $r$ regions in input space, with $k \ll r$, and if obtaining such a representation was useful to the task of interest, then we could potentially generalize much better in this way than in a non-distributed setting where we would need $O(r)$ examples to obtain the same features and associated partitioning of the input space into $r$ regions. Using fewer parameters to represent the model means that we have fewer parameters to fit, and thus require far fewer training examples to generalize well.
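The counting argument can be checked empirically. The sketch below (my own illustration, with arbitrary choices of $n$, $d$, and sampling range) draws random linear-threshold features, counts how many distinct binary codes they actually assign to densely sampled inputs, and compares the count with the bound of Eq. 15.4 and with the $n$ regions available to a nearest-neighbor code built from $n$ examples:

```python
# A minimal empirical check of the region count in Eq. 15.4. We sample random
# points, compute their binary codes under n random linear-threshold features
# in R^d, and count how many distinct codes (regions) appear. The number of
# samples and the random hyperplanes are arbitrary choices for illustration.
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d = 10, 2                      # n thresholded linear features, d input dimensions
W = rng.standard_normal((n, d))   # hyperplane normals
b = rng.standard_normal(n)        # offsets

x = rng.uniform(-10, 10, size=(200_000, d))   # dense probe of the input space
codes = (x @ W.T + b > 0)                     # binary feature vectors h(x)
observed = len(np.unique(codes, axis=0))      # distinct regions actually hit

bound = sum(comb(n, j) for j in range(d + 1))  # sum_{j=0}^{d} C(n, j) = O(n^d)
print(observed, "regions observed;", bound, "is the hyperplane-arrangement bound")
# For n = 10 and d = 2 the bound is 1 + 10 + 45 = 56 regions, versus only
# n = 10 regions for a nearest-neighbor code built from 10 stored examples.
```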
A further part of the argument for why models based on distributed representations generalize well is that their capacity remains limited despite being able to distinctly encode so many different regions. For example, the VC dimension of a neural network of linear threshold units is only $O(w \log w)$, where $w$ is the number of weights (Sontag, 1998). This limitation arises because, while we can assign very many unique codes to representation space, we cannot use absolutely all of the code space, nor can we learn arbitrary functions mapping from the representation space $h$ to the output $y$ using a linear classifier. The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by $h$. We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear, XOR logic. For example, we typically do not want to partition the data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class.
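The XOR-style limitation is easy to verify directly. The sketch below (my own illustration; the binary factors are hypothetical) brute-forces linear threshold classifiers over two factors and confirms that a simple factor such as "is the object red?" is linearly separable, while the "green car or red truck" class is not:

```python
# A small demonstration that a linear classifier on the factors h = (is_red, is_car)
# can express "is the object red?" but not the XOR-style class "green car or red
# truck", matching the prior belief encoded by a linear output layer.
import itertools
import numpy as np

h = np.array(list(itertools.product([0, 1], repeat=2)))  # all factor settings
xor_labels = h[:, 0] ^ h[:, 1]                            # green car / red truck class
red_labels = h[:, 0]                                      # simple "is red" class

def linearly_separable(labels, grid=np.linspace(-3, 3, 25)):
    # Brute-force search over weights and bias for a separating linear threshold.
    for w1, w2, b in itertools.product(grid, grid, grid):
        pred = (h @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(pred, labels):
            return True
    return False

print(linearly_separable(red_labels))  # True: a single factor is linearly separable
print(linearly_separable(xor_labels))  # False: an XOR of factors is not
```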
The ideas discussed so far have been abstract, but they may be experimentally
validated. Zhou et al. (2015) find that hidden units in a deep convolutional
network trained on the ImageNet and Places benchmark datasets learn features
that are very often interpretable, corresponding to a label that humans would
naturally assign. In practice it is certainly not always the case that hidden units
learn something that has a simple linguistic name, but it is interesting to see this
emerge near the top levels of the best computer vision deep networks. What such
Figure 15.9: A generative model has learned a distributed representation that disentangles
the concept of gender from the concept of wearing glasses. If we begin with the repre-
sentation of the concept of a man with glasses, then subtract the vector representing the
concept of a man without glasses, and finally add the vector representing the concept
of a woman without glasses, we obtain the vector representing the concept of a woman
with glasses. The generative model correctly decodes all of these representation vectors to
images that may be recognized as belonging to the correct class. Images reproduced with
permission from Radford et al. (2015).
features have in common is that one could imagine learning about each of them without having to see all the configurations of all the others. Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation. Fig. 15.9 demonstrates that one direction in representation space corresponds to whether the person is male or female, while another corresponds to whether the person is wearing glasses. These features were discovered automatically, not fixed a priori. There is no need to have labels for the hidden unit classifiers: gradient descent on an objective function of interest naturally learns semantically interesting features, so long as the task requires such features. We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all of the configurations of the $n - 1$ other features by examples covering all of these combinations of values. This form of statistical separability is what allows one to generalize to new configurations of a person’s features that have never been seen during training.
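The vector arithmetic of Fig. 15.9 can be mimicked with a deliberately simplified stand-in. The sketch below is my own construction with made-up, perfectly disentangled directions; it is not the generative model of Radford et al. (2015), but it shows why adding and subtracting representation vectors moves along individual factors when each factor occupies its own direction:

```python
# A toy, fully synthetic illustration of the vector arithmetic in Fig. 15.9.
# We pretend the representation has two disentangled directions, one for gender
# and one for glasses, and "decode" simply by thresholding along each direction.
import numpy as np

# Hypothetical disentangled directions in representation space.
GENDER = np.array([1.0, 0.0])   # +1 ~ man, -1 ~ woman
GLASSES = np.array([0.0, 1.0])  # +1 ~ glasses, -1 ~ no glasses

def rep(man: bool, glasses: bool) -> np.ndarray:
    return (1 if man else -1) * GENDER + (1 if glasses else -1) * GLASSES

def describe(z: np.ndarray) -> str:
    gender = "man" if z @ GENDER > 0 else "woman"
    specs = "with glasses" if z @ GLASSES > 0 else "without glasses"
    return f"{gender} {specs}"

z = rep(man=True, glasses=True) - rep(man=True, glasses=False) \
    + rep(man=False, glasses=False)
print(describe(z))   # -> "woman with glasses"
```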
15.5 Exponential Gains from Depth
We have seen in Sec. 6.4.1 that multilayer perceptrons are universal approximators,
and that some functions can be represented by exponentially smaller deep networks
compared to shallow networks. This decrease in model size leads to improved
statistical efficiency. In this section, we describe how similar results apply more
generally to other kinds of models with distributed hidden representations.
In Sec. 15.4, we saw an example of a generative model that learned about the
explanatory factors underlying images of faces, including the person’s gender and
whether they are wearing glasses. The generative model that accomplished this
task was based on a deep neural network. It would not be reasonable to expect a
shallow network, such as a linear network, to learn the complicated relationship
between these abstract explanatory factors and the pixels in the image. In this and
other AI tasks, the factors that can be chosen almost independently in order to
generate data are more likely to be very high-level and related in highly nonlinear
ways to the input. We argue that this demands deep distributed representations,
where the higher level features (seen as functions of the input) or factors (seen as
generative causes) are obtained through the composition of many nonlinearities.
It has been proven in many different settings that organizing computation
through the composition of many nonlinearities and a hierarchy of reused features
can give an exponential boost to statistical efficiency, on top of the exponential
boost given by using a distributed representation. Many kinds of networks (e.g.,
with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with
a single hidden layer can be shown to be universal approximators. A model
family that is a universal approximator can approximate a large class of functions
(including all continuous functions) up to any non-zero tolerance level, given enough
hidden units. However, the required number of hidden units may be very large.
Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth $k$, but would require an exponential number of hidden units (with respect to the input size) with insufficient depth (depth 2 or depth $k - 1$).
In Sec. 6.4.1, we saw that deterministic feedforward networks are universal
approximators of functions. Many structured probabilistic models with a single
hidden layer of latent variables, including restricted Boltzmann machines and deep
belief networks, are universal approximators of probability distributions (Le Roux
and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al.,
2013).
In Sec. 6.4.1, we saw that a sufficiently deep feedforward network can have an
exponential advantage over a network that is too shallow. Such results can also
be obtained for other models such as probabilistic models. One such probabilistic
model is the sum-product network or SPN (Poon and Domingos, 2011). These
models use polynomial circuits to compute the probability distribution over a
set of random variables. Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum SPN depth is required to avoid needing an exponentially large model. Later, Martens and Medabalimi (2014) showed that there are significant differences between every pair of finite SPN depths, and that some of the constraints used to make SPNs tractable may limit their representational power.
Another interesting development is a set of theoretical results for the expressive
power of families of deep circuits related to convolutional nets, highlighting an
exponential advantage for the deep circuit even when the shallow circuit is allowed
to only approximate the function computed by the deep circuit (Cohen et al.,
2015). By comparison, previous theoretical work made claims regarding only the
case where the shallow circuit must exactly replicate particular functions.
15.6 Providing Clues to Discover Underlying Causes
To close this chapter, we come back to one of our original questions: what makes
one representation better than another? One answer, first introduced in Sec. 15.3,
is that an ideal representation is one that disentangles the underlying causal factors
of variation that generated the data, especially those factors that are relevant to our
applications. Most strategies for representation learning are based on introducing
clues that help the learner find these underlying factors of variation. The clues can help it separate these observed factors from the others. Supervised learning provides a very strong clue: a label $y$, presented with each $x$, that usually specifies the value of at least one of the factors of variation directly. More generally,
to make use of abundant unlabeled data, representation learning makes use of
other, less direct, hints about the underlying factors. These hints take the form of
implicit prior beliefs that we, the designers of the learning algorithm, impose in
order to guide the learner. Results such as the no free lunch theorem show that
regularization strategies are necessary to obtain good generalization. While it is
impossible to find a universally superior regularization strategy, one goal of deep
learning is to find a set of fairly generic regularization strategies that are applicable
to a wide variety of AI tasks, similar to the tasks that people and animals are able
to solve.
We provide here a list of these generic regularization strategies. The list is
clearly not exhaustive, but gives some concrete examples of ways that learning
algorithms can be encouraged to discover features that correspond to underlying
factors. This list was introduced in Sec. 3.1 of Bengio et al. (2013d) and has been
partially expanded here.
• Smoothness: This is the assumption that $f(x + \epsilon d) \approx f(x)$ for unit $d$ and small $\epsilon$. This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality.
• Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions; linear functions with large weights applied to high-dimensional spaces may not be very smooth. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption.
• Multiple explanatory factors: Many representation learning algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Sec. 15.3 describes how this view motivates semi-supervised learning via representation learning. Learning the structure of $p(x)$ requires learning some of the same features that are useful for modeling $p(y \mid x)$ because both refer to the same underlying explanatory factors. Sec. 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation.
• Causal factors: The model is constructed in such a way that it treats the factors of variation described by the learned representation $h$ as the causes of the observed data $x$, and not vice-versa. As discussed in Sec. 15.3, this is advantageous for semi-supervised learning and makes the learned model more robust when the distribution over the underlying causes changes or when we use the model for a new task.
• Depth, or a hierarchical organization of explanatory factors: High-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy. From another point of view, the use of a deep architecture expresses our belief that the task should be accomplished via a multi-step program, with each step referring back to the output of the processing accomplished via previous steps.
• Shared factors across tasks: In the context where we have many tasks, corresponding to different $y_i$ variables sharing the same input $x$, or where each task is associated with a subset or a function $f^{(i)}(x)$ of a global input $x$, the assumption is that each $y_i$ is associated with a different subset from a common pool of relevant factors $h$. Because these subsets overlap, learning all the $P(y_i \mid x)$ via a shared intermediate representation $P(h \mid x)$ allows sharing of statistical strength between the tasks (see the sketch after this list).
• Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data lives. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold.
• Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier, and adversarial training.
• Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values. See Sec. 13.3 for further description of this approach.
• Sparsity: Most features should presumably not be relevant to describing most inputs; there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as “present” or “absent” should be absent most of the time.
• Simplicity of factor dependencies: In good high-level representations, the factors are related to each other through simple dependencies. The simplest possible is marginal independence, $P(h) = \prod_i P(h_i)$, but linear dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
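To make the shared-factors clue referenced above concrete, here is a minimal sketch (my own construction; the task names, sizes, and single hidden layer are hypothetical) of the architecture it suggests, with several task-specific heads reading from one shared representation:

```python
# A minimal sketch of the "shared factors across tasks" clue: several task heads
# approximating P(y_i | x) read from one shared representation h = f(x), so every
# task's training signal shapes the same features. Sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 20, 16                        # input and shared-representation sizes
task_sizes = {"task_1": 3, "task_2": 5}   # number of classes per task (hypothetical)

# Shared trunk parameters (learned jointly by all tasks during training).
W, b = rng.standard_normal((d_h, d_in)) * 0.1, np.zeros(d_h)
# One small linear head per task.
heads = {name: (rng.standard_normal((k, d_h)) * 0.1, np.zeros(k))
         for name, k in task_sizes.items()}

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def predict(x):
    h = np.maximum(0.0, W @ x + b)      # shared representation, roughly P(h | x)
    return {name: softmax(V @ h + c)    # per-task outputs, roughly P(y_i | x)
            for name, (V, c) in heads.items()}

x = rng.standard_normal(d_in)
print({name: p.round(3) for name, p in predict(x).items()})
# Training all heads jointly backpropagates every task's loss into W and b,
# which is how the tasks share statistical strength through h.
```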
The concept of representation learning ties together all of the many forms
of deep learning. Feedforward and recurrent networks, autoencoders and deep
probabilistic models all learn and exploit representations. Learning the best
possible representation remains an exciting avenue of research.