neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and
Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the
performance of ASR based on neural nets approximately matched the performance
of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved
a 26% phoneme error rate on the TIMIT corpus (Garofolo et al., 1993), which has
39 phonemes to discriminate between, a result better than or comparable to that
of HMM-based systems. Since then, TIMIT has served as a benchmark for phoneme
recognition, playing a role similar to the one MNIST plays for object recognition.
However, because of the complex engineering involved in software systems for
speech recognition and the effort that had been invested in building these systems
on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both
academic and industrial research on neural nets for speech recognition mostly
focused on using them to learn extra features for GMM-HMM systems.
Later, with much larger and deeper models and much larger datasets,
recognition accuracy was dramatically improved by using neural networks to
replace GMMs for the task of associating acoustic features to phonemes (or sub-
phonemic states). Starting in 2009, speech researchers applied a form of deep
learning based on unsupervised learning to speech recognition. This approach
relied on training undirected probabilistic models called restricted Boltzmann
machines (RBMs) to model the input data. RBMs will be
described in Part III. To solve speech recognition tasks, unsupervised pretraining
was used to build deep feedforward networks whose layers were each initialized
by training an RBM. These networks take spectral acoustic representations in
a fixed-size input window (around a center frame) and predict the conditional
probabilities of HMM states for that center frame. Training such deep networks
helped to significantly improve the recognition rate on TIMIT (Mohamed et al.,
2009, 2012a), bringing down the phoneme error rate from about 26% to 20.7%.
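To make this pipeline concrete, the following is a minimal sketch, not an
implementation from any of the cited papers, of the forward pass of such a
window-based acoustic model. All dimensions (40 spectral coefficients per
frame, five frames of context on each side, three hidden layers of 1,024
units, 2,000 HMM states) are illustrative assumptions rather than values from
the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions (assumptions, not values from the papers):
# 40 spectral coefficients per frame, 5 frames of context on each side,
# three hidden layers of 1024 units, 2000 HMM states.
n_coeffs, context = 40, 5
sizes = [n_coeffs * (2 * context + 1), 1024, 1024, 1024, 2000]

rng = np.random.default_rng(0)
# In the pretraining approach described above, each weight matrix would be
# initialized from an RBM trained layer by layer; here we use random values.
params = [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def hmm_state_posteriors(frames, t):
    """Conditional probabilities of HMM states for the center frame t."""
    window = frames[t - context:t + context + 1].reshape(-1)  # fixed-size input
    h = window
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

frames = rng.normal(size=(100, n_coeffs))        # stand-in spectral features
print(hmm_state_posteriors(frames, t=50).shape)  # (2000,)
```

In a full hybrid system, these per-frame posteriors would typically be
converted to scaled likelihoods and combined with the HMM transition model by
a decoder.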
See Mohamed et al. (2012b) for an analysis of reasons for the success of these
models. Extensions to the basic phone recognition pipeline included the addition
of speaker-adaptive features (Mohamed et al., 2011) that further reduced the
error rate. This was quickly followed by work extending the architecture from
phoneme recognition (the focus of TIMIT) to large-vocabulary speech recognition
(Dahl et al., 2012), which involves recognizing not just phonemes but also
sequences of words drawn from a large vocabulary. Deep networks
for speech recognition eventually shifted from being based on pretraining and
Boltzmann machines to being based on techniques such as rectified linear units and
dropout (Zeiler et al., 2013; Dahl et al., 2013).
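As a hedged illustration of that shift, the sigmoid hidden layers in the
earlier sketch can be replaced with rectified linear layers trained with
dropout. The dropout rate and the inverted-dropout rescaling used here are
illustrative choices, not details from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def hidden_layer(h, W, b, p_drop=0.5, train=True):
    """One rectified linear hidden layer with inverted dropout."""
    h = relu(h @ W + b)
    if train:
        mask = rng.random(h.shape) > p_drop  # drop each unit with probability p_drop
        h = h * mask / (1.0 - p_drop)        # rescale so expected activations match test time
    return h
```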
By that time, several of the major speech groups in industry had started
exploring deep learning in collaboration with