Index
0-1 loss, 104, 276
Absolute value rectification, 192
Accuracy, 426
Activation function, 170
Active constraint, 95
AdaGrad, 307
ADALINE, see adaptive linear element
Adam, 308, 428
Adaptive linear element, 15, 24, 27
Adversarial example, 267
Adversarial training, 268, 270, 533
Affine, 110
AIS, see annealed importance sampling
Almost everywhere, 71
Almost sure convergence, 130
Ancestral sampling, 583, 598
ANN, see artificial neural network
Annealed importance sampling, 628, 670, 719
Approximate Bayesian computation, 718
Approximate inference, 586
Artificial intelligence, 1
Artificial neural network, see neural network
ASR, see automatic speech recognition
Asymptotically unbiased, 124
Audio, 102, 361, 461
Autoencoder, 4, 357, 505
Automatic speech recognition, 461
Back-propagation, 203
Back-propagation through time, 385
Backprop, see back-propagation
Bag of words, 474
Bagging, 255
Batch normalization, 266, 428
Bayes error, 117
Bayes’ rule, 70
Bayesian hyperparameter optimization, 439
Bayesian network, see directed graphical model
Bayesian probability, 55
Bayesian statistics, 135
Belief network, see directed graphical model
Bernoulli distribution, 62
BFGS, 316
Bias, 124, 229
Bias parameter, 110
Biased importance sampling, 596
Bigram, 465
Binary relation, 485
Block Gibbs sampling, 602
Boltzmann distribution, 573
Boltzmann machine, 573, 656
BPTT, see back-propagation through time
Broadcasting, 34
Burn-in, 600
CAE, see contractive autoencoder
Calculus of variations, 179
Categorical distribution, see multinoulli distribution
CD, see contrastive divergence
Centering trick (DBM), 675
Central limit theorem, 63
Chain rule (calculus), 206
Chain rule of probability, 59
Chess, 2
Chord, 582
Chordal graph, 582
Class-based language models, 466
Classical dynamical system, 376
Classification, 100
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collaborative filtering, 481
Collider, see explaining away
Color images, 361
Complex cell, 366
Computational graph, 204
Computer vision, 455
Concept drift, 541
Condition number, 279
Conditional computation, see dynamic structure
Conditional independence, xiii, 60
Conditional probability, 59
Conditional RBM, 687
Connectionism, 17, 446
Connectionist temporal classification, 463
Consistency, 130, 516
Constrained optimization, 93, 237
Content-based addressing, 422
Content-based recommender systems, 483
Context-specific independence, 576
Contextual bandits, 483
Continuation methods, 328
Contractive autoencoder, 524
Contrast, 457
Contrastive divergence, 291, 613, 674
Convex optimization, 141
Convolution, 331, 685
Convolutional network, 16
Convolutional neural network, 252, 331, 428, 463
Coordinate descent, 322, 673
Correlation, 61
Cost function, see objective function
Covariance, xiii, 61
Covariance matrix, 62
Coverage, 427
Critical temperature, 606
Cross-correlation, 333
Cross-entropy, 75, 132
Cross-validation, 122
CTC, see connectionist temporal classification
Curriculum learning, 329
Curse of dimensionality, 154
Cyc, 2
D-separation, 575
DAE, see denoising autoencoder
Data generating distribution, 111, 131
Data generating process, 111
Data parallelism, 450
Dataset, 105
Dataset augmentation, 270, 460
DBM, see deep Boltzmann machine
DCGAN, 554, 555, 703
Decision tree, 145, 551
Decoder, 4
Deep belief network, 27, 532, 634, 659, 662, 686, 694
Deep Blue, 2
Deep Boltzmann machine, 24, 27, 532, 634, 655, 659, 665, 674, 686
Deep feedforward network, 167, 428
Deep learning, 2, 5
Denoising autoencoder, 513, 691
Denoising score matching, 622
Density estimation, 103
Derivative, xiii, 83
Design matrix, 106
Detector layer, 340
Determinant, xii
Diagonal matrix, 41
Differential entropy, 74, 649
Dirac delta function, 65
Directed graphical model, 77, 510, 566, 694
Directional derivative, 85
Discriminative fine-tuning, see supervised fine-tuning
Discriminative RBM, 688
Distributed representation, 17, 150, 549
Domain adaptation, 539
Dot product, 34, 141
Double backprop, 270
Doubly block circulant matrix, 334
Dream sleep, 612, 655
DropConnect, 265
Dropout, 257, 428, 433, 434, 674, 691
Dynamic structure, 451, 452
E-step, 637
Early stopping, 246, 249, 272, 273, 428
EBM, see energy-based model
Echo state network, 24, 27, 406
Effective capacity, 114
Eigendecomposition, 42
Eigenvalue, 42
Eigenvector, 42
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 519
Empirical distribution, 66
Empirical risk, 276
Empirical risk minimization, 276
Encoder, 4
Energy function, 572
Energy-based model, 572, 598, 656, 665
Ensemble methods, 255
Epoch, 247
Equality constraint, 94
Equivariance, 339
Error function, see objective function
ESN, see echo state network
Euclidean norm, 39
Euler-Lagrange equation, 649
Evidence lower bound, 636, 663
Example, 99
Expectation, 60
Expectation maximization, 637
Expected value, see expectation
Explaining away, 577, 634, 647
Exploitation, 484
Exploration, 484
Exponential distribution, 65
F-score, 426
Factor (graphical model), 570
Factor analysis, 493
Factor graph, 582
Factors of variation, 4
Feature, 99
Feature selection, 236
Feedforward neural network, 167
Fine-tuning, 324
Finite differences, 442
Forget gate, 306
Forward propagation, 203
Fourier transform, 361, 363
Fovea, 367
FPCD, 617
Free energy, 574, 682
Freebase, 486
Frequentist probability, 55
Frequentist statistics, 135
Frobenius norm, 46
Fully-visible Bayes network, 707
Functional derivatives, 648
FVBN, see fully-visible Bayes network
Gabor function, 369
GANs, see generative adversarial networks
Gated recurrent unit, 428
Gaussian distribution, see normal distribution
Gaussian kernel, 142
Gaussian mixture, 67, 188
GCN, see global contrast normalization
GeneOntology, 486
Generalization, 110
Generalized Lagrange function, see generalized Lagrangian
Generalized Lagrangian, 94
Generative adversarial networks, 691, 702
Generative moment matching networks, 705
Generator network, 695
Gibbs distribution, 571
Gibbs sampling, 584, 602
Global contrast normalization, 457
GPU, see graphics processing unit
Gradient, 84
Gradient clipping, 289, 417
Gradient descent, 83, 85
Graph, xii
Graphical model, see structured probabilistic model
Graphics processing unit, 447
Greedy algorithm, 324
Greedy layer-wise unsupervised pretraining, 531
Greedy supervised pretraining, 324
Grid search, 435
Hadamard product, xii, 34
Hard tanh, 196
Harmonium, see restricted Boltzmann machine
Harmony theory, 574
Helmholtz free energy, see evidence lower bound
Hessian, 223
Hessian matrix, xiii, 87
Heteroscedastic, 187
Hidden layer, 6, 167
Hill climbing, 86
Hyperparameter optimization, 435
Hyperparameters, 120, 433
Hypothesis space, 112, 118
i.i.d. assumptions, 111, 122, 267
Identity matrix, 36
ILSVRC, see ImageNet Large-Scale Visual Recognition Challenge
ImageNet Large-Scale Visual Recognition Challenge, 23
Immorality, 580
Importance sampling, 595, 627, 700
Importance weighted autoencoder, 700
Independence, xiii, 60
Independent and identically distributed, see i.i.d. assumptions
Independent component analysis, 494
Independent subspace analysis, 496
Inequality constraint, 94
Inference, 565, 586, 634, 636, 638, 641, 651, 653
Information retrieval, 528
Initialization, 301
Integral, xiii
Invariance, 343
Isotropic, 65
Jacobian matrix, xiii, 72, 86
Joint probability, 57
k-means, 365, 549
k-nearest neighbors, 143, 551
Karush-Kuhn-Tucker, 94
Karush-Kuhn-Tucker conditions, 95, 237
Kernel (convolution), 332, 333
Kernel machine, 551
Kernel trick, 141
KKT, see Karush-Kuhn-Tucker
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence
Knowledge base, 2, 486
Krylov methods, 224
Kullback-Leibler divergence, xiii, 74
Label smoothing, 243
Lagrange multipliers, 94, 649
Lagrangian, see generalized Lagrangian
LAPGAN, 704
Laplace distribution, 65, 499, 500
Latent variable, 67
Layer (neural network), 167
LCN, see local contrast normalization
Leaky ReLU, 192
Leaky units, 409
Learning rate, 85
Line search, 85, 86, 93
Linear combination, 37
Linear dependence, 38
Linear factor models, 492
Linear regression, 107, 110, 140
Link prediction, 487
Lipschitz constant, 92
Lipschitz continuous, 92
Liquid state machine, 406
Local conditional probability distribution, 567
Local contrast normalization, 459
Logistic regression, 3, 140
Logistic sigmoid, 7, 67
Long short-term memory, 18, 25, 306, 411, 428
Loop, 582
Loopy belief propagation, 588
Loss function, see objective function
L^p norm, 39
LSTM, see long short-term memory
M-step, 637
Machine learning, 2
Machine translation, 101
Main diagonal, 33
Manifold, 160
Manifold hypothesis, 161
Manifold learning, 161
Manifold tangent classifier, 270
MAP approximation, 138, 508
Marginal probability, 58
Markov chain, 598
Markov chain Monte Carlo, 598
Markov network, see undirected model
Markov random field, see undirected model
Matrix, xi, xii, 32
Matrix inverse, 36
Matrix product, 34
Max norm, 40
Max pooling, 340
Maximum likelihood, 131
Maxout, 192, 428
MCMC, see Markov chain Monte Carlo
Mean field, 641, 642, 674
Mean squared error, 108
Measure theory, 71
Measure zero, 71
Memory network, 419, 421
Method of steepest descent, see gradient descent
Minibatch, 279
Missing inputs, 100
Mixing (Markov chain), 604
Mixture density networks, 188
Mixture distribution, 66
Mixture model, 188, 513
Mixture of experts, 453, 551
MLP, see multilayer perceptron
MNIST, 21, 22, 674
Model averaging, 255
Model compression, 451
Model identifiability, 284
Model parallelism, 450
Moment matching, 705
Moore-Penrose pseudoinverse, 45, 240
Moralized graph, 580
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model
MSE, see mean squared error
Multi-modal learning, 542
Multi-prediction DBM, 676
Multi-task learning, 245, 541
Multilayer perceptron, 5, 27
Multinomial distribution, 62
Multinoulli distribution, 62
n-gram, 464
NADE, 710
Naive Bayes, 3
Nat, 73
Natural image, 562
Natural language processing, 464
Nearest neighbor regression, 115
Negative definite, 89
Negative phase, 473, 609, 611
Neocognitron, 16, 24, 27, 368
Nesterov momentum, 300
Netflix Grand Prize, 256, 482
Neural language model, 466, 479
Neural network, 13
Neural Turing machine, 421
Neuroscience, 15
Newton’s method, 89, 310
NLM, see neural language model
NLP, see natural language processing
No free lunch theorem, 116
Noise-contrastive estimation, 623
Non-parametric model, 114
Norm, xiv, 39
Normal distribution, 63, 64, 125
Normal equations, 109, 112, 234
Normalized initialization, 303
Numerical differentiation, see finite differences
Object detection, 456
Object recognition, 456
Objective function, 82
OMP-k, see orthogonal matching pursuit
One-shot learning, 541
Operation, 204
Optimization, 80, 82
Orthodox statistics, see frequentist statistics
Orthogonal matching pursuit, 27, 254
Orthogonal matrix, 42
Orthogonality, 41
Output layer, 167
Parallel distributed processing, 17
Parameter initialization, 301, 408
Parameter sharing, 251, 336, 374, 376, 389
Parameter tying, see parameter sharing
Parametric model, 114
Parametric ReLU, 192
Partial derivative, 84
Partition function, 571, 608, 671
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 15, 27
Persistent contrastive divergence, see stochastic maximum likelihood
Perturbation analysis, see reparametrization trick
Point estimator, 122
Policy, 483
Pooling, 331, 685
Positive definite, 89
Positive phase, 473, 609, 611, 658, 670
Precision, 426
Precision (of a normal distribution), 63, 65
Predictive sparse decomposition, 526
Preprocessing, 456
Pretraining, 324, 531
Primary visual cortex, 366
Principal components analysis, 48, 146–148, 493, 634
Prior probability distribution, 135
Probabilistic max pooling, 685
Probabilistic PCA, 493, 494, 635
Probability density function, 58
Probability distribution, 56
Probability mass function, 56
Probability mass function estimation, 103
Product of experts, 573
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 618
Quadrature pair, 370
Quasi-Newton condition, 316
Quasi-Newton methods, 316
Radial basis function, 196
Random search, 437
Random variable, 56
Ratio matching, 621
RBF, 196
RBM, see restricted Boltzmann machine
Recall, 426
Receptive field, 338
Recommender systems, 481
Rectified linear unit, 171, 192, 428, 510
Recurrent network, 27
Recurrent neural network, 379
Regression, 101
Regularization, 120, 177, 228, 433
Regularizer, 119
REINFORCE, 691
Reinforcement learning, 25, 106, 483, 691
Relational database, 486
Reparametrization trick, 690
Representation learning, 3
Representational capacity, 114
Restricted Boltzmann machine, 357, 462, 482, 590, 634, 658, 659, 674, 678, 680, 683, 685
Ridge regression, see weight decay
Risk, 275
RNN-RBM, 688
Saddle points, 285
Sample mean, 125
Scalar, xi, xii, 31
Score matching, 516, 620
Secant condition, 316
Second derivative, 86
Second derivative test, 89
Self-information, 73
Semantic hashing, 528
Semi-supervised learning, 244
Separable convolution, 363
Separation (probabilistic modeling), 575
Set, xii
SGD, see stochastic gradient descent
Shannon entropy, xiii, 73
Shortlist, 469
Sigmoid, xiv, see logistic sigmoid
Sigmoid belief network, 27
Simple cell, 366
Singular value, see singular value decomposition
Singular value decomposition, 44, 148, 482
Singular vector, see singular value decomposition
Slow feature analysis, 496
SML, see stochastic maximum likelihood
Softmax, 183, 421, 453
Softplus, xiv, 68, 196
Spam detection, 3
Sparse coding, 322, 357, 499, 634, 694
Sparse initialization, 304, 408
Sparse representation, 146, 226, 253, 508, 559
Spearmint, 439
Spectral radius, 407
Speech recognition, see automatic speech recognition
Sphering, see whitening
Spike and slab restricted Boltzmann machine, 683
SPN, see sum-product network
Square matrix, 38
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 61
Standard error, 127
Standard error of the mean, 128, 278
Statistic, 122
Statistical learning theory, 110
Steepest descent, see gradient descent
Stochastic back-propagation, see reparametrization trick
Stochastic gradient descent, 15, 150, 279, 294, 674
Stochastic maximum likelihood, 615, 674
Stochastic pooling, 265
Structure learning, 585
Structured output, 101, 687
Structured probabilistic model, 77, 561
Sum rule of probability, 58
Sum-product network, 556
Supervised fine-tuning, 532, 664
Supervised learning, 105
Support vector machine, 140
Surrogate loss function, 276
SVD, see singular value decomposition
Symmetric matrix, 41, 43
Tangent distance, 269
Tangent plane, 519
Tangent prop, 269
TDNN, see time-delay neural network
Teacher forcing, 383, 384
Tempering, 606
Template matching, 141
Tensor, xi, xii, 33
Test set, 110
Tikhonov regularization, see weight decay
Tiled convolution, 353
Time-delay neural network, 369, 375
Toeplitz matrix, 334
Topographic ICA, 496
Trace operator, 46
Training error, 110
Transcription, 101
Transfer learning, 539
Transpose, xii, 33
Triangle inequality, 39
Triangulated graph, see chordal graph
Trigram, 465
Unbiased, 124
Undirected graphical model, 77, 510
Undirected model, 569
Uniform distribution, 57
Unigram, 465
Unit norm, 41
Unit vector, 41
Universal approximation theorem, 197
Universal approximator, 556
Unnormalized probability distribution, 570
Unsupervised learning, 105, 146
Unsupervised pretraining, 462, 531
V-structure, see explaining away
V1, 366
VAE, see variational autoencoder
Vapnik-Chervonenkis dimension, 114
Variance, xiii, 61, 229
Variational autoencoder, 691, 698
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimension
Vector, xi, xii, 32
Virtual adversarial examples, 268
Visible layer, 6
Volumetric data, 361
Wake-sleep, 654, 663
Weight decay, 118, 177, 231, 434
Weight space symmetry, 284
Weights, 15, 107
Whitening, 458
Wikibase, 486
Word embedding, 467
Word-sense disambiguation, 487
WordNet, 486
Zero-data learning, see zero-shot learning
Zero-shot learning, 541