CONTENTS
6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.5 Back-Propagation and Other Differentiation Algorithms . . . . . 203
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Regularization for Deep Learning 228
7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 230
7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 237
7.3 Regularization and Under-Constrained Problems . . . . . . . . . 239
7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 240
7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 244
7.7 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 251
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 253
7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 255
7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 267
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 268
8 Optimization for Training Deep Models 274
8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 275
8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . 282
8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . 301
8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . 306
8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . 310
8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 318
9 Convolutional Networks 331
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 332
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 346
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 348
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 359
9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 363
9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 364