Bibliography
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M.,
Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R.,
Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I.,
Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden,
P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale
machine learning on heterogeneous systems. Software available from tensorflow.org. 25,
212, 449
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 573, 656
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013, arXiv:1211.4246 . 510, 516, 524
Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P. (2015).
GSNs: Generative stochastic networks. arXiv:1503.05571. 513, 715
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 21
Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple object recognition with visual
attention. arXiv:1412.7755 . 693
Bachman, P. and Precup, D. (2015). Variational generative stochastic networks with
collaborative shaping. In Proceedings of the 32nd International Conference on Machine
Learning, ICML 2015, Lille, France, 6-11 July 2015 , pages 1964–1972. 718
Bacon, P.-L., Bengio, E., Pineau, J., and Precup, D. (2015). Conditional computation in
neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference
on Reinforcement Learning and Decision Making (RLDM 2015). 453
Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In D. Koller,
D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information
Processing Systems 21 (NIPS’08), pages 113–120. 501
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly
learning to align and translate. In ICLR’2015, arXiv:1409.0473 . 25, 101, 399, 421, 422,
468, 478
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer, Speech and Language, 2,
219–234. 461
Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis:
Learning from examples without local minima. Neural Networks, 2, 53–58. 286
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 396
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in
high-energy physics with deep learning. Nature communications, 5. 26
Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. (1983). Parallel vision computation.
Nature. 455
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311. 146
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 198
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 493
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 493
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 25, 82, 212,
222, 449
Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In
AAAI’2013 . 329
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International
Conference on Computational Learning Theory (COLT’95), pages 311–320, Santa Cruz,
California. ACM Press. 246
Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. ArXiv
e-prints. 264
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 544
Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction
pyramid. Int. J. Computational Intelligence and Applications, 1(4), 427–438. 518
Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold
logic-a comprehensive survey. Neural Networks, IEEE Transactions on, 14(5), 1217–1243.
454
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for
embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors,
Advances in Neural Information Processing Systems 14 (NIPS’01), Cambridge, MA.
MIT Press. 244
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 163, 521
Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015a). Conditional computation in
neural networks for faster models. arXiv:1511.06297. 453
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint
distributions using neural networks. IEEE Transactions on Neural Networks, special
issue on Data Mining and Knowledge Discovery, 11(3), 550–557. 709
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015b). Scheduled sampling for
sequence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.
385
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition.
Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 409
Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation,
12(8), 1889–1900. 438
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report 1215,
Dept. IRO, Université de Montréal. 470
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 200, 626
Bengio, Y. (2013). Deep learning of representations: looking forward. In Statistical
Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science,
pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 451
Bengio, Y. (2015). Early inference in energy-based models approximates back-propagation.
Technical Report arXiv:1510.02777, Universite de Montreal. 658
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS 12 , pages 400–406. MIT Press. 707, 709, 710, 712
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 516, 614
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural
Information Processing Systems 16 (NIPS’03), Cambridge, MA. MIT Press, Cambridge.
122
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale
Kernel Machines. 19
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In L. Saul,
Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems
17 (NIPS’04), pages 129–136. MIT Press. 159, 522
Bengio, Y. and Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003 . 473
Bengio, Y. and Sénécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–722.
473
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 27, 462
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Neural network-Gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4 , pages 175–182.
Morgan Kaufmann. 462
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term
dependencies in recurrent networks. In IEEE International Conference on Neural
Networks, pages 1183–1195, San Francisco. IEEE Press. (invited paper). 405
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Tr. Neural Nets. 18, 403, 405, 406, 414
Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyper-
parameters. Learning Conference, Snowbird. 438
Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model.
In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages 932–938. MIT
Press. 18, 450, 466, 469, 475, 480, 485
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic
language model. JMLR, 3, 1137–1155. 469, 475
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex
neural networks. In NIPS’2005 , pages 123–130. 257
Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 157
Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 159, 523
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise
training of deep networks. In NIPS’2006 . 14, 19, 200, 324, 325, 531, 533
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 329
Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep
representations. In ICML’2013 . 607
Bengio, Y., Léonard, N., and Courville, A. (2013b). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 451, 453,
691, 693
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013c). Generalized denoising auto-
encoders as generative models. In NIPS’2013 . 510, 713, 715
Bengio, Y., Courville, A., and Vincent, P. (2013d). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 558
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 713, 714, 715, 716, 717
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 632
Bennett, J. and Lanning, S. (2007). The Netflix prize. 482
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 476
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 617
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern
Classification. Ph.D. thesis, Université de Montréal. 254
Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features for pretraining complex
cell-like networks. In NIPS’2009 . 497
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J.
Machine Learning Res., 13, 281–305. 437, 438
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression
compiler. In Proc. SciPy. 25, 82, 212, 222, 449
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter
optimization. In NIPS’2011 . 439
Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex
cell properties. Journal of Vision, 5(6), 579–602. 498
Bertsekas, D. P. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
106
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
618
Bishop, C. M. (1994). Mixture density networks. 188
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 ,
volume 1, page 141–148. 242, 249
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.
Neural Computation, 7(1), 108–116. 242
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 98, 145
Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete.
293
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and
the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 114
Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non
linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 691
Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured
embeddings of knowledge bases. In AAAI 2011 . 487
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 403, 487
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2013a). A semantic matching energy
function for learning with multi-relational data. Machine Learning: Special Issue on
Learning Semantics. 486
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013b).
Translating embeddings for modeling multi-relational data. In C. Burges, L. Bottou,
M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information
Processing Systems 26 , pages 2787–2795. Curran Associates, Inc. 487
Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In ICLR’2015,
arXiv:1406.2751 . 695
Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. (2015). Training bidirectional
Helmholtz machines. Technical report, arXiv:1506.03877. 695
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for opti-
mal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on
Computational learning theory, pages 144–152, New York, NY, USA. ACM. 18, 140
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor,
Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK. 296
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv:1102.1808. 401, 403
Bottou, L. (2015). Multilayer neural networks. Deep Learning Summer School. 443
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008 .
282, 295
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal
dependencies in high-dimensional sequences: Application to polyphonic music generation
and transcription. In ICML’12 . 688
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision algorithms. In Proc. International Conference on Machine learning (ICML’10).
346
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:
multi-way local pooling for image recognition. In Proc. International Conference on
Computer Vision (ICCV’11). IEEE. 346
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 505
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 462
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 93
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
284
Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for
time-series imputation. Journal of Machine Learning Research, 14, 2771–2797. 676,
700
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 163,
521
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 255
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 145
Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden
Markov model interpretation. Speech Communication, 9(1), 83–92. 185
Briggman, K., Denk, W., Seung, S., Helmstaedter, M. N., and Turaga, S. C. (2009).
Maximin affinity learning of image segmentation. In NIPS’2009 , pages 1865–1873. 360
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 21
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-
based n-gram models of natural language. Computational Linguistics, 18, 467–479.
466
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and
control. Blaisdell Pub. Co. 225
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving
optimum programming problems. Technical Report BR-1303, Raytheon Company,
Missile and Space Division. 225
Buciluă, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 451
Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders.
arXiv preprint arXiv:1509.00519 . 700
Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop
on, pages 291–296. IEEE. 193
Carreira-Perpiñan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International
Workshop on Artificial Intelligence and Statistics (AISTATS’05), pages 33–40. Society
for Artificial Intelligence and Statistics. 614
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 245
Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simul-
tanées. In Compte rendu des séances de l’académie des sciences, pages 536–538. 83,
224
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 163
Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM
computing surveys (CSUR), 41(3), 15. 102
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural
Information Processing Systems 15 (NIPS’02), pages 585–592, Cambridge, MA. MIT
Press. 244
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 244, 544
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural
Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 24, 27, 448
Chen, B., Ting, J.-A., Marlin, B. M., and de Freitas, N. (2010). Deep learning of invariant
spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised
Feature Learning Workshop. 361
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer, Speech and Language, 13(4), 359–393. 465, 476
Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao:
A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Pro-
ceedings of the 19th international conference on Architectural support for programming
languages and operating systems, pages 269–284. ACM. 454
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C.,
and Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for
heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 . 25
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N.,
et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622.
IEEE. 454
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:
Building an efficient and scalable deep learning training system. In 11th USENIX
Symposium on Operating Systems Design and Implementation (OSDI’14). 450
Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient for learning restricted
Boltzmann machines. In IJCNN’2010 . 606, 617
Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for
training restricted Boltzmann machines. In ICML’2011 , pages 105–112. 676
Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y.
(2014a). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of the Empiricial Methods in Natural Language
Processing (EMNLP 2014). 397, 477, 478
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the prop-
erties of neural machine translation: Encoder-decoder approaches. ArXiv e-prints,
abs/1409.1259. 414
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 285, 286
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602.
463
Christianson, B. (1992). Automatic Hessians by reverse accumulation. IMA Journal of
Numerical Analysis, 12(2), 135–150. 224
Chrupala, G., Kadar, A., and Alishahi, A. (2015). Learning language through pictures.
arXiv 1506.03694. 414
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 414, 463
Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2015a). Gated feedback recurrent
neural networks. In ICML’15 . 414
Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y. (2015b). A
recurrent latent variable model for sequential data. In NIPS’2015 . 700
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 23, 200
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
24, 27, 449
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 27, 254, 501
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in
unsupervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 364, 365, 458
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013).
Deep learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors,
Proceedings of the 30th International Conference on Machine Learning (ICML-13),
volume 28 (3), pages 1337–1345. JMLR Workshop and Conference Proceedings. 24, 27,
365, 450
Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning:
A tensor analysis. arXiv:1509.05009. 557
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI,
LIP6. 196
Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011 .
101, 480
Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing:
Deep neural networks with multitask learning. In ICML’2008 . 474, 480
Collobert, R. and Weston, J. (2008b). A unified architecture for natural language
processing: Deep neural networks with multitask learning. In ICML’2008 . 538
Collobert, R., Bengio, S., and Bengio, Y. (2001). A parallel mixture of SVMs for very
large scale problems. Technical Report IDIAP-RR-01-12, IDIAP. 453
Collobert, R., Bengio, S., and Bengio, Y. (2002). Parallel mixture of SVMs for very large
scale problems. Neural Computation, 14(5), 1105–1114. 453
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a).
Natural language processing (almost) from scratch. The Journal of Machine Learning
Research, 12, 2493–2537. 329, 480, 538, 539
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environ-
ment for machine learning. In BigLearn, NIPS Workshop. 25, 210, 449
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 494
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 18, 140
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation
using depth information. In International Conference on Learning Representations
(ICLR2013). 23, 200
Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep
learning. In Arxiv:1412.7024, ICLR’2015 Workshop. 455
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 564, 683
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 685
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 73
Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search
approach to unconstrained face recognition. In Automatic Face & Gesture Recognition
and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.
364
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 135,
295
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 612
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics
of Control, Signals, and Systems, 2, 303–314. 197
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 23
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep
neural networks for large vocabulary speech recognition. IEEE Transactions on Audio,
Speech, and Language Processing, 20(1), 33–42. 462
Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks
for LVCSR using rectified linear units and dropout. In ICASSP’2013 . 462
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 26
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse
high-dimensional inputs. In NIPS26 . NIPS Foundation. 622
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 474
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 285, 286, 288
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transactions
on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 455
Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of
the 1990 Connectionist Summer School , San Mateo, CA. 693
Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks,
9(8), 1385–1403. 695
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine.
Neural computation, 7(5), 889–904. 695
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,
Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep
networks. In NIPS’2012 . 25, 450
Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation.
Computational Intelligence, 5(3), 142–150. 664
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6), 391–407. 479, 485
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
19, 557
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 21
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
21
Deng, L. and Yu, D. (2014). Deep learning methods and applications. Foundations and
Trends in Signal Processing. 463
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Binary
coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 , Makuhari,
Chiba, Japan. 23
Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. (2012). Learning where to attend
with deep architectures for image tracking. Neural Computation, 24(8), 2151–2184. 368
Denton, E., Chintala, S., Szlam, A., and Fergus, R. (2015). Deep generative image models
using a Laplacian pyramid of adversarial networks. NIPS . 703, 704, 720
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for
vision. Technical Report 1327, Département d’Informatique et de Recherche Opéra-
tionnelle, Université de Montréal. 685
Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O. (2010).
Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In
International Conference on Artificial Intelligence and Statistics, pages 145–152. 606,
617
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 633
Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In
Advances in Neural Information Processing Systems, pages 2062–2070. 321
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 476
Devroye, L. (2013). Non-Uniform Random Variate Generation. SpringerLink : Bücher.
Springer New York. 696
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 26, 367
Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components
estimation. arXiv:1410.8516. 496
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 102
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding
techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 163, 522
Dosovitskiy, A., Springenberg, J. T., and Brox, T. (2015). Learning to generate chairs with
convolutional neural networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1538–1546. 697, 706, 707
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 403, 406
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of
Mathematical Analysis and Applications, 5(1), 30–45. 225
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time
lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 225
Drucker, H. and LeCun, Y. (1992). Improving generalisation performance using double
back-propagation. IEEE Transactions on Neural Networks, 3(6), 991–997. 270
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research. 307
Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning.
In Proceedings of the 28th International Conference on Machine learning, ICML ’11.
485
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In T. Leen, T. Dietterich, and V. Tresp,
editors, Advances in Neural Information Processing Systems 13 (NIPS’00), pages
472–478. MIT Press. 68, 196
Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural net-
works via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906 .
705
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 401, 410
Elkahky, A. M., Song, Y., and He, X. (2015). A multi-view deep learning approach for
cross domain user modeling in recommendation systems. In Proceedings of the 24th
International Conference on World Wide Web, pages 278–288. 483
Elman, J. L. (1993). Learning and development in neural networks: The importance of
starting small. Cognition, 48, 781–799. 329
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty
of training deep architectures and the effect of unsupervised pre-training. In Proceedings
of AISTATS’2009 . 200
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
532, 536, 537
Fahlman, S. E., Hinton, G. E., and Sejnowski, T. J. (1983). Massively parallel architectures
for AI: NETL, thistle, and Boltzmann machines. In Proceedings of the National
Conference on Artificial Intelligence AAAI-83 . 573, 656
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 102
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and
Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman,
M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 526
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features
for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(8), 1915–1929. 23, 200, 360
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 541
Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. (2015). Learning
visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv
preprint arXiv:1509.06113 . 25
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 21, 105
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International
Joint Conference on Neural Networks (IJCNN), volume 1, pages 401–405, Washington
1989. IEEE, New York. 497
Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place,
head-direction, and spatial-view cells. 498
Franzius, M., Wilbert, N., and Wiskott, L. (2008). Invariant object recognition with slow
feature analysis. In Artificial Neural Networks-ICANN 2008 , pages 961–970. Springer.
499
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 401,
403
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive
processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
401, 403
Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156,
USA. ACM. 257
Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In
Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages
325–332. 257
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 707, 708
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn good
density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances
in Neural Information Processing Systems 8 (NIPS’95), pages 661–670. MIT Press,
Cambridge, MA. 654
Frobenius, G. (1908). Über Matrizen aus positiven Elementen. S.-B. Preuss. Akad. Wiss.,
Berlin, Germany. 600
Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological
Cybernetics, 20, 121–136. 16, 226, 531
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics,
36, 193–202. 16, 24, 27, 226, 368
Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli
approximate variational inference. arXiv preprint arXiv:1506.02158 . 263
Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. (1987). Memoires associatives
distribuees. In Proceedings of COGNITIVA 87 , Paris, La Villette. 518
Garcia-Duran, A., Bordes, A., Usunier, N., and Grandvalet, Y. (2015). Combining two
and three-way embeddings models for link prediction in knowledge bases. arXiv preprint
arXiv:1506.00999 . 487
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993).
DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1.
NASA STI/Recon Technical Report N, 93, 27403. 462
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 21
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual
prediction with LSTM. Neural computation, 12(10), 2451–2471. 411, 415
Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor
analyzers. Technical Report CRG-TR-96-1, Dpt. of Comp. Sci., Univ. of Toronto. 492
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2015). Multilingual language
processing from bytes. arXiv preprint arXiv:1512.00103 . 480
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2015). Region-based convolutional
networks for accurate object detection and segmentation. 429
Giudice, M. D., Manera, V., and Keysers, C. (2009). Programmed to learn? The ontogeny
of mirror neurons. Dev. Sci., 12(2), 350–363. 658
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In AISTATS’2010 . 303
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 16, 173, 196, 226
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale
sentiment classification: A deep learning approach. In ICML’2011 . 510, 540
Goldberger, J., Roweis, S., Hinton, G. E., and Salakhutdinov, R. (2005). Neighbourhood
components analysis. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural
Information Processing Systems 17 (NIPS’04). MIT Press. 115
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 164, 522
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 254
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).
Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction
(HRI), Osaka, Japan. ACM Press, ACM Press. 100
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Université de Montréal. 358
Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models.
In International Conference on Learning Representations, Workshops Track . 625, 702,
703
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 535, 541
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 192, 263, 345, 366, 458
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep
Boltzmann machines. In NIPS26 . NIPS Foundation. 100, 620, 673, 674, 675, 676, 677,
700
Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R.,
Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research
library. arXiv preprint arXiv:1308.4214 . 25, 449
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013d). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 500, 501, 502, 652, 685
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An empirical
investigation of catastrophic forgeting in gradient-based neural networks. In ICLR’2014 .
193
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-
sarial examples. CoRR, abs/1412.6572. 267, 268, 270, 558, 559
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014 .
547, 691, 702, 703, 706
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-digit
number recognition from Street View imagery using deep convolutional neural networks.
In International Conference on Learning Representations. 25, 101, 200, 201, 202, 391,
425, 452
Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural
network optimization problems. In International Conference on Learning Representa-
tions. 285, 286, 287, 291
Goodman, J. (2001). Classes for fast maximum entropy training. In International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 470
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 284
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 21
Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 479, 542
Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits
and Devices Magazine, IEEE , 5(4), 44–49. 454
Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011 . 242
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 375, 396, 414, 463
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report,
arXiv:1308.0850. 189, 411, 418, 422
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 411
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
396
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 396
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal
classification: Labelling unsegmented sequence data with recurrent neural networks. In
ICML’2006 , pages 369–376, Pittsburgh, USA. 463
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 396
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J.
(2009). A novel connectionist system for unconstrained handwriting recognition. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868. 411
Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In ICASSP’2013 , pages 6645–6649. 396, 399, 401, 411, 413, 414, 463
Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines.
arXiv:1410.5401. 25
Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural Turing machines. arXiv preprint
arXiv:1410.5401 . 419, 421
Grefenstette, E., Hermann, K. M., Suleyman, M., and Blunsom, P. (2015). Learning to
transduce with unbounded memory. In NIPS’2015 . 421
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015).
LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 . 415
Gregor, K. and LeCun, Y. (2010a). Emergence of complex-like cells in a temporal product
network with local receptive fields. Technical report, arXiv:1006.0448. 353
Gregor, K. and LeCun, Y. (2010b). Learning fast approximations of sparse coding. In
L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International
Conference on Machine Learning (ICML-10). ACM. 655
Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep
autoregressive networks. In International Conference on Machine Learning (ICML’2014).
695
Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural
network for image generation. arXiv preprint arXiv:1502.04623 . 700
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A
kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.
705
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior information
for optimization. In International Conference on Learning Representations (ICLR’2013).
25
Guo, H. and Gelfand, S. B. (1992). Classification trees with neural network feature
extraction. Neural Networks, IEEE Transactions on, 3(6), 923–933. 453
Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning
with limited numerical precision. CoRR, abs/1502.02551. 455
Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 623
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-range
traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA, USA. 456
Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. (1993). Threshold circuits
of bounded depth. J. Comput. System. Sci., 46, 129–154. 198
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 198
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 198
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning:
data mining, inference and prediction. Springer Series in Statistics. Springer Verlag.
145
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing
human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852 .
28, 192
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 14, 17, 658
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 526
Henderson, J. (2003). Inducing history representations for broad coverage statistical
parsing. In HLT-NAACL, pages 103–110. 480
Henderson, J. (2004). Discriminative training of a neural network statistical parser. In
Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,
page 95. 480
Henniges, M., Puertas, G., Bornschein, J., Eggert, J., and Lücke, J. (2010). Binary sparse
coding. In Latent Variable Analysis and Signal Separation, pages 450–457. Springer.
643
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie
des Sciences, 299(III-13), 525–528. 494
Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 307
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 23,
463
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531 . 451
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40,
185–234. 497
Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial
Intelligence, 46(1), 47–75. 421
Hinton, G. E. (1999). Products of experts. In ICANN’1999 . 573
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 613,
678
Hinton, G. E. (2006). To recognize shapes, first learn to generate images. Technical Report
UTML TR 2006-003, University of Toronto. 531, 598
Hinton, G. E. (2007a). How to do backpropagation in a brain. Invited talk at the
NIPS’2007 Deep Learning Workshop. 658
Hinton, G. E. (2007b). Learning multiple layers of representation. Trends in cognitive
sciences, 11(10), 428–434. 662
Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines.
Technical Report UTML TR 2010-003, Department of Computer Science, University of
Toronto. 613
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse
distributed representations. Philosophical Transactions of the Royal Society of London.
146
Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In
NIPS’1987 , pages 358–366. 505
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 522
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 512, 527, 531, 532, 537
Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines.
In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,
volume 1, chapter 7, pages 282–317. MIT Press, Cambridge. 573, 656
Hinton, G. E. and Sejnowski, T. J. (1999). Unsupervised learning: foundations of neural
computation. MIT press. 544
Hinton, G. E. and Shallice, T. (1991). Lesioning an attractor network: investigations of
acquired dyslexia. Psychological review, 98(1), 74. 13
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 505
Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint
satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon
University, Dept. of Computer Science. 573, 656
Hinton, G. E., McClelland, J., and Rumelhart, D. (1986). Distributed representations.
In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, volume 1, pages 77–109. MIT Press,
Cambridge. 17, 225, 529
Hinton, G. E., Revow, M., and Dayan, P. (1995a). Recognizing handwritten digits using
mixtures of linear models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances
in Neural Information Processing Systems 7 (NIPS’94), pages 1015–1022. MIT Press,
Cambridge, MA. 492
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995b). The wake-sleep algorithm
for unsupervised neural networks. Science, 268, 1158–1161. 507, 654
Hinton, G. E., Dayan, P., and Revow, M. (1997). Modelling the manifolds of images of
handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74. 502
Hinton, G. E., Welling, M., Teh, Y. W., and Osindero, S. (2001). A new view of ICA. In
Proceedings of 3rd International Conference on Independent Component Analysis and
Blind Signal Separation (ICA’01), pages 746–751, San Diego, CA. 494
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 14, 19, 27, 142, 531, 532, 662, 663
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 101
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c).
Improving neural networks by preventing co-adaptation of feature detectors. Technical
report, arXiv:1207.0580. 239, 261, 266
Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the
BayLearn Bay Area Machine Learning Symposium. 451
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 18, 403, 405
Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat
minima. In Advances in Neural Information Processing Systems 7 , pages 529–536. MIT
Press. 243
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8), 1735–1780. 18, 411, 414
Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 414
Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network
hardware implementations. Computers, IEEE Transactions on, 42(3), 281–290. 454
Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited preci-
sion calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint
Conference on, volume 2, pages 121–126. IEEE. 454
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 197
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks. Neural
networks, 3(5), 551–560. 197
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18.
619
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep
structured semantic models for web search using clickthrough data. In Proceedings of
the 22nd ACM international conference on Conference on information & knowledge
management, pages 2333–2338. ACM. 483
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 365
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 365
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 365
Huszar, F. (2015). How (not) to train your generative model: schedule sampling, likelihood,
adversary? arXiv:1511.05101. 699
Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In LION-5 . Extended version as UBC Tech report
TR-2010-10. 439
Hyötyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96 , pages
13–24. 380
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 494
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching.
Journal of Machine Learning Research, 6, 695–709. 516, 620
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 621
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 621
Hyvärinen, A. and Hoyer, P. O. (1999). Emergence of topography and complex cell
properties from natural images using extensions of ICA. In NIPS , pages 827–833. 496
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis:
Existence and uniqueness results. Neural Networks, 12(3), 429–439. 496
Hyvärinen, A., Karhunen, J., and Oja, E. (2001a). Independent Component Analysis.
Wiley-Interscience. 494
Hyvärinen, A., Hoyer, P. O., and Inki, M. O. (2001b). Topographic independent component
analysis. Neural Computation, 13(7), 1527–1558. 496
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A probabilistic
approach to early computational vision. Springer-Verlag. 371
Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics,
C12, 623–656. 606
Inayoshi, H. and Kurita, T. (2005). Improved generalization by adding both auto-
association and hidden-layer noise to neural-network-based-classifiers. IEEE Workshop
on Machine Learning for Signal Processing, pages 141–146. 518
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. 100, 318, 321
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation.
Neural networks, 1(4), 295–307. 307
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures
of local experts. Neural Computation, 3, 79–87. 188, 453
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 406
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 401
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 406
Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation
study. Technical report, Jacobs University Bremen. 407
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 27, 406
Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and
applications of echo state networks with leaky-integrator neurons. Neural Networks,
20(3), 335–352. 410
Jain, V., Murray, J. F., Roth, F., Turaga, S., Zhigulin, V., Briggman, K. L., Helmstaedter,
M. N., Denk, W., and Seung, H. S. (2007). Supervised learning of image restoration
with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th
International Conference on, pages 1–8. IEEE. 360
Jaitly, N. and Hinton, G. (2011). Learning a better representation of speech soundwaves
using restricted Boltzmann machines. In Acoustics, Speech and Signal Processing
(ICASSP), 2011 IEEE International Conference on, pages 5884–5887. IEEE. 461
Jaitly, N. and Hinton, G. E. (2013). Vocal tract length perturbation (VTLP) improves
speech recognition. In ICML’2013 . 241
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best
multi-stage architecture for object recognition? In ICCV’09 . 16, 24, 27, 173, 192, 226,
364, 365, 526
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett.,
78, 2690–2693. 628, 631
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 53
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 477
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 465, 476
Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding.
http://caffe.berkeleyvision.org/. 25, 210
Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field
learning for pooled image features. In Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, pages 3370–3377. IEEE. 346
Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Transactions on Neural Networks,
7(6), 1424–1438. 242
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 18
Joulin, A. and Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented
recurrent nets. arXiv preprint arXiv:1503.01007 . 421
Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical evaluation of recurrent
network architectures. In ICML’2015 . 306, 414, 415
Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT Press.
293
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive
algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 494
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vincent,
P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P. L., Dauphin,
Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Raymond,
J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A., Bengio,
E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific deep
neural networks for emotion recognition in video. In Proceedings of the 15th ACM on
International Conference on Multimodal Interaction. 200
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 477
Kalchbrenner, N., Danihelka, I., and Graves, A. (2015). Grid long short-term memory.
arXiv preprint arXiv:1507.01526 . 397
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 518
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 102
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 21
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 95
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 465, 476
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational and
Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01.
526
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009 . 526
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’2010 .
365, 526
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10),
947–954. 225
Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning
and teaching dimension. In Advances in Neural Information Processing Systems 24
(NIPS’11), pages 1449–1457. 329
Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable
restricted Boltzmann machine FPGA implementation. In Field Programmable Logic
and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE.
454
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics; V. 1). American Mathematical Society. 569
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980 . 308
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 516, 623
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014 . 429
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arxiv:1306.0733. 655, 691, 698
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 691, 701
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through
transformations between bayes nets and neural nets. Technical report, arxiv:1402.0480.
691
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671–680. 328
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 102
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embeddings
with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 102, 411
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 479, 542
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 26
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 585, 598, 648
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and
maximization of a posteriori probabilities application to transition-based connectionist
speech recognition. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in
Neural Information Processing Systems 8 (NIPS’95). MIT Press, Cambridge, MA. 462
Koren, Y. (2009). The BellKor solution to the Netflix grand prize. 256, 482
Kotzias, D., Denil, M., de Freitas, N., and Smyth, P. (2015). From group to individual
labels using deep features. In ACM SIGKDD. 106
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 410
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 479
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013 . 556
Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-10. Technical report,
University of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/~kriz/conv-
cifar10-aug2010.pdf. 449
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 21, 564
Krizhevsky, A. and Hinton, G. E. (2011). Using very deep autoencoders for content-based
image retrieval. In ESANN . 528
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 23, 24, 27, 100, 200, 372, 457, 461
Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps.
Cognition, 110, 380–394. 329
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492,
Berkeley, Calif. University of California Press. 95
Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Iyyer,
M., Gulrajani, I., and Socher, R. (2015). Ask me anything: Dynamic memory networks
for natural language processing. arXiv:1506.07285 . 421, 488
Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable
models. In NIPS’2010 . 329
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network
architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon
University. 368, 375, 409
Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network
architecture for isolated word recognition. Neural networks, 3(1), 23–43. 375
Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for contextual multi-armed
bandits. In NIPS’2008 , pages 1096–1103. 483
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 496
Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted
Boltzmann machines. In ICML’2008 . 244, 254, 533, 688, 717
Larochelle, H. and Hinton, G. E. (2010). Learning to combine foveal glimpses with a
third-order Boltzmann machine. In Advances in Neural Information Processing Systems
23 , pages 1243–1251. 368
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 707, 710
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 542
Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for
training deep neural networks. Journal of Machine Learning Research, 10, 1–40. 538
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and
discriminative models. In Proceedings of the Computer Vision and Pattern Recognition
Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society.
244, 252
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled
convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems
23 (NIPS’10), pages 1279–1287. 353
Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization
methods for deep learning. In Proc. ICML’2011 . ACM. 316
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 24, 27
Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann
machines and deep belief networks. Neural Computation, 20(6), 1631–1649. 556, 657
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approxi-
mators. Neural Computation, 22(8), 2192–2207. 556
LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In
Cognitiva 85: A la Frontière de l’Intelligence Artificielle, des Sciences de la Connaissance
et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 225
LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In F. Fogelman-
Soulié, E. Bienenstock, and G. Weisbuch, editors, Disordered Systems and Biological
Organization, pages 233–240. Springer-Verlag, Les Houches, France. 351
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI. 18, 505, 518
LeCun, Y. (1989). Generalization and network design strategies. Technical Report
CRG-TR-89-4, University of Toronto. 331, 351
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 369
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Efficient backprop. In
Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524.
Springer Verlag. 310, 432
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning
applied to document recognition. Proc. IEEE. 16, 18, 21, 27, 372, 461, 463
LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and
applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, pages 253–256. IEEE. 372
L’Ecuyer, P. (1994). Efficiency improvement and variance reduction. In Proceedings of
the 1994 Winter Simulation Conference, pages 122–132. 692
Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-supervised nets.
arXiv preprint arXiv:1409.5185 . 327
Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms.
In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19 (NIPS’06), pages 801–808. MIT Press. 640
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 254
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on
Machine Learning (ICML’09). ACM, Montreal, Canada. 364, 685, 686
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual
category discovery. In CVPR’2011 . 329
Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,
2010). 224
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; representation
and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc. 2
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward
networks with a nonpolynomial activation function can approximate any function.
Neural Networks, 6, 861–867. 197, 198
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least
squares. Quarterly Journal of Applied Mathematics, II(2), 164–168. 312
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes
courbes. Paris: L’Imprimerie Royale. 224
Li, Y., Swersky, K., and Zemel, R. S. (2015). Generative moment matching networks.
CoRR, abs/1502.02761. 705
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 409
Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation
embeddings for knowledge graph completion. In Proc. AAAI’15 . 487
Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 2
Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s
perspective. In Proc. Third Workshop on Neural Networks: From Biology to High
Energy Physics, pages 195––202, Isola d’Elba, Italy. 454
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT
Numerical Mathematics, 16(2), 146–160. 225
LISA (2008). Deep learning tutorials: Restricted Boltzmann machines. Technical report,
LISA Lab, Université de Montréal. 591
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to
approximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 660
Lotter, W., Kreiman, G., and Cox, D. (2015). Unsupervised learning of visual structure
using predictive generative networks. arXiv preprint arXiv:1511.06380 . 547, 548
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network
encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech. 463
Lu, T., Pál, D., and Pál, M. (2010). Contextual multi-armed bandits. In International
Conference on Artificial Intelligence and Statistics, pages 485–492. 483
Luenberger, D. G. (1984). Linear and Nonlinear Programming. Addison Wesley. 317
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 406
Luo, H., Shen, R., Niu, C., and Ullrich, C. (2011). Learning class-relevant features and
class-irrelevant features via a hybrid third-order RBM. In International Conference on
Artificial Intelligence and Statistics, pages 470–478. 689
Luo, H., Carrier, P. L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 102
Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of the
Twenty-fifth Conference in Uncertainty in Artificial Intelligence (UAI’09). 621
Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E., and Svetnik, V. (2015). Deep neural nets
as a method for quantitative structure activity relationships. J. Chemical information
and modeling. 533
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural
network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and
Language Processing. 192
Maass, W. (1992). Bounds for the computational power and learning complexity of analog
neural nets (extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing,
pages 335–344. 198
Maass, W., Schnitger, G., and Sontag, E. D. (1994). A comparison of the computational
power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural
Computation and Learning, pages 127–151. 198
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 406
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 73
Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter
optimization through reversible learning. arXiv preprint arXiv:1502.03492 . 438
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 102
Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem.
Zeitschrift für Operations Research (Theory), 36, 517–545. 276
Marlin, B. and de Freitas, N. (2011). Asymptotic efficiency of deterministic estimators for
discrete energy-based models: Ratio matching and pseudolikelihood. In UAI’2011 . 620,
622
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 616, 621, 622
Marquardt, D. W. (1963). An algorithm for least-squares estimation of non-linear parameters.
Journal of the Society for Industrial and Applied Mathematics, 11(2), 431–441. 312
Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science,
194. 368
Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and
M. Littman, editors, Proceedings of the Twenty-seventh International Conference on
Machine Learning (ICML-10), pages 735–742. ACM. 304
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 557
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free
optimization. In Proc. ICML’2011 . ACM. 415
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612. 619
McClelland, J., Rumelhart, D., and Hinton, G. (1995). The appeal of parallel distributed
processing. In Computation & intelligence, pages 305–341. American Association for
Artificial Intelligence. 17
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 14, 15
Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, volume 80.
Springer Science & Business Media. 454
Melchior, J., Fischer, A., and Wiskott, L. (2013). How to center binary deep Boltzmann
machines. arXiv preprint arXiv:1311.1354 . 675
Memisevic, R. and Hinton, G. E. (2007). Unsupervised learning of image transformations.
In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’07).
688
Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations
with factored higher-order Boltzmann machines. Neural Computation, 22(6), 1473–1492.
688
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 200, 535, 541
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 713
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 480
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 417
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empirical
evaluation and combination of advanced language modeling techniques. In Proc. 12th an-
nual conference of the international speech communication association (INTERSPEECH
2011). 475
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 329, 475
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word rep-
resentations in vector space. In International Conference on Learning Representations:
Workshops Track. 539
Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 542
Minka, T. (2005). Divergence measures and message passing. Technical Report
MSR-TR-2005-173, Microsoft Research, Cambridge, UK. 628
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 15
Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint
arXiv:1411.1784 . 703
Mishkin, D. and Matas, J. (2015). All you need is a good init. arXiv preprint
arXiv:1511.06422 . 305
Misra, J. and Saha, I. (2010). Artificial neural networks in hardware: A survey of two
decades of progress. Neurocomputing, 74(1), 239–255. 454
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 99
Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional
smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677. 268
Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief
networks. In ICML’2014 . 693, 695
Mnih, A. and Hinton, G. E. (2007). Three new graphical models for statistical language
modelling. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International
Conference on Machine Learning (ICML’07), pages 641–648. ACM. 467
Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model.
In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural
Information Processing Systems 21 (NIPS’08), pages 1081–1088. 470
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 475, 625
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural
probabilistic language models. In ICML’2012 , pages 1751–1758. 475
Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.
In Proceedings of the 11th European Conference on Computer Vision (ECCV). 102
Mnih, V., Larochelle, H., and Hinton, G. (2011). Conditional restricted Boltzmann
machines for structure output prediction. In Proc. Conf. on Uncertainty in Artificial
Intelligence (UAI). 687
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., and Wierstra, D. (2013).
Playing Atari with deep reinforcement learning. Technical report, arXiv:1312.5602. 106
Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual
attention. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,
editors, NIPS’2014 , pages 2204–2212. 693
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A.,
Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015).
Human-level control through deep reinforcement learning. Nature, 518, 529–533. 25
Mobahi, H. and Fisher, III, J. W. (2015). A theoretical analysis of optimization by
Gaussian continuation. In AAAI’2015 . 328
Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence
in video. In L. Bottou and M. Littman, editors, Proceedings of the 26th International
Conference on Machine Learning, pages 737–744, Montreal. Omnipress. 497
Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition.
462
Mohamed, A., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., and Picheny,
M. A. (2011). Deep belief networks using discriminative features for phone recognition. In
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference
on, pages 5060–5063. IEEE. 462
Mohamed, A., Dahl, G., and Hinton, G. (2012a). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 462
Mohamed, A., Hinton, G., and Penn, G. (2012b). Understanding how deep belief networks
perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP),
2012 IEEE International Conference on, pages 4273–4276. IEEE. 462
Møller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6, 525–533. 316
Montavon, G. and Müller, K.-R. (2012). Deep Boltzmann machines and the centering
trick. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of
the Trade, volume 7700 of Lecture Notes in Computer Science, pages 621–637. Preprint:
http://arxiv.org/abs/1203.3783. 675
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 556
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 556
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 19, 199
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–7. 3
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005 . 470, 472
Mozer, M. C. (1992). The induction of multiscale temporal structure. In J. M. S. Hanson
and R. Lippmann, editors, Advances in Neural Information Processing Systems 4
(NIPS’91), pages 275–282, San Mateo, CA. Morgan Kaufmann. 410
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press,
Cambridge, MA, USA. 62, 98, 145
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 189, 712
Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 16, 173, 196
Nair, V. and Hinton, G. E. (2009). 3d object recognition with deep belief nets. In Y. Bengio,
D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in
Neural Information Processing Systems 22 , pages 1339–1347. Curran Associates, Inc.
688
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis.
In NIPS’2010 . 163
Naumann, U. (2008). Optimal Jacobian accumulation is NP-complete. Mathematical
Programming, 112(2), 427–441. 221
Navigli, R. and Velardi, P. (2005). Structural semantic interconnections: a knowledge-
based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and
Machine Intelligence, 27(7), 1075––1086. 487
Neal, R. and Hinton, G. (1999). A view of the EM algorithm that justifies incremental,
sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT
Press, Cambridge, MA. 637
Neal, R. M. (1990). Learning stochastic feedforward networks. Technical report. 694
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte-Carlo methods.
Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. 682
Neal, R. M. (1994). Sampling from multimodal distributions using tempered transitions.
Technical Report 9421, Dept. of Statistics, University of Toronto. 606
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 264
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 628, 630, 631, 632
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 632
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence
rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376. 300
Nesterov, Y. (2004). Introductory lectures on convex optimization : a basic course. Applied
optimization. Kluwer Academic Publ., Boston, Dordrecht, London. 300
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading
digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 21
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technology
(Eurospeech), pages 973–976, Berlin. 466
Ng, A. (2015). Advice for applying machine learning.
https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf. 424
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-
speech and automatically derived category-based language models for speech recognition.
In International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 177–180. 466
Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., and Barbano, P. E. (2005).
Toward automatic phenotyping of developing embryos from videos. Image Processing,
IEEE Transactions on, 14(9), 1360–1371. 361
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 92, 95
Norouzi, M. and Fleet, D. J. (2011). Minimal loss hashing for compact binary codes. In
ICML’2011 . 528
Nowlan, S. J. (1990). Competing experts: An experimental investigation of associative
mixture models. Technical Report CRG-TR-90-5, University of Toronto. 453
Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing.
Neural Computation, 4(4), 473–493. 139
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 16
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties
by learning a sparse code for natural images. Nature, 381, 607–609. 146, 254, 371, 499
Olshausen, B. A., Anderson, C. H., and Van Essen, D. C. (1993). A neurobiological
model of visual attention and invariant pattern recognition based on dynamic routing
of information. J. Neurosci., 13(11), 4700–4719. 453
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.
Neural computation, 21(3), 786–792. 691
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level
image representations using convolutional neural networks. In Computer Vision and
Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE. 539
Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy
of Markov random fields. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,
Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1121–1128,
Cambridge, MA. MIT Press. 635
Ovid and Martin, C. (2004). Metamorphoses. W.W. Norton. 1
Paccanaro, A. and Hinton, G. E. (2000). Extracting distributed representations of concepts
and relations from positive and negative propositions. In International Joint Conference
on Neural Networks (IJCNN), Como, Italy. IEEE, New York. 487
Paine, T. L., Khorrami, P., Han, W., and Huang, T. S. (2014). An analysis of unsupervised
pre-training in light of recent advances. arXiv preprint arXiv:1412.6597 . 535
Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot
learning with semantic output codes. In Y. Bengio, D. Schuurmans, J. D. Lafferty,
C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing
Systems 22 , pages 1410–1418. Curran Associates, Inc. 542
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research
in Economics and Management Sci., MIT. 225
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 289, 403, 406, 410, 417, 419
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 198
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014 . 19, 199, 264, 399, 400, 401, 413, 463
Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 553
Pati, Y., Rezaiifar, R., and Krishnaprasad, P. (1993). Orthogonal matching pursuit:
Recursive function approximation with applications to wavelet decomposition. In Pro-
ceedings of the 27 th Annual Asilomar Conference on Signals, Systems, and Computers,
pages 40–44. 254
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society,
University of California, Irvine, pages 329–334. 566
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 54
Perron, O. (1907). Zur Theorie der Matrices. Mathematische Annalen, 64(2), 248–263. 600
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 31
Peterson, G. B. (2004). A day of great illumination: B. F. Skinner’s discovery of shaping.
Journal of the Experimental Analysis of Behavior , 82(3), 317–328. 329
Pham, D.-T., Garat, P., and Jutten, C. (1992). Separation of a mixture of independent
sources through a maximum likelihood approach. In EUSIPCO, pages 771–774. 494
Pham, P.-H., Jelaca, D., Farabet, C., Martini, B., LeCun, Y., and Culurciello, E. (2012).
NeuFlow: dataflow vision processing system-on-a-chip. In Circuits and Systems (MWS-
CAS), 2012 IEEE 55th International Midwest Symposium on, pages 1044–1047. IEEE.
454
Pinheiro, P. H. O. and Collobert, R. (2014). Recurrent convolutional neural networks for
scene labeling. In ICML’2014 . 360
Pinheiro, P. H. O. and Collobert, R. (2015). From image-level to pixel-level labeling with
convolutional networks. In Conference on Computer Vision and Pattern Recognition
(CVPR). 360
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition
hard? PLoS Comput Biol, 4. 459
Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired
computer vision: A case study in unconstrained face recognition on facebook. In
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, pages 35–42. IEEE. 364
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 401
Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging.
SIAM J. Control and Optimization, 30(4), 838–855. 323
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 296
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 241
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
Proceedings of the Twenty-seventh Conference in Uncertainty in Artificial Intelligence
(UAI), Barcelona, Spain. 557
Presley, R. K. and Haggard, R. L. (1994). A fixed point implementation of the backpropa-
gation learning algorithm. In Southeastcon’94. Creative Technology Transfer-A Global
Affair., Proceedings of the 1994 IEEE , pages 136–138. IEEE. 454
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE
Transactions on Information Theory, 4(2), 69–72. 691
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107. 367
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
555, 703, 704
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 678, 711
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, Proceedings of the
Twenty-sixth International Conference on Machine Learning (ICML’09), pages 873–880,
New York, NY, USA. ACM. 27, 449
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations
of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University
Archive for the History of Economic Thought. 56
Ranzato, M. and Hinton, G. E. (2010). Modeling pixel means and covariances using
factorized third-order Boltzmann machines. In CVPR’2010 , pages 2551–2558. 682
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006 . 14, 19, 510, 531, 533
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of
invariant feature hierarchies with applications to object recognition. In Proceedings of
the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press. 365
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 510
Ranzato, M., Krizhevsky, A., and Hinton, G. E. (2010a). Factored 3-way restricted
Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010 .
680, 681
Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more realistic images using
gated MRFs. In NIPS’2010 . 682, 683
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 135, 295
Rasmus, A., Valpola, H., Honkala, M., Berglund, M., and Raiko, T. (2015). Semi-supervised
learning with ladder network. arXiv preprint arXiv:1507.02672 . 429, 533
Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to
parallelizing stochastic gradient descent. In NIPS’2011 . 450
Reichert, D. P., Seriès, P., and Storkey, A. J. (2011). Neuronal adaptation for sampling-
based probabilistic inference in perceptual bistability. In Advances in Neural Information
Processing Systems, pages 2357–2365. 668
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation
and approximate inference in deep generative models. In ICML’2014 . Preprint:
arXiv:1401.4082. 655, 691, 698
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive
auto-encoders: Explicit invariance during feature extraction. In ICML’2011 . 524, 525,
526
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In ECML PKDD. 524, 525
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c). The manifold
tangent classifier. In NIPS’2011 . 270, 271
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012 . 713
Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive
Science, 28(2), 147–166. 369
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 496
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 27, 462
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 93
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.
(2015). Fitnets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550 . 326
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I.
Linear constraints. Journal of the Society for Industrial and Applied Mathematics, 8(1),
181–217. 93
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review , 65, 386–408. 14, 15, 27
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 15, 27
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 163, 521
Roweis, S., Saul, L., and Hinton, G. (2002). Global coordination of local linear models. In
T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information
Processing Systems 14 (NIPS’01), Cambridge, MA. MIT Press. 492
Rubin, D. B. et al. (1984). Bayesianly justifiable and relevant frequency calculations for
the applied statistician. The Annals of Statistics, 12(4), 1151–1172. 718
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by
back-propagating errors. Nature, 323, 533–536. 14, 18, 23, 203, 225, 374, 479, 485
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal represen-
tations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel
Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge. 21,
27, 225
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 17
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 21
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 28
Russell, S. J. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice
Hall. 86
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 368
Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolu-
tional neural networks for LVCSR. In ICASSP 2013 . 463
Salakhutdinov, R. (2010). Learning in Markov random fields using tempered transitions. In
Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, and A. Culotta, editors, Advances
in Neural Information Processing Systems 22 (NIPS’09). 606
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 24, 27, 532, 665, 668, 673, 674
Salakhutdinov, R. and Hinton, G. (2009b). Semantic hashing. In International Journal of
Approximate Reasoning. 528
Salakhutdinov, R. and Hinton, G. E. (2007a). Learning a nonlinear embedding by
preserving class neighbourhood structure. In Proceedings of the Eleventh International
Conference on Artificial Intelligence and Statistics (AISTATS’07), San Juan, Porto
Rico. Omnipress. 530
Salakhutdinov, R. and Hinton, G. E. (2007b). Semantic hashing. In SIGIR’2007 . 528
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,
Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1249–1256,
Cambridge, MA. MIT Press. 244
Salakhutdinov, R. and Larochelle, H. (2010). Efficient learning of deep Boltzmann machines.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and
Statistics (AISTATS 2010), JMLR W&CP, volume 9, pages 693–700. 655
Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In NIPS’2008 .
482
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of
the Twenty-fifth International Conference on Machine Learning (ICML’08), volume 25,
pages 872–879. ACM. 631, 664
Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for
collaborative filtering. In ICML. 482
Sanger, T. D. (1994). Neural network learning control of robot manipulators using
gradually increasing task difficulty. IEEE Transactions on Robotics and Automation,
10(3). 329
Saul, L. K. and Jordan, M. I. (1996). Exploiting tractable substructures in intractable
networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural
Information Processing Systems 8 (NIPS’95). MIT Press, Cambridge, MA. 641
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 27, 695
Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation
on implementing mlp-bp on fpgas: A study. Neural Networks, IEEE Transactions on,
18(1), 240–252. 454
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random
weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 364
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. In ICLR. 285, 286, 303
Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization.
In International Conference on Learning Representations. 309
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 401
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural
Networks, 7(1), 142–146. 480
Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118 .
391
Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: Support vector machines,
regularization, optimization, and beyond. MIT Press. 705
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 163, 521
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods
Support Vector Learning. MIT Press, Cambridge, MA. 18, 142
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On
causal and anticausal learning. In ICML’2012 , pages 1255–1262. 548
Schuster, M. (1999). On supervised learning from sequential data with applications for
speech recognition. 189
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 396
Schwenk, H. (2007). Continuous space language models. Computer speech and language,
21, 492–518. 469
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 476
Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 21
Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural net-
works. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information
Processing Systems 10 (NIPS’97), pages 647–653. MIT Press. 257
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large
vocabulary continuous speech recognition. In International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 765–768, Orlando, Florida. 469
Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space
language models for the IWSLT 2006 task. In International Workshop on Spoken
Language Translation, pages 166–173. 476
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 23
Sejnowski, T. (1987). Higher-order Boltzmann machines. In AIP Conference Proceedings
151 on Neural Networks for Computing, pages 398–403. American Institute of Physics
Inc. 688
Series, P., Reichert, D. P., and Storkey, A. J. (2010). Hallucinations in Charles Bonnet
syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in
Neural Information Processing Systems, pages 2020–2028. 668
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 459
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 23, 200
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications.
31
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210),
545–548. 380
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 380
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 380, 406
Sietsma, J. and Dow, R. (1991). Creating artificial neural networks that generalize. Neural
Networks, 4(1), 67–79. 241
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional
neural networks. In ICDAR’2003. 372
Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances
in Neural Information Processing Systems, pages 232–239. 454
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 269, 271, 357
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 269
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition – tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 269
Simons, D. J. and Levin, D. T. (1998). Failure to detect changes to people during a
real-world interaction. Psychonomic Bulletin & Review, 5(4), 644–649. 546
Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. In ICLR. 324
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum,
with application to neural networks. International Journal of Control, 62(6), 1391–1407.
249
Skinner, B. F. (1958). Reinforcement today. American Psychologist, 13, 94–99. 329
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 574, 590, 658
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of
machine learning algorithms. In NIPS’2012 . 439
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic
pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011 .
401, 403
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural lan-
guage with recursive neural networks. In Proceedings of the Twenty-Eighth International
Conference on Machine Learning (ICML’2011). 401
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 401
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts,
C. (2013a). Recursive deep models for semantic compositionality over a sentiment
treebank. In EMNLP’2013 . 401, 403
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013b). Zero-shot learning through
cross-modal transfer. In 27th Annual Conference on Neural Information Processing
Systems (NIPS 2013). 542
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep
unsupervised learning using nonequilibrium thermodynamics. In ICML'2015,
arXiv:1503.03585. 717, 718
Sohn, K., Zhou, G., and Lee, H. (2013). Learning and selecting features jointly with
point-wise gated Boltzmann machines. In ICML’2013 . 689
Solomonoff, R. J. (1989). A system for incremental learning based on algorithmic proba-
bility. 329
Sontag, E. D. (1998). VC dimension of neural networks. NATO ASI Series F Computer
and Systems Sciences, 168, 69–96. 550, 554
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 284
Sparkes, B. (1996). The Red and the Black: Studies in Greek Pottery. Routledge. 1
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: how
“less is more” in unsupervised dependency parsing. In HLT’10 . 329
Squire, W. and Trapp, G. (1998). Using complex variables to estimate derivatives of real
functions. SIAM Rev., 40(1), 110–112. 442
Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of
the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag. 239
Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U.
Toronto. 538
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 544
Srivastava, N., Salakhutdinov, R. R., and Hinton, G. E. (2013). Modeling documents with
deep Boltzmann machines. arXiv preprint arXiv:1309.6865 . 665
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15, 1929–1958. 257, 263, 264, 265, 674
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks.
arXiv:1505.00387 . 327
Steinkraus, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning
algorithms. In Eighth International Conference on Document Analysis and Recognition
(ICDAR 2005), pages 1115–1119. 448
Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical
model parameters given approximate inference, decoding, and model structure. In
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics
(AISTATS), volume 15 of JMLR Workshop and Conference Proceedings, pages 725–733,
Fort Lauderdale. Supplementary material (4 pages) also available. 676, 700
Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). Weakly supervised memory
networks. arXiv preprint arXiv:1503.08895 . 421
Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In
CVPR’2013 . 329
Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks
with smart initialization. CoRR, abs/1412.6558. 290, 303, 305, 405
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 408, 415
Sutskever, I. and Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal
approximators. Neural Computation, 20(11), 2629–2636. 695
Sutskever, I. and Tieleman, T. (2010). On the convergence properties of contrastive
divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–795.
615
Sutskever, I., Hinton, G., and Taylor, G. (2009). The recurrent temporal restricted
Boltzmann machine. In NIPS’2008 . 688
Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent
neural networks. In ICML’2011 , pages 1017–1024. 480
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 300, 408, 415
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with
neural networks. In NIPS’2014, arXiv:1409.3215 . 25, 101, 397, 411, 414, 477, 478
Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
106
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods
for reinforcement learning with function approximation. In NIPS'1999, pages 1057–1063.
MIT Press. 693
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 516
Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization.
arXiv preprint arXiv:1406.3896 . 439
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report,
arXiv:1409.4842. 24, 27, 200, 257, 267, 327, 348
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and
Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
267, 268, 270
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the
Inception architecture for computer vision. arXiv e-prints. 244, 323
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 100
Tandy, D. W. (1997). Works and Days: A Translation and Commentary for the Social
Sciences. University of California Press. 1
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 241
Tang, Y., Salakhutdinov, R., and Hinton, G. (2012). Deep mixtures of factor analysers.
arXiv preprint arXiv:1206.4635 . 492
Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines
for modeling motion style. In L. Bottou and M. Littman, editors, Proceedings of
the Twenty-sixth International Conference on Machine Learning (ICML’09), pages
1025–1032, Montreal, Quebec, Canada. ACM. 688
Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary
latent variables. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19 (NIPS’06), pages 1345–1352. MIT Press, Cambridge,
MA. 687
Teh, Y., Welling, M., Osindero, S., and Hinton, G. E. (2003). Energy-based models
for sparse overcomplete representations. Journal of Machine Learning Research, 4,
1235–1260. 494
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 163, 521, 536
Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative
models. arXiv:1511.01844. 699, 721
Tompson, J., Jain, A., LeCun, Y., and Bregler, C. (2014). Joint training of a convolutional
network and a graphical model for human pose estimation. In NIPS'2014. 361
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 269
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 236
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Pro-
ceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08),
pages 1064–1071. ACM. 615
Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive
divergence. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth
International Conference on Machine Learning (ICML’09), pages 1033–1040. ACM.
617
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 494
Torralba, A., Fergus, R., and Weiss, Y. (2008). Small codes and large databases for
recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference
(CVPR’08), pages 1–8. 528
Touretzky, D. S. and Hinton, G. E. (1985). Symbols among the neurons: Details of
a connectionist inference architecture. In Proceedings of the 9th International Joint
Conference on Artificial Intelligence - Volume 1, IJCAI'85, pages 238–243, San Francisco,
CA, USA. Morgan Kaufmann Publishers Inc. 17
Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of
probabilistic grammars. In IJCAI’2011 . 329
Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk,
W., and Seung, H. S. (2010). Convolutional networks can learn to generate affinity
graphs for image segmentation. Neural Computation, 22(2), 511–538. 360
Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and
general method for semi-supervised learning. In Proc. ACL’2010, pages 384–394. 538
Töscher, A., Jahrer, M., and Bell, R. M. (2009). The BigChaos solution to the Netflix
grand prize. 482
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural
autoregressive density-estimator. In NIPS'2013. 711
van den Oord, A., Dieleman, S., and Schrauwen, B. (2013). Deep content-based music
recommendation. In NIPS'2013. 483
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 480, 522
Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks
on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
447, 455
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 114
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
114
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 114
Vincent, P. (2011). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 516, 518, 714
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
523
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 241, 518
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 518
Vincent, P., de Brébisson, A., and Bouthillier, X. (2015). Efficient exact gradient update
for training deep networks with very large sparse targets. In C. Cortes, N. D. Lawrence,
D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information
Processing Systems 28 , pages 1108–1116. Curran Associates, Inc. 468
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a).
Grammar as a foreign language. Technical report, arXiv:1412.7449. 411
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural image
caption generator. arXiv:1411.4555. 411
Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer networks. arXiv preprint
arXiv:1506.03134 . 421
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 102
Viola, P. and Jones, M. (2001). Robust real-time object detection. International
Journal of Computer Vision. 452
Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., and Bengio, Y. (2015).
ReNet: A recurrent neural network based alternative to convolutional networks. arXiv
preprint arXiv:1505.00393 . 397
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 16
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 264
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 375, 456, 462
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural
networks using dropconnect. In ICML’2013 . 265
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013 . 264
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014a). Knowledge graph and text jointly
embedding. In Proc. EMNLP’2014 . 487
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014b). Knowledge graph embedding by
translating on hyperplanes. In Proc. AAAI’2014 . 487
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 261, 265, 266
Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N.
(1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 454
Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforce-
ment learning. In Proc. UAI’2001 , pages 538–545. 693
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 163, 522
Weiss, Y., Torralba, A., and Fergus, R. (2008). Spectral hashing. In NIPS, pages
1753–1760. 528
Welling, M., Zemel, R. S., and Hinton, G. E. (2002). Self-supervised boosting. In Advances
in Neural Information Processing Systems, pages 665–672. 705
Welling, M., Hinton, G. E., and Osindero, S. (2003a). Learning sparse topographic
representations with products of Student-t distributions. In NIPS’2002 . 682
Welling, M., Zemel, R., and Hinton, G. E. (2003b). Self-supervised boosting. In S. Becker,
S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing
Systems 15 (NIPS’02), pages 665–672. MIT Press. 626
Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums
with an application to information retrieval. In L. Saul, Y. Weiss, and L. Bottou,
editors, Advances in Neural Information Processing Systems 17 (NIPS’04), volume 17,
Cambridge, MA. MIT Press. 678
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In
Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC , pages 762–770. 225
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to
rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 403
Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint
arXiv:1410.3916 . 421, 488
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 15, 21, 24, 27
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclopedia.
[Online; accessed 4-March-2015]. 24, 27
Williams, C. K. I. and Agakov, F. V. (2002). Products of Gaussians and probabilistic
minor component analysis. Neural Computation, 14(5), 1169–1182. 684
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information
Processing Systems 8 (NIPS’95), pages 514–520. MIT Press, Cambridge, MA. 142
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8, 229–256. 690, 691
Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully
recurrent neural networks. Neural Computation, 1, 270–280. 222
Wilson, D. R. and Martinez, T. R. (2003). The general inefficiency of batch training for
gradient descent learning. Neural Networks, 16(10), 1429–1451. 279
Wilson, J. R. (1984). Variance reduction techniques for digital simulation. American
Journal of Mathematical and Management Sciences, 4(3), 277–312. 692
Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of
invariances. Neural Computation, 14(4), 715–770. 497
Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1, 67–82. 293
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural
Computation, 8(7), 1341–1390. 116
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 450
Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal on
Optimization, 7, 814–836. 328
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
264
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and
Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual
attention. In ICML’2015, arXiv:1502.03044 . 102, 411, 693
Yildiz, I. B., Jaeger, H., and Kiebel, S. J. (2012). Re-visiting the echo state property.
Neural Networks, 35, 1–9. 407
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features
in deep neural networks? In NIPS’2014 . 324, 539
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 615
Yu, D., Wang, S., and Deng, L. (2010). Sequential labeling using deep-structured
conditional random fields. IEEE Journal of Selected Topics in Signal Processing. 324
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv:1410.4615. 330
Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural Turing machines.
arXiv:1505.00521 . 422
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 553
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks.
In ECCV’14 . 6
Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior,
A., Vanhoucke, V., Dean, J., and Hinton, G. E. (2013). On rectified linear units for
speech processing. In ICASSP 2013 . 462
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors
emerge in deep scene CNNs. ICLR’2015, arXiv:1412.6856. 554
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 717
Zhou, Y. and Chellappa, R. (1988). Computation of optical flow using a neural network.
In IEEE International Conference on Neural Networks, pages 71–78. IEEE. 340
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014 . 717