neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and
Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the
performance of ASR based on neural nets approximately matched the performance
of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved
a 26% phoneme error rate on the TIMIT corpus (Garofolo et al., 1993), which has
39 phonemes to discriminate between, a result better than or comparable to that
of HMM-based systems. Since then, TIMIT has served as a benchmark for phoneme
recognition, playing a role similar to the one MNIST plays for object recognition.
However, because of the complex engineering involved in software systems for
speech recognition and the effort that had been invested in building these systems
on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both
academic and industrial research on neural nets for speech recognition mostly
focused on using them to learn extra features for GMM-HMM systems.
Later, with much larger and deeper models and much larger datasets,
recognition accuracy was dramatically improved by using neural networks to
replace GMMs for the task of associating acoustic features to phonemes (or sub-
phonemic states). Starting in 2009, speech researchers applied a form of deep
learning based on unsupervised learning to speech recognition. This approach
relied on training undirected probabilistic models called restricted Boltzmann
machines (RBMs) to model the input data. RBMs will be
described in Part III. To solve speech recognition tasks, unsupervised pretraining
was used to build deep feedforward networks whose layers were each initialized
by training an RBM. These networks take spectral acoustic representations in
a fixed-size input window (around a center frame) and predict the conditional
probabilities of HMM states for that center frame. Training such deep networks
helped to significantly improve the recognition rate on TIMIT (Mohamed et al.,
2009, 2012a), bringing down the phoneme error rate from about 26% to 20.7%.
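To make this pipeline concrete, the following is a minimal sketch, not an
implementation from any of the cited papers, of the forward pass of such a
window-based acoustic model. All dimensions (40 spectral coefficients per
frame, five frames of context on each side, three hidden layers of 1,024
units, 2,000 HMM states) are illustrative assumptions rather than values from
the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions (assumptions, not values from the papers):
# 40 spectral coefficients per frame, 5 frames of context on each side,
# three hidden layers of 1024 units, 2000 HMM states.
n_coeffs, context = 40, 5
sizes = [n_coeffs * (2 * context + 1), 1024, 1024, 1024, 2000]

rng = np.random.default_rng(0)
# In the pretraining approach described above, each weight matrix would be
# initialized from an RBM trained layer by layer; here we use random values.
params = [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def hmm_state_posteriors(frames, t):
    """Conditional probabilities of HMM states for the center frame t."""
    window = frames[t - context:t + context + 1].reshape(-1)  # fixed-size input
    h = window
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

frames = rng.normal(size=(100, n_coeffs))        # stand-in spectral features
print(hmm_state_posteriors(frames, t=50).shape)  # (2000,)
```

In a full hybrid system, these per-frame posteriors would typically be
converted to scaled likelihoods and combined with the HMM transition model by
a decoder.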
See Mohamed et al. (2012b) for an analysis of reasons for the success of these
models. Extensions to the basic phone recognition pipeline included the addition
of speaker-adaptive features (Mohamed et al., 2011) that further reduced the
error rate. This was quickly followed by work extending the architecture from
phoneme recognition (the focus of TIMIT) to large-vocabulary speech recognition
(Dahl et al., 2012), which involves recognizing not just phonemes but also
sequences of words drawn from a large vocabulary. Deep networks
for speech recognition eventually shifted from being based on pretraining and
Boltzmann machines to being based on techniques such as rectified linear units and
dropout (Zeiler et al., 2013; Dahl et al., 2013).
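As a hedged illustration of that shift, the sigmoid hidden layers in the
earlier sketch can be replaced with rectified linear layers trained with
dropout. The dropout rate and the inverted-dropout rescaling used here are
illustrative choices, not details from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def hidden_layer(h, W, b, p_drop=0.5, train=True):
    """One rectified linear hidden layer with inverted dropout."""
    h = relu(h @ W + b)
    if train:
        mask = rng.random(h.shape) > p_drop  # drop each unit with probability p_drop
        h = h * mask / (1.0 - p_drop)        # rescale so expected activations match test time
    return h
```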
By that time, several of the major speech groups in industry had started
exploring deep learning in collaboration with