CHAPTER 6. DEEP FEEDFORWARD NETWORKS
may not be able to find the value of the parameters that corresponds to the desired
function. Second, the training algorithm might choose the wrong function due to
overfitting. Recall from Sec. 5.2.1 that the “no free lunch” theorem shows that
there is no universally superior machine learning algorithm. Feedforward networks
provide a universal system for representing functions, in the sense that, given a
function, there exists a feedforward network that approximates the function. There
is no universal procedure for examining a training set of specific examples and
choosing a function that will generalize to points not in the training set.
The universal approximation theorem says that there exists a network large
enough to achieve any degree of accuracy we desire, but the theorem does not
say how large this network will be. Barron (1993) provides some bounds on the
size of a single-layer network needed to approximate a broad class of functions.
Unfortunately, in the worse case, an exponential number of hidden units (possibly
with one hidden unit corresponding to each input configuration that needs to be
distinguished) may be required. This is easiest to see in the binary case: the
number of possible binary functions on vectors
v ∈ {
0
,
1
}
n
is 2
2
n
and selecting
one such function requires 2
n
bits, which will in general require
O
(2
n
) degrees of
freedom.
In summary, a feedforward network with a single layer is sufficient to represent
any function, but the layer may be infeasibly large and may fail to learn and
generalize correctly. In many circumstances, using deeper models can reduce the
number of units required to represent the desired function and can reduce the
amount of generalization error.
There exist families of functions which can be approximated efficiently by an
architecture with depth greater than some value
d
, but which require a much larger
model if depth is restricted to be less than or equal to
d
. In many cases, the number
of hidden units required by the shallow model is exponential in
n
. Such results
were first proven for models that do not resemble the continuous, differentiable
neural networks used for machine learning, but have since been extended to these
models. The first results were for circuits of logic gates (Håstad, 1986). Later
work extended these results to linear threshold units with non-negative weights
(Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with
continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern
neural networks use rectified linear units. Leshno et al. (1993) demonstrated
that shallow networks with a broad family of non-polynomial activation functions,
including rectified linear units, have universal approximation properties, but these
results do not address the questions of depth or efficiency—they specify only that
a sufficiently wide rectifier network could represent any function. Pascanu et al.
198