Multilayer Perceptron
where 𝑣𝑗 is the induced local field (i.e., the weighted sum of all synaptic inputs
plus the bias) of neuron j, and 𝑦𝑗 is the output of the neuron.
The presence of non-linearities is important because otherwise the input-output
relation of the network could be reduced to that of a single-layer perceptron.
The network contains one or more layers of hidden neurons that are not part of the input or
output of the network.
These hidden neurons enable the network to learn complex tasks by extracting
progressively more meaningful features from the input patterns (vectors).
The network exhibits a high degree of connectivity, determined by the synapses of the
network.
A change in the connectivity of the network requires a change in the population of
synaptic connections or their weights.
It is through the combination of these characteristics, together with the ability to learn from
experience through training, that the multilayer perceptron derives its computing power.
These same characteristics, however, are also responsible for the deficiencies in our present state
of knowledge on the behavior of the network.
First, the presence of a distributed form of nonlinearity and the high connectivity of the network
make the theoretical analysis of a multilayer perceptron difficult to undertake.
Second, the use of hidden neurons makes the learning process harder to visualize.
In an implicit sense, the learning process must decide which features of the input pattern should
be represented by the hidden neurons.
The learning process is therefore made more difficult because the search has to be conducted in a
much larger space of possible functions, and a choice has to be made between alternative
representations of the input pattern (Hinton, 1989).
The use of the term "back-propagation" appears to have evolved after 1985, when it was
popularized through the publication of the seminal book entitled Parallel Distributed Processing
(Rumelhart and McClelland, 1986).
The development of the back-propagation algorithm represents a landmark in neural networks in
that it provides a computationally efficient method for the training of multilayer perceptrons.
We cannot claim, however, that the back-propagation algorithm provides an optimal solution for
all solvable problems.
SOME PRELIMINARIES
Figure 4.1 shows the architectural graph of a multilayer perceptron with two hidden layers and an
output layer.
To set the stage for a description of the multilayer perceptron in its general form, the network
shown here is fully connected.
This means that a neuron in any layer of the network is connected to all the nodes/neurons in the
previous layer.
Signal flow through the network progresses in a forward direction, from left to right and on a layer-
by-layer basis.
FIGURE 4.1 Architectural graph of a multilayer perceptron with two hidden layers.
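As a concrete illustration of this layer-by-layer forward signal flow, here is a minimal NumPy sketch of a fully connected network with two hidden layers; the layer sizes, the logistic activation, and the random weights are illustrative assumptions rather than values from the text.

```python
import numpy as np

def logistic(v):
    # A smooth nonlinear activation (logistic function); assumed here,
    # any differentiable nonlinearity would do.
    return 1.0 / (1.0 + np.exp(-v))

# Assumed layer sizes: 4 source nodes, two hidden layers of 5 and 3
# neurons, and an output layer of 2 neurons.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

def forward(x):
    # The function signal propagates forward, layer by layer; every neuron
    # is fed by all the nodes/neurons in the previous layer (full connectivity).
    y = x
    for W, b in zip(weights, biases):
        v = W @ y + b        # induced local field of each neuron in the layer
        y = logistic(v)      # function signal at each neuron's output
    return y

print(forward(np.array([0.5, -1.0, 0.2, 0.8])))
```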
Figure 4.2 depicts a portion of the multilayer perceptron. Two kinds of signals are identified in
this network (Parker, 1987):
1. Function Signals. A function signal is an input signal (stimulus) that comes in at the input end
of the network, propagates forward (neuron by neuron) through the network, and emerges at the
output end of the network as an output signal.
We refer to such a signal as a "function signal" for two reasons:
First, it is presumed to perform a useful function at the output of the network.
Second, at each neuron of the network through which a function signal passes, the
signal is calculated as a function of the inputs and associated weights applied to that
neuron.
The function signal is also referred to as the input signal.
FIGURE 4.2 Illustration of the directions of two basic signal flows in a multilayer perceptron:
forward propagation of function signals and back-propagation of error signals.
2. Error Signals. An error signal originates at an output neuron of the network, and propagates
backward (layer by layer) through the network.
We refer to it as an "error signal" because its computation by every neuron of the network
involves an error-dependent function in one form or another.
The output neurons (computational nodes) constitute the output layer of the network. The
remaining neurons (computational nodes) constitute the hidden layers of the network.
Thus the hidden units are not part of the output or input of the network; hence their designation as
"hidden."
The first hidden layer is fed from the input layer made up of sensory units (source nodes); the
resulting outputs of the first hidden layer are in turn applied to the next hidden layer; and so on for
the rest of the network.
Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
1. The computation of the function signal appearing at the output of a neuron, which is expressed
as a continuous nonlinear function of the input signal and synaptic weights associated with that
neuron.
2. The computation of an estimate of the gradient vector (i.e., the gradients of the error surface
with respect to the weights connected to the inputs of a neuron), which is needed for the backward
pass through the network.
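The sketch below (a hypothetical helper, not from the text) makes the pairing of these two computations concrete: during the forward pass the neuron computes its function signal and retains the quantities that the gradient computation of the backward pass will need.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def neuron_forward_with_cache(w_j, y_prev):
    # Computation 1: the function signal y_j, a continuous nonlinear
    # function of the inputs y_prev and the synaptic weights w_j.
    v_j = w_j @ y_prev
    y_j = logistic(v_j)
    # The induced local field and the inputs are kept because
    # computation 2 (the gradient estimate in the backward pass)
    # needs phi'(v_j) and y_prev.
    cache = (v_j, y_prev)
    return y_j, cache

y_j, cache = neuron_forward_with_cache(np.array([0.1, 0.4, -0.2]),
                                       np.array([1.0, 0.5, -0.3]))
print(y_j)
```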
The derivation of the back-propagation algorithm is rather involved. To ease the mathematical
burden involved in this derivation, we first present a summary of the notations used in the
derivation.
Notation
• The indices 𝑖, 𝑗, and 𝑘 refer to different neurons in the network; with signals propagating through
the network from left to right, neuron j lies in a layer to the right of neuron i, and neuron k lies in
a layer to the right of neuron j when neuron j is a hidden unit.
• In iteration (time step) n, the nth training pattern (example) is presented to the network.
• The symbol ℰ(𝑛) refers to the instantaneous sum of error squares, or error energy, at iteration n.
The average of ℰ(𝑛) over all values of n (i.e., over the entire training set) yields the average error
energy ℰ𝑎𝑣.
• The symbol 𝑒𝑗 (𝑛) refers to the error signal at the output of neuron j for iteration n.
• The symbol 𝑑𝑗 (𝑛) refers to the desired response for neuron j and is used to compute 𝑒𝑗 (𝑛).
• The symbol 𝑦𝑗 (𝑛) refers to the function signal appearing at the output of neuron j at iteration n.
• The symbol 𝑤𝑗𝑖 (𝑛) denotes the synaptic weight connecting the output of neuron i to the input of
neuron j at iteration n. The correction applied to this weight at iteration n is denoted by Δ𝑤𝑗𝑖 (𝑛).
• The induced local field (i.e., weighted sum of all synaptic inputs plus bias) of neuron j at iteration
n is denoted by 𝑣𝑗 (𝑛); it constitutes the signal applied to the activation function associated with
neuron j.
• The activation function describing the input-output functional relationship of the nonlinearity
associated with neuron j is denoted by 𝜙𝑗 (·).
• The bias applied to neuron j is denoted by 𝑏𝑗 ; its effect is represented by a synapse of weight
𝑤𝑗0 = 𝑏𝑗 connected to a fixed input equal to +1.
Correspondingly, the instantaneous value ℰ(𝑛) of the total error energy is obtained by summing
$\tfrac{1}{2}e_j^2(n)$ over all neurons in the output layer; these are the only "visible" neurons for which error
signals can be calculated directly. We may thus write
$$\mathcal{E}(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n) \qquad (4.2)$$
where the set C includes all the neurons in the output layer of the network.
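A one-line check of Eq. (4.2) with made-up desired responses and outputs for a two-neuron output layer (the numbers are purely illustrative):

```python
import numpy as np

d = np.array([1.0, 0.0])        # desired responses d_j(n) (assumed values)
y = np.array([0.8, 0.3])        # function signals y_j(n) at the output layer
e = d - y                       # error signals e_j(n) of the output neurons
energy = 0.5 * np.sum(e ** 2)   # E(n) = (1/2) * sum over j in C of e_j(n)^2
print(energy)                   # 0.065
```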
Let N denote the total number of patterns (examples) contained in the training set.
The average squared error energy is obtained by summing ℰ(𝑛) over all n and then normalizing
with respect to the set size N, as shown by
$$\mathcal{E}_{av} = \frac{1}{N}\sum_{n=1}^{N}\mathcal{E}(n) \qquad (4.3)$$
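Continuing the sketch, the average error energy of Eq. (4.3) is just the mean of the per-pattern energies over the training set (again with illustrative numbers):

```python
import numpy as np

energies = np.array([0.065, 0.040, 0.120, 0.090])  # E(n) for n = 1, ..., N (assumed)
E_av = energies.sum() / energies.size              # Eq. (4.3)
print(E_av)                                        # 0.07875
```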
The instantaneous error energy ℰ(𝑛), and therefore the average error energy ℰ𝑎𝑣, is a function of
all the free parameters (i.e., synaptic weights and bias levels) of the network.
For a given training set, ℰ𝑎𝑣 represents the cost function as a measure of learning performance.
The objective of the learning process is to adjust the free parameters of the network to minimize
ℰ𝑎𝑣 .
To do this minimization, we use an approximation similar in principle to that used for the
derivation of the LMS algorithm.
Specifically, we consider a simple method of training in which the weights are updated on a
pattern-by-pattern basis until one epoch (i.e., one complete presentation of the entire training set)
has been dealt with.
The adjustments to the weights are made in accordance with the respective errors computed for
each pattern presented to the network.
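A skeleton of this pattern-by-pattern (sequential) mode of training might look as follows; the training set and the update routine are placeholders to be filled in once the delta rule is derived below.

```python
def train_one_epoch(weights, training_set, update, eta=0.1):
    # One epoch: a complete presentation of the entire training set,
    # with the weights corrected after each pattern using the error
    # computed for that pattern.
    for x, d in training_set:              # nth training example (input, desired response)
        weights = update(weights, x, d, eta)
    return weights
```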
Consider, then, a neuron j fed by a set of function signals produced by a layer of neurons to its left.
The induced local field 𝑣𝑗 (𝑛) produced at the input of the activation function associated with
neuron j is
$$v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n) \qquad (4.4)$$
where m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic weight
𝑤𝑗0 (corresponding to the fixed input 𝑦0 = +1) equals the bias 𝑏𝑗 applied to neuron j. Hence the
function signal 𝑦𝑗 (𝑛) appearing at the output of neuron j at iteration n is
$$y_j(n) = \phi_j\big(v_j(n)\big) \qquad (4.5)$$
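In code, with the bias folded into the weight vector as described in the notation (and the logistic function standing in for 𝜙𝑗, an assumed choice), Eqs. (4.4) and (4.5) read:

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(w_j, y_prev):
    # Eq. (4.4): induced local field; y_prev[0] is the fixed +1 input,
    # so w_j[0] plays the role of the bias b_j.
    v_j = w_j @ y_prev
    # Eq. (4.5): function signal at the output of neuron j.
    y_j = logistic(v_j)
    return v_j, y_j

y_prev = np.array([1.0, 0.5, -0.3, 0.8])   # +1 bias input plus m = 3 inputs (assumed)
w_j = np.array([0.1, 0.4, -0.2, 0.3])
print(neuron_output(w_j, y_prev))          # (0.6, ~0.646)
```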
In a manner similar to the LMS algorithm, the back-propagation algorithm applies a correction
Δ𝑤𝑗𝑖 (𝑛) to the synaptic weight 𝑤𝑗𝑖 (𝑛), which is proportional to the partial derivative
$\partial\mathcal{E}(n)/\partial w_{ji}(n)$.
According to the chain rule of calculus, we may express this gradient as
$$\frac{\partial\mathcal{E}(n)}{\partial w_{ji}(n)} = \frac{\partial\mathcal{E}(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial w_{ji}(n)} \qquad (4.6)$$
Differentiating Eq. (4.2) with respect to $e_j(n)$, we get
$$\frac{\partial\mathcal{E}(n)}{\partial e_j(n)} = e_j(n) \qquad (4.7)$$
Since the error signal is $e_j(n) = d_j(n) - y_j(n)$, differentiating it with respect to $y_j(n)$ gives
$$\frac{\partial e_j(n)}{\partial y_j(n)} = -1 \qquad (4.8)$$
Next, differentiating Eq. (4.5) with respect to $v_j(n)$, we obtain
$$\frac{\partial y_j(n)}{\partial v_j(n)} = \phi_j'\big(v_j(n)\big) \qquad (4.9)$$
where the use of the prime (on the right-hand side) signifies differentiation with respect to the
argument. Finally, differentiating Eq. (4.4) with respect to $w_{ji}(n)$ yields
$$\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n) \qquad (4.10)$$
The use of Eqs. (4.7) to (4.10) in (4.6) yields
$$\frac{\partial\mathcal{E}(n)}{\partial w_{ji}(n)} = -\,e_j(n)\,\phi_j'\big(v_j(n)\big)\,y_i(n) \qquad (4.11)$$
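The following sketch evaluates the four factors of Eqs. (4.7) through (4.10) for a single synapse of an output neuron and checks that their product matches the closed form of Eq. (4.11); the logistic activation and all numerical values are assumptions.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def logistic_prime(v):
    # phi_j'(v_j) for the logistic nonlinearity (assumed choice of phi_j).
    y = logistic(v)
    return y * (1.0 - y)

# Illustrative single-synapse setup: one input y_i, weight w_ji, no bias.
y_i, w_ji, d_j = 0.5, 0.4, 1.0
v_j = w_ji * y_i                 # induced local field (single input for brevity)
y_j = logistic(v_j)              # function signal
e_j = d_j - y_j                  # error signal

dE_de = e_j                      # Eq. (4.7)
de_dy = -1.0                     # Eq. (4.8)
dy_dv = logistic_prime(v_j)      # Eq. (4.9)
dv_dw = y_i                      # Eq. (4.10)

dE_dw = dE_de * de_dy * dy_dv * dv_dw           # chain rule, Eq. (4.6)
print(dE_dw, -e_j * logistic_prime(v_j) * y_i)  # both match Eq. (4.11)
```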
The correction Δ𝑤𝑗𝑖 (𝑛) applied to 𝑤𝑗𝑖 (𝑛) is defined by the delta rule:
$$\Delta w_{ji}(n) = -\eta\,\frac{\partial\mathcal{E}(n)}{\partial w_{ji}(n)} \qquad (4.12)$$
where 𝜂 is the learning-rate parameter of the back-propagation algorithm.
The use of the minus sign in Eq. (4.12) accounts for gradient descent in weight space (i.e., seeking
a direction for weight change that reduces the value of ℰ(𝑛)).
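Given the gradient from the previous sketch, the delta rule of Eq. (4.12) is a one-line update; the learning rate and numerical values below are illustrative.

```python
eta = 0.1                  # learning-rate parameter (assumed value)
dE_dw = -0.0557            # partial derivative from the previous sketch (approximate)
w_ji = 0.4                 # current synaptic weight (illustrative)

delta_w = -eta * dE_dw     # Eq. (4.12): step against the gradient
w_ji = w_ji + delta_w      # the weight moves in the direction that reduces E(n)
print(w_ji)                # 0.40557
```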
Accordingly, the use of Eq. (4.11) in (4.12) yields
$$\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n) \qquad (4.13)$$
where the local gradient $\delta_j(n)$ is defined by
$$\delta_j(n) = -\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = -\frac{\partial\mathcal{E}(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\,\phi_j'\big(v_j(n)\big) \qquad (4.14)$$
The local gradient points to required changes in synaptic weights. According to Eq. (4.14), the
local gradient 𝛿𝑗 (𝑛) for output neuron j is equal to the product of the corresponding error signal
𝑒𝑗 (𝑛) for that neuron and the derivative 𝜙𝑗′ (𝑣𝑗 (𝑛)) of the associated activation function.
From Eqs. (4.13) and (4.14) we note that a key factor involved in the calculation of the
weight adjustment Δ𝑤𝑗𝑖 (𝑛) is the error signal 𝑒𝑗 (𝑛) at the output of neuron j.
In this context we may identify two distinct cases, depending on where in the network
neuron j is located.
In case 1, neuron j is an output node:
This case is simple to handle because each output node of the network is supplied with a
desired response of its own, making it a straightforward matter to calculate the associated
error signal.
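For this case, the whole update for an output neuron can be written compactly in terms of the local gradient; the sketch below assumes the logistic activation and the bias-as-weight convention used earlier.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def logistic_prime(v):
    y = logistic(v)
    return y * (1.0 - y)

def update_output_neuron(w_j, y_prev, d_j, eta=0.1):
    v_j = w_j @ y_prev                    # Eq. (4.4)
    y_j = logistic(v_j)                   # Eq. (4.5)
    e_j = d_j - y_j                       # error signal at the output neuron
    delta_j = e_j * logistic_prime(v_j)   # local gradient, Eq. (4.14)
    return w_j + eta * delta_j * y_prev   # weight correction, Eq. (4.13)

y_prev = np.array([1.0, 0.5, -0.3])       # +1 bias input plus two inputs (assumed)
w_j = np.array([0.1, 0.4, -0.2])
print(update_output_neuron(w_j, y_prev, d_j=1.0))
```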
In case 2, neuron j is a hidden node.
Even though hidden neurons are not directly accessible, they share responsibility for any
error made at the output of the network.
The question, however, is how to penalize or reward hidden neurons for their share of the
responsibility.
This problem is the credit-assignment problem considered earlier.
It is solved in an elegant fashion by back-propagating the error signals through the network.
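As a preview of that solution, the local gradient of a hidden neuron is obtained by weighting and summing the local gradients of the neurons in the layer to its right; the formula itself is derived later in the chapter, and the sketch below applies the standard back-propagation expression with illustrative numbers.

```python
import numpy as np

# Local gradient of hidden neuron j, obtained by back-propagating the
# local gradients delta_k of the neurons k fed by neuron j:
#     delta_j = phi_j'(v_j) * sum_k delta_k * w_kj
phi_prime_vj = 0.21                    # phi_j'(v_j) for the hidden neuron (illustrative)
delta_next = np.array([0.05, -0.02])   # local gradients of the neurons in the next layer
w_kj = np.array([0.3, -0.6])           # weights from neuron j to those neurons

delta_j = phi_prime_vj * (delta_next @ w_kj)
print(delta_j)                         # 0.00567
```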