
Multilayer Perceptrons

 Multilayer feedforward networks are an important class of neural networks.


 The network consists of a set of sensory units (source nodes) that constitute the input layer,
one or more hidden layers of computation nodes, and an output layer of computation nodes.
 The input signal propagates through the network in a forward direction, on a layer-by-layer
basis.
 These neural networks are commonly referred to as multilayer perceptrons (MLPs), which
represent a generalization of the single-layer perceptron.
 Multilayer perceptrons have been applied successfully to solve some difficult and diverse
problems by training them in a supervised manner with a highly popular algorithm known
as the error back-propagation algorithm.
 This algorithm is based on the error-correction learning rule.
 It may be viewed as a generalization of an equally popular adaptive filtering algorithm: the
ubiquitous least-mean-square (LMS) algorithm for the special case of a single linear
neuron.
 Basically, error back-propagation learning consists of two passes through the different
layers of the network:
 a forward pass and a backward pass.
 In the forward pass, an activity pattern (input vector) is applied to the sensory nodes of the
network, and its effect propagates through the network layer by layer.
 Finally, a set of outputs is produced as the actual response of the network.
 During the forward pass the synaptic weights of the networks are all fixed.
 During the backward pass, on the other hand, the synaptic weights are all adjusted in
accordance with an error-correction rule.
 Specifically, the actual response of the network is subtracted from a desired (target)
response to produce an error signal.
 This error signal is then propagated backward through the network, against the direction of
the synaptic connections; hence the name "error back-propagation."
 The synaptic weights are adjusted to make the actual response of the network move closer
to the desired response in a statistical sense.
 The error back-propagation algorithm is also referred to in the literature as the back-
propagation algorithm, or simply back-prop.
 Henceforth we will refer to it as the back-propagation algorithm.
 The learning process performed with the algorithm is called back-propagation learning.
A multilayer perceptron has three distinctive characteristics:
 The model of each neuron in the network includes a nonlinear activation function.
 The important point to emphasize here is that the nonlinearity is smooth (i.e.,
differentiable everywhere).
 A commonly used form of nonlinearity that satisfies this requirement is a sigmoidal
nonlinearity defined by the logistic function (a short numerical sketch of this function
is given after this list of characteristics):

y_j = \frac{1}{1 + \exp(-v_j)}

 where 𝑣𝑗 is the induced local field (i.e., the weighted sum of all synaptic inputs
plus the bias) of neuron j, and 𝑦𝑗 is the output of the neuron.
 The presence of non-linearities is important because otherwise the input-output
relation of the network could be reduced to that of a single-layer perceptron.
 The network contains one or more layers of hidden neurons that are not part of the input or
output of the network.
 These hidden neurons enable the network to learn complex tasks by extracting
progressively more meaningful features from the input patterns (vectors).
 The network exhibits a high degree of connectivity, determined by the synapses of the
network.
 A change in the connectivity of the network requires a change in the population of
synaptic connections or their weights.
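To make the logistic nonlinearity above concrete, here is a minimal Python sketch (NumPy is an assumption; the text prescribes no implementation language) that evaluates the logistic function and its derivative φ'(v) = y(1 − y), the quantity that reappears later in Eq. (4.9):

```python
import numpy as np

def logistic(v):
    """Logistic (sigmoidal) activation: y = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_prime(v):
    """Derivative of the logistic function: phi'(v) = y * (1 - y)."""
    y = logistic(v)
    return y * (1.0 - y)

# Example induced local fields v_j (weighted sum of inputs plus bias).
v_j = np.array([-2.0, 0.0, 2.0])
print(logistic(v_j))        # smooth outputs in (0, 1)
print(logistic_prime(v_j))  # differentiable everywhere, as required
```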
It is through the combination of these characteristics, together with the ability to learn from
experience through training, that the multilayer perceptron derives its computing power.
These same characteristics, however, are also responsible for the deficiencies in our present state
of knowledge on the behavior of the network.
First, the presence of a distributed form of nonlinearity and the high connectivity of the network
make the theoretical analysis of a multilayer perceptron difficult to undertake.
Second, the use of hidden neurons makes the learning process harder to visualize.
In an implicit sense, the learning process must decide which features of the input pattern should
be represented by the hidden neurons.
The learning process is therefore made more difficult because the search has to be conducted in a
much larger space of possible functions, and a choice has to be made between alternative
representations of the input pattern (Hinton, 1989).
The usage of the term "back-propagation" appears to have evolved after 1985, when its use was
popularized through the publication of the seminal book entitled Parallel Distributed Processing
(Rumelhart and McClelland, 1986).
The development of the back-propagation algorithm represents a landmark in neural networks in
that it provides a computationally efficient method for the training of multilayer perceptrons.
We cannot claim, however, that the back-propagation algorithm provides an optimal solution for
all solvable problems.
SOME PRELIMINARIES
Figure 4.1 shows the architectural graph of a multilayer perceptron with two hidden layers and an
output layer.
To set the stage for a description of the multilayer perceptron in its general form, the network
shown here is fully connected.
This means that a neuron in any layer of the network is connected to all the nodes/neurons in the
previous layer.
Signal flow through the network progresses in a forward direction, from left to right and on a layer-
by-layer basis.

FIGURE 4.1 Architectural graph of a multilayer perceptron with two hidden layers.
Figure 4.2 depicts a portion of the multilayer perceptron. Two kinds of signals are identified in
this network (Parker, 1987):
1. Function Signals. A function signal is an input signal (stimulus) that comes in at the input end
of the network, propagates forward (neuron by neuron) through the network, and emerges at the
output end of the network as an output signal.
We refer to such a signal as a "function signal" for two reasons:
 First, it is presumed to perform a useful function at the output of the network.
 Second, at each neuron of the network through which a function signal passes, the
signal is calculated as a function of the inputs and associated weights applied to that
neuron.
 The function signal is also referred to as the input signal.
FIGURE 4.2 Illustration of the directions of two basic signal flows in a multilayer perceptron:
forward propagation of function signals and back-propagation of error signals.
2. Error Signals. An error signal originates at an output neuron of the network, and propagates
backward (layer by layer) through the network.
We refer to it as an "error signal" because:
 its computation by every neuron of the network involves an error-dependent function in
one form or another.
The output neurons (computational nodes) constitute the output layer of the network. The
remaining neurons (computational nodes) constitute hidden layers of the network.
Thus the hidden units are not part of the output or input of the network; hence their designation as
"hidden."
The first hidden layer is fed from the input layer made up of sensory units (source nodes); the
resulting outputs of the first hidden layer are in turn applied to the next hidden layer; and so on for
the rest of the network.
Each hidden or output neuron of a multilayer perceptron is designed to perform two computations:
1. The computation of the function signal appearing at the output of a neuron, which is expressed
as a continuous nonlinear function of the input signal and synaptic weights associated with that
neuron.
2. The computation of an estimate of the gradient vector (i.e., the gradients of the error surface
with respect to the weights connected to the inputs of a neuron), which is needed for the backward
pass through the network.
The derivation of the back-propagation algorithm is rather involved. To ease the mathematical
burden involved in this derivation, we first present a summary of the notations used in the
derivation.
Notation
• The indices 𝑖, 𝑗, and 𝑘 refer to different neurons in the network; with signals propagating through
the network from left to right, neuron j lies in a layer to the right of neuron i, and neuron k lies in
a layer to the right of neuron j when neuron j is a hidden unit.
• In iteration (time step) n, the nth training pattern (example) is presented to the network.
• The symbol ℰ(𝑛) refers to the instantaneous sum of error squares, or error energy, at iteration n.
The average of ℰ(𝑛) over all values of n (i.e., over the entire training set) yields the average error energy
ℰ𝑎𝑣.
• The symbol 𝑒𝑗 (𝑛) refers to the error signal at the output of neuron j for iteration n.

• The symbol 𝑑𝑗 (𝑛) refers to the desired response for neuron j and is used to compute 𝑒𝑗 (𝑛).

• The symbol 𝑦𝑗 (𝑛) refers to the function signal appearing at the output of neuron j at iteration n.

• The symbol 𝑤𝑗𝑖 (𝑛) denotes the synaptic weight connecting the output of neuron i to the input of
neuron j at iteration n. The correction applied to this weight at iteration n is denoted by Δ𝑤𝑗𝑖 (𝑛).

• The induced local field (i.e., weighted sum of all synaptic inputs plus bias) of neuron j at iteration
n is denoted by 𝑣𝑗 (𝑛); it constitutes the signal applied to the activation function associated with
neuron j.
• The activation function describing the input-output functional relationship of the nonlinearity
associated with neuron j is denoted by 𝜙(·).
• The bias applied to neuron j is denoted by 𝑏𝑗 ; its effect is represented by a synapse of weight
𝑤𝑗0 = 𝑏𝑗 connected to a fixed input equal to +1.

• The ith element of the input vector (pattern) is denoted by 𝑥𝑖 (𝑛).


• The kth element of the overall output vector (pattern) is denoted by 𝑜𝑘 (𝑛).
• The learning-rate parameter is denoted by 𝜂.
• The symbol 𝑚𝑙 denotes the size (i.e., the number of nodes) of layer 𝑙 of the multilayer perceptron;
𝑙 = 0, 1, . . . , 𝐿, where L is the "depth" of the network.
Thus 𝑚0 denotes the size of the input layer, 𝑚1 denotes the size of the first hidden layer, and 𝑚𝐿
denotes the size of the output layer. The notation 𝑚𝐿 = 𝑀 is also used.
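As a small illustration of this notation, the following Python sketch (hypothetical layer sizes, chosen only for the example; NumPy assumed) sets up the weight matrices of a fully connected multilayer perceptron. The weight connecting neuron i of layer l − 1 to neuron j of layer l is stored at W[l][j, i], matching the w_ji convention, with column 0 reserved for the bias weight w_j0 = b_j fed by the fixed input +1:

```python
import numpy as np

# Hypothetical layer sizes m_0, m_1, m_2, m_L (so L = 3 and M = m[L] = 2).
m = [4, 5, 3, 2]
L = len(m) - 1

rng = np.random.default_rng(0)

# W[l] holds the synaptic weights feeding layer l (l = 1..L).
# Row j is neuron j of layer l; column i is the signal from neuron i of
# layer l-1, with column 0 reserved for the bias (fixed input y_0 = +1).
W = {l: rng.standard_normal((m[l], m[l - 1] + 1)) * 0.1 for l in range(1, L + 1)}

for l in range(1, L + 1):
    print(f"layer {l}: weight matrix shape {W[l].shape}")  # (m_l, m_(l-1) + 1)
```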
BACK-PROPAGATION ALGORITHM
The error signal at the output of neuron j at iteration n (i.e., presentation of the nth training
example) is defined by

e_j(n) = d_j(n) - y_j(n),   neuron j is an output node   (4.1)

We define the instantaneous value of the error energy for neuron j as \frac{1}{2} e_j^2(n).
Correspondingly, the instantaneous value ℰ(𝑛) of the total error energy is obtained by summing
\frac{1}{2} e_j^2(n) over all neurons in the output layer; these are the only "visible" neurons for which
error signals can be calculated directly. We may thus write

\mathcal{E}(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)   (4.2)
where the set C includes all the neurons in the output layer of the network.
Let N denote the total number of patterns (examples) contained in the training set.
The average squared error energy is obtained by summing ℰ(𝑛) over all n and then normalizing
with respect to the set size N, as shown by

\mathcal{E}_{av} = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}(n)   (4.3)
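A minimal numerical sketch of Eqs. (4.1)-(4.3), using a small hand-made set of desired responses and network outputs purely for illustration (all numbers hypothetical):

```python
import numpy as np

# Hypothetical desired responses d_j(n) and actual outputs y_j(n) for a
# network with 2 output neurons and N = 3 training patterns.
d = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([[0.8, 0.2],
              [0.3, 0.6],
              [0.9, 0.7]])

e = d - y                              # error signals, Eq. (4.1)
E_n = 0.5 * np.sum(e ** 2, axis=1)     # instantaneous error energy E(n), Eq. (4.2)
E_av = np.mean(E_n)                    # average error energy, Eq. (4.3)

print(E_n)   # one value of E(n) per pattern
print(E_av)  # cost function to be minimized by adjusting the free parameters
```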

The instantaneous error energy ℰ(𝑛), and therefore the average error energy ℰ𝑎𝑣, is a function of
all the free parameters (i.e., synaptic weights and bias levels) of the network.
For a given training set, ℰ𝑎𝑣 represents the cost function as a measure of learning performance.
The objective of the learning process is to adjust the free parameters of the network to minimize
ℰ𝑎𝑣 .
To do this minimization, we use an approximation similar in principle to that used for the
derivation of the LMS algorithm.
Specifically, we consider a simple method of training in which the weights are updated on a
pattern-by-pattern basis until one epoch, that is, one complete presentation of the entire training
set, has been dealt with.
The adjustments to the weights are made in accordance with the respective errors computed for
each pattern presented to the network.

FIGURE 4.3 Signal-flow graph highlighting the details of output neuron j.


Consider then Fig. 4.3, which depicts neuron j being fed by a set of function signals produced by
a layer of neurons to its left. The induced local field 𝑣𝑗 (𝑛) produced at the input of the activation
function associated with neuron j is therefore
v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)   (4.4)

where m is the total number of inputs (excluding the bias) applied to neuron j. The synaptic weight
𝑤𝑗0 (corresponding to the fixed input 𝑦0 = +1) equals the bias 𝑏𝑗 applied to neuron j. Hence the
function signal 𝑦𝑗 (𝑛) appearing at the output of neuron j at iteration n is

y_j(n) = \phi_j(v_j(n))   (4.5)
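The two forward-pass relations (4.4) and (4.5) can be sketched in Python as follows, again assuming the logistic activation and purely hypothetical numbers:

```python
import numpy as np

def phi(v):
    """Logistic activation function, as used in the text."""
    return 1.0 / (1.0 + np.exp(-v))

# Function signals y_i(n) from the previous layer, with y_0 = +1 so that
# w_j0 plays the role of the bias b_j.
y_prev = np.array([1.0, 0.4, -0.7, 0.2])   # y_0, y_1, y_2, y_3 (hypothetical)
w_j = np.array([0.1, 0.5, -0.3, 0.8])      # w_j0 (= b_j), w_j1, w_j2, w_j3

v_j = np.dot(w_j, y_prev)   # induced local field, Eq. (4.4)
y_j = phi(v_j)              # function signal at the output of neuron j, Eq. (4.5)

print(v_j, y_j)
```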

In a manner similar to the LMS algorithm, the back-propagation algorithm applies a correction
Δ𝑤𝑗𝑖 (𝑛) to the synaptic weight 𝑤𝑗𝑖 (𝑛), which is proportional to the partial derivative
∂ℰ(n)/∂w_ji(n). According to the chain rule of calculus, we may express this gradient as

\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}   (4.6)

The partial derivative ∂ℰ(n)/∂w_ji(n) represents a sensitivity factor, determining the direction of
search in weight space for the synaptic weight 𝑤𝑗𝑖.

Differentiating both sides of Eq. (4.2) with respect to 𝑒𝑗 (𝑛), we get

\frac{\partial \mathcal{E}(n)}{\partial e_j(n)} = e_j(n)   (4.7)

Differentiating both sides of Eq. (4.1) with respect to 𝑦𝑗 (𝑛), we get

\frac{\partial e_j(n)}{\partial y_j(n)} = -1   (4.8)

Next, differentiating Eq. (4.5) with respect to 𝑣𝑗 (𝑛), we get

\frac{\partial y_j(n)}{\partial v_j(n)} = \phi_j'(v_j(n))   (4.9)
where the use of prime (on the right-hand side) signifies differentiation with respect to the
argument. Finally, differentiating Eq. (4.4) with respect to 𝑤𝑗𝑖 (𝑛) yields

\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)   (4.10)

The use of Eqs. (4.7) to (4.10) in (4.6) yields


\frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)} = -e_j(n) \phi_j'(v_j(n)) y_i(n)   (4.11)

The correction Δ𝑤𝑗𝑖 (𝑛) applied to 𝑤𝑗𝑖 (𝑛) is defined by the delta rule:

\Delta w_{ji}(n) = -\eta \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}   (4.12)
where 𝜂 is the learning-rate parameter of the back-propagation algorithm.
The use of the minus sign in Eq. (4.12) accounts for gradient descent in weight space (i.e., seeking
a direction for weight change that reduces the value of ℰ(𝑛)).
Accordingly, the use of Eq. (4.11) in (4.12) yields

\Delta w_{ji}(n) = \eta \, \delta_j(n) y_i(n)   (4.13)

where the local gradient 𝛿𝑗 (𝑛) is defined by

\delta_j(n) = -\frac{\partial \mathcal{E}(n)}{\partial v_j(n)}
            = -\frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)}
            = e_j(n) \phi_j'(v_j(n))   (4.14)

The local gradient points to required changes in synaptic weights. According to Eq. (4.14), the
local gradient 𝛿𝑗 (𝑛) for output neuron j is equal to the product of the corresponding error signal
𝑒𝑗 (𝑛) for that neuron and the derivative 𝜙𝑗′ (𝑣𝑗 (𝑛)) of the associated activation function.
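Putting Eqs. (4.12)-(4.14) together for an output neuron, a minimal sketch of one weight correction follows (the learning rate and all signal values are hypothetical, and the logistic activation is assumed):

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    y = phi(v)
    return y * (1.0 - y)

eta    = 0.5                               # learning-rate parameter (arbitrary choice)
y_prev = np.array([1.0, 0.4, -0.7, 0.2])   # y_0 = +1 carries the bias
w_j    = np.array([0.1, 0.5, -0.3, 0.8])
d_j    = 1.0

v_j = np.dot(w_j, y_prev)                  # induced local field, Eq. (4.4)
y_j = phi(v_j)                             # function signal, Eq. (4.5)
e_j = d_j - y_j                            # error signal, Eq. (4.1)

delta_j = e_j * phi_prime(v_j)             # local gradient, Eq. (4.14)
dw_j = eta * delta_j * y_prev              # delta rule, Eqs. (4.12)-(4.13)

w_j = w_j + dw_j                           # corrected synaptic weights
print(delta_j, dw_j)
```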

 From Eqs. (4.13) and (4.14) we note that a key factor involved in the calculation of the
weight adjustment Δ𝑤𝑗𝑖 (𝑛) is the error signal 𝑒𝑗 (𝑛) at the output of neuron j.
 In this context we may identify two distinct cases, depending on where in the network
neuron j is located.
In case 1, neuron j is an output node:
 This case is simple to handle because each output node of the network is supplied with a
desired response of its own, making it a straightforward matter to calculate the associated
error signal.
In case 2, neuron j is a hidden node.
 Even though hidden neurons are not directly accessible, they share responsibility for any
error made at the output of the network.
 The question, however, is to know how to penalize or reward hidden neurons for their share
of the responsibility.
 This problem is the credit-assignment problem considered earlier.
 It is solved in an elegant fashion by back-propagating the error signals through the network.
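To show how this credit assignment works out in practice, the sketch below performs one forward pass and one backward pass for a tiny one-hidden-layer perceptron. The hidden-layer local gradients are obtained by propagating the output-layer gradients backward through the output-layer weights, i.e. δ_j(n) = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n) for a hidden neuron j, which is the standard back-propagation formula for the hidden-node case; all layer sizes and numbers here are hypothetical, and NumPy with the logistic activation is assumed:

```python
import numpy as np

def phi(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime(v):
    y = phi(v)
    return y * (1.0 - y)

rng = np.random.default_rng(1)
eta = 0.5

# Hypothetical sizes: 3 source nodes, 4 hidden neurons, 2 output neurons.
# Column 0 of each weight matrix is the bias weight (fixed input +1).
W1 = rng.standard_normal((4, 3 + 1)) * 0.1   # hidden layer
W2 = rng.standard_normal((2, 4 + 1)) * 0.1   # output layer

x = np.array([0.2, -0.5, 0.9])               # input pattern x_i(n)
d = np.array([1.0, 0.0])                     # desired response d_k(n)

# ---- Forward pass: weights fixed, function signals propagate left to right.
y0 = np.concatenate(([1.0], x))              # prepend the +1 bias input
v1 = W1 @ y0                                 # induced local fields, hidden layer
y1 = phi(v1)
y1b = np.concatenate(([1.0], y1))
v2 = W2 @ y1b                                # induced local fields, output layer
o = phi(v2)                                  # overall outputs o_k(n)

# ---- Backward pass: error signals propagate right to left.
e = d - o                                    # Eq. (4.1), output nodes only
delta2 = e * phi_prime(v2)                   # output-layer local gradients, Eq. (4.14)
# Hidden-layer local gradients: back-propagate delta2 through W2
# (bias column excluded, since the bias input feeds no earlier neuron).
delta1 = phi_prime(v1) * (W2[:, 1:].T @ delta2)

# Delta-rule corrections, Eq. (4.13): Delta w_ji(n) = eta * delta_j(n) * y_i(n)
W2 += eta * np.outer(delta2, y1b)
W1 += eta * np.outer(delta1, y0)

print(0.5 * np.sum(e ** 2))                  # instantaneous error energy E(n)
```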
