
Unit III

Uploaded by

ajankit0712
Copyright
© All Rights Reserved

Unit III

Introduction to Neural Network


An Artificial Neural Network (ANN) is a machine learning model inspired by the biological neural
networks of the human brain, and it is the building block of deep learning. ANNs are among the
most powerful machine learning algorithms in use today. They grew out of attempts to replicate
how the human brain works, so their operation resembles that of biological neural networks without
being identical to it. A basic ANN accepts only numeric, structured input, so other kinds of data
must first be encoded as numbers.

This article explores Artificial Neural Networks (ANN) in machine learning, including how
specialized variants such as CNNs and RNNs process unstructured data like images, text, and speech.
You’ll learn about neural networks in AI, their types, and their role in machine learning. ANN
algorithms are modeled on brain function and are used to capture complicated patterns and make
forecasts. Their interconnected units, called neurons, work together to identify patterns, learn
from data, and generate predictions. ANNs are commonly used for tasks such as image recognition,
language processing, and decision making.

Basic Architecture of Neural Network:

Artificial Neural Networks are built from artificial neurons, which are called units. These units are
arranged in a series of layers that together make up the whole network. A layer can have anywhere
from a dozen units to millions of units, depending on how complex the network must be to learn the
hidden patterns in the dataset. An Artificial Neural Network commonly has an input layer, an output
layer, and one or more hidden layers. The input layer receives the data from the outside world that
the network needs to analyze or learn about. This data then passes through one or more hidden
layers, which transform the input into a form that is useful to the output layer. Finally, the
output layer produces the network's response to the input data.

In the majority of neural networks, units are interconnected from one layer to another. Each of
these connections has a weight that determines the influence of one unit on another. As data passes
from one unit to the next, each layer transforms it further, which eventually results in an output
from the output layer.

Artificial neural networks, also known as neural networks or neural nets, are modeled on the
structure and operation of human neurons. The input layer is the first layer: it receives input from
external sources and passes it to the hidden layer, which is the second layer. In the hidden layer,
each neuron receives input from the neurons of the previous layer, computes the weighted sum, and
sends it to the neurons in the next layer. The connections are weighted, meaning that the effect of
each input from the previous layer is scaled up or down by its own weight, and these weights are
adjusted during training to improve model performance.

The “backpropagation algorithm” is the method by which a neural network learns from its mistakes.
A loss function quantifies the prediction error, and an optimization technique called “gradient
descent” reduces it. This technique is a cornerstone of supervised learning, as it iteratively
adjusts the weights to minimize the error. To find the best values for the weights, we make small
adjustments to them and examine the impact on the prediction error.

Structure of Neural Network

An artificial neuron, or neural node, is a mathematical model. In most cases, it computes the
weighted sum of its inputs and then adds a bias to it. It then passes the result through an
activation function. The activation function is typically a nonlinear function, such as the sigmoid,
that takes this linear combination as input and gives a nonlinear output.

Components of a Neuron in an ANN:

1. Inputs:

 Neurons receive one or more inputs, which are typically the features of the data fed
into the network (e.g., pixel values of an image, words in text, etc.).
 Each input has an associated weight, which determines its importance in the
computation. Weights are learned during training.
2. Summation (Weighted Sum):

 Each input value is multiplied by its corresponding weight.


 These weighted values are then summed together to produce a total input to the
neuron.
 The formula for this step is:

$$\text{Total input} = \sum_{i} (\text{Input}_i \times \text{Weight}_i)$$

3. Bias:

 A bias is added to the weighted sum. This is a constant value that helps the neuron
make adjustments to the output, ensuring that it can represent more complex
functions.
 The equation becomes:

$$\text{Total input with bias} = \sum_{i} (\text{Input}_i \times \text{Weight}_i) + \text{Bias}$$

4. Activation Function:

 The result of the weighted sum plus the bias is passed through an activation
function. This function determines whether the neuron should be activated (i.e.,
produce an output) and introduces non-linearity into the network, enabling it to
learn complex patterns.
 Common activation functions include:

• Sigmoid: Outputs a value between 0 and 1.


• ReLU (Rectified Linear Unit): Outputs the input directly if it is positive,
otherwise, it outputs zero.
• Tanh: Outputs a value between -1 and 1.
• Softmax: Used in classification tasks to give a probability distribution over
multiple classes.

5. Output:

 The final output of the neuron is the result of the activation function applied to the
total input (weighted sum + bias).
 This output can then be passed to the next layer of neurons (in a multi-layer
network) or used as the final prediction (in simpler networks).

Example:

In a simple neural network, consider a neuron that takes three inputs:

 Inputs: X1=0.5, X2=1.0, X3=0.2


 Weights: W1=0.3, W2=0.7, W3=0.9
 Bias: b=0.1

The total input to the neuron is:

Total Input = (0.5×0.3) + (1.0×0.7) + (0.2×0.9) + 0.1 = 0.15 + 0.7 + 0.18 + 0.1 = 1.13

If we use the ReLU activation function:

Output = ReLU(1.13) = 1.13 (since 1.13 is positive)

This output (1.13) will be passed on to the next layer or used as the final result.
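
The same calculation can be written as a short Python sketch. The function names `relu` and `neuron_output` are illustrative choices, not part of any library; the numbers are the ones used in the example above.

```python
def relu(z):
    """Rectified Linear Unit: keep positive values, replace the rest with 0."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    total_input = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total_input)


# Values from the worked example: X1..X3, W1..W3 and b.
output = neuron_output([0.5, 1.0, 0.2], [0.3, 0.7, 0.9], 0.1)
print(round(output, 2))  # 1.13
```

Passing a different function as `activation` (for example a sigmoid) changes only the last step; the weighted sum and bias are computed the same way.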

Weights:

Imagine a neural network as a complex web of interconnected nodes, each representing a
computational unit known as a neuron. These neurons work together to process information and
produce output. However, not all connections between neurons are created equal. This is where
weights come into play.

Weights are numerical values associated with the connections between neurons. They determine
the strength of these connections and, in turn, the influence that one neuron's output has on another
neuron's input. Think of weights as the coefficients that adjust the impact of incoming data. They
can increase or decrease the importance of specific information.

During the training phase of a neural network, these weights are adjusted iteratively to minimize
the difference between the network's predictions and the actual outcomes. This process is akin to
fine-tuning the network's ability to make accurate predictions.

Let's consider a practical example to illustrate the role of weights. Suppose you're building a neural
network to recognize handwritten digits. Each pixel in an image of a digit can be considered an
input to the network. The weights associated with each pixel determine how much importance the
network places on that pixel when making a decision about which digit is represented in the image.

As the network learns from a dataset of labelled digits, it adjusts these weights to give more
significance to pixels that are highly correlated with the correct digit and less significance to pixels
that are less relevant. Over time, the network learns to recognize patterns in the data and make
accurate predictions.

In essence, weights are the neural network's way of learning from data. They capture the
relationships between input features and the target output, allowing the network to generalize and
make predictions on new, unseen data.

Bias:

In simple words, a neural network bias is a constant that is added to the weighted sum of the
features. It is used to offset the result, which helps the model shift the activation function
towards the positive or negative side. Every hidden neuron and every output neuron has its own
bias, so in a fully connected network:

The total number of biases = Total number of hidden neurons + the number of output neurons

Let us understand the importance of bias with the help of an example.

Consider a sigmoid activation function, which is represented by the equation below:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

On replacing the variable ‘x’ with the equation of a line, we get the following:

$$\sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}}$$

In the above equation, ‘w’ is the weight, ‘x’ is the feature, and ‘b’ is the bias. Setting ‘b’
equal to 0 gives an S-shaped curve whose output is 0.5 at x = 0.

If we vary the value of the weight ‘w’ while keeping the bias ‘b’ = 0, we get a family of curves
that differ in steepness but all pass through the same point at x = 0.

While changing the values of ‘w’, there is no way we can shift the origin of the activation function,
i.e., the sigmoid function. Changing the values of ‘w’ only changes the steepness of the curve.

Without a bias, the model can only fit curves that pass through this fixed point, which does not
match real-world scenarios. Introducing a bias makes the model more flexible.

Example: Imagine a neuron that processes the brightness of an image pixel. Without a bias, the
brightness level at which the neuron switches on is fixed. By introducing a bias, you allow that
level to be moved, so the neuron can activate when the brightness is slightly below or above the
original threshold.

This flexibility is crucial because real-world data is rarely perfectly aligned with specific
thresholds. Biases enable neurons to activate in response to various input conditions, making
neural networks more robust and capable of handling complex patterns.

There is only one way to shift the origin, and that is to include the bias ‘b’. On keeping the value
of the weight ‘w’ fixed and varying the value of the bias ‘b’, we will get the graph below:

From the graph, it can be concluded that the bias is required for shifting the origin of the curve to
the left or right.

How to shift the curve to the left or the right

If we substitute ‘w’ = -1 and ‘b’ = 0 in the sigmoid equation σ(wx + b), the curve is mirrored
about the vertical axis.

The output now approaches 1 for values of ‘x’ less than 0 and approaches 0 for values of ‘x’
greater than 0, crossing 0.5 at x = 0.

Now, how should we design our equation so that the output of the activation function is close to 1
for values of ‘x’ less than 5? We achieve this by introducing the bias term ‘b’ in our equation
with the value ‘b’ = 5, which moves the crossing point from x = 0 to x = 5.
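
The effect of the weight and the bias on a sigmoid neuron can be checked with a small sketch. The helper names `neuron` and `midpoint` are hypothetical, and the values of `w` and `b` are picked only for illustration. The midpoint is the input at which the output crosses 0.5, which happens where w·x + b = 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b):
    """Output of a single sigmoid neuron with one input."""
    return sigmoid(w * x + b)


def midpoint(w, b):
    """Input x at which the output equals 0.5, i.e. where w*x + b = 0."""
    return -b / w


# Changing only the weight changes the steepness; the midpoint stays at x = 0.
for w in (0.5, 1, 3):
    print(w, midpoint(w, 0), neuron(0, w, 0))

# Changing the bias moves the midpoint to x = -b / w.
print(midpoint(1, -5))   # rising curve centred on x = 5
print(midpoint(-1, 5))   # falling curve centred on x = 5
```

With `w = -1` and `b = 5`, the output is above 0.5 for inputs below 5 and below 0.5 for inputs above 5.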

Why is bias added in neural networks?

To understand the concept of neural network bias, let’s begin by discussing single layer neural
networks.

A given neural network computes the function Y=f(X), where X and Y are the feature vector and
output vector, respectively, with independent components. If the given neural network has weights
‘W’, then it can also be represented as Y=f(X,W). If the dimensionality of both X and Y is equal
to 1, the function can be plotted in a two-dimensional plane.

Such a neural network can approximate any linear function of the form y=mx + c. When c=0, then
y=f(x)=mx and the neural network can approximate only functions passing through the origin. Once
the constant term c is included, it can approximate any linear function in the plane.

How to add bias to neural networks

Consider the conditions mentioned in the previous section for the neural network. From there, we
can infer that if the predictions of the function are off by a constant amount, a bias ‘b’ can be
added to the output values to obtain the true values. The neural network then computes the
function y=f(x) + b, which is every prediction of the network shifted by the constant ‘b’.

Now, if we add one more input to the neural network layer, the function is defined as y=f(x1,x2).
Since x1 and x2 are independent components, each can be given its own bias, b1 and b2, so the
network can be represented as y=f(x1, x2) + b1 + b2. Because b1 and b2 are both constants, their
sum behaves as a single bias b = b1 + b2, which is why a neuron carries one bias term regardless
of how many inputs it has.

Applications of Artificial Neural Networks

1. Social Media:
Artificial Neural Networks are used heavily in social media. For example, take the
‘People you may know’ feature on Facebook, which suggests people you might know in real
life so that you can send them friend requests. This is achieved by using Artificial Neural
Networks that analyze your profile, your interests, your current friends, their friends, and
various other factors to work out which people you might know. Another common application
of machine learning in social media is facial recognition. This is done by finding around 100
reference points on the person’s face and then matching them with those already available in
the database using convolutional neural networks.

2. Marketing and Sales:


When you log onto e-commerce sites like Amazon and Flipkart, they recommend products
to buy based on your previous browsing history. Similarly, if you love pasta, then Zomato,
Swiggy, etc. will show you restaurant recommendations based on your tastes and previous
order history. This is true across all new-age marketing segments like book sites, movie
services, hospitality sites, etc., and it is done by implementing personalized marketing.
This uses Artificial Neural Networks to identify the customer's likes, dislikes, previous
shopping history, etc., and then tailor the marketing campaigns accordingly.

3. Healthcare:
Artificial Neural Networks are used in oncology to train algorithms that can identify
cancerous tissue at the microscopic level with accuracy comparable to that of trained
physicians. Various rare diseases may manifest in physical characteristics and can be
identified in their early stages by using facial analysis on patient photos. The full-scale
implementation of Artificial Neural Networks in healthcare can therefore enhance the
diagnostic abilities of medical experts and ultimately improve the quality of medical care
all over the world.

4. Personal Assistants:
You have probably heard of Siri, Alexa, Cortana, etc., and may have used one of them on
your own phone. These are personal assistants and an example of speech recognition that
uses Natural Language Processing to interact with users and formulate a response
accordingly. Natural Language Processing uses artificial neural networks to handle many
of these assistants' tasks, such as managing language syntax, semantics, correct speech,
and the ongoing conversation.

Benefits of Artificial Neural Networks:

ANNs offer several key benefits that make them particularly well suited to specific problems and
situations:
• ANNs can learn and model non-linear and complicated interactions, which is critical since
many of the relationships between inputs and outputs in real life are non-linear and
complex.
• ANNs can generalize: after learning from the original inputs and their associations, the
model can infer relationships in unseen data, allowing it to make predictions on data it
has not encountered before.
• ANNs do not impose any constraints on the input variables (such as how they should be
distributed), unlike many other prediction approaches. Furthermore, numerous studies have
demonstrated that ANNs can better model heteroscedasticity, that is, data with high
volatility and non-constant variance, because of their capacity to discover latent
relationships in the data without imposing any preset associations. This is particularly
helpful in financial time series forecasting (for example, stock prices), where data
volatility is significant.

Activation functions in Neural Networks

While building a neural network, one key decision is selecting the activation function for both
the hidden layers and the output layer. Activation functions decide whether a neuron should be
activated.

An activation function is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to learn and represent complex
patterns in the data. Without this non-linearity, a neural network would behave like a linear
regression model, no matter how many layers it has.

The neuron first calculates the weighted sum of its inputs and adds a bias term; the activation
function is then applied to this value to decide whether, and how strongly, the neuron is
activated. By introducing non-linearity to the output of each neuron, it helps the model make
complex decisions and predictions.

Neural networks consist of neurons that operate using weights, biases, and activation functions.
In the learning process, these weights and biases are updated based on the error produced at the
output. This process is known as backpropagation. Activation functions enable backpropagation
by providing gradients that are essential for updating the weights and biases.

Without non-linearity, even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions empower neural networks to model highly complex data
distributions and solve advanced deep learning tasks. Adding non-linear activation functions
introduces flexibility and enables the network to learn more complex and abstract patterns from data.
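
The claim that purely linear layers collapse into one linear function can be verified on a tiny case. This is a minimal sketch with one unit per layer; the weights and biases are arbitrary values chosen for illustration.

```python
def linear_layer(x, w, b):
    """A single unit with a linear (identity) activation."""
    return w * x + b


def two_linear_layers(x):
    hidden = linear_layer(x, w=2.0, b=1.0)
    return linear_layer(hidden, w=-3.0, b=0.5)


def collapsed_single_layer(x):
    # -3 * (2x + 1) + 0.5 simplifies to -6x - 2.5: still one straight line.
    return linear_layer(x, w=-6.0, b=-2.5)


def two_layers_with_relu(x):
    # Same weights, but a ReLU between the layers breaks the equivalence.
    hidden = max(0.0, linear_layer(x, w=2.0, b=1.0))
    return linear_layer(hidden, w=-3.0, b=0.5)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, two_linear_layers(x), collapsed_single_layer(x), two_layers_with_relu(x))
```

The first two columns of output agree for every input, while the ReLU version differs wherever the hidden unit's input is negative. That difference is what lets a network with non-linear activations represent more than a straight line.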

Types of Activation Functions in Deep Learning

I.Linear Activation Function

The Linear Activation Function, or Identity Function, returns the input as the output and
resembles a straight line defined by y=x. No matter how many layers the neural network contains,
if they all use linear activation functions, the output is a linear combination of the input.
• The range of the output spans from −∞ to +∞.
• A linear activation function is typically used in just one place: the output layer.
• Using linear activation across all layers limits the network's ability to learn complex
patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.

II.Non-Linear Activation Functions

1. Sigmoid Function
Sigmoid Activation Function is characterized by its ‘S’ shape. It is mathematically defined as
$$A = \frac{1}{1 + e^{-x}}$$
This formula ensures a smooth and continuous output that is essential for gradient-based
optimization methods.
 It allows neural networks to handle and model complex patterns that linear equations
cannot.
 The output ranges between 0 and 1, hence useful for binary classification.
 The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity
means that small changes in input x can cause significant changes in output y, which is
critical during the training process.

Sigmoid or Logistic Activation Function Graph

2. Tanh Activation Function


The tanh function, or hyperbolic tangent function, is a scaled and shifted version of the sigmoid
that stretches its output range to span both sides of zero. It is defined as:
$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$
Alternatively, it can be expressed using the sigmoid function:
$$f(x) = \tanh(x) = 2 \times \text{sigmoid}(2x) - 1$$
 Value Range: Outputs values from -1 to +1.
 Non-linear: Enables modeling of complex data patterns.
 Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
 If one has to choose between the sigmoid and tanh and has no specific reason to prefer one
over the other, tanh is often the better choice because of the reasons mentioned above.
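
The relationship between the two functions can be checked numerically. The sketch below uses only Python's standard `math` module; `tanh_via_sigmoid` is an illustrative name, and the simple `sigmoid` shown here is not hardened against overflow for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh expressed as a scaled and shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The last two columns should agree up to floating-point rounding.
    print(x, sigmoid(x), tanh_via_sigmoid(x), math.tanh(x))
```

At x = 0 the sigmoid outputs 0.5 while tanh outputs 0, which is the zero-centred behaviour mentioned above.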

Tanh Activation Function

3. ReLU (Rectified Linear Unit) Function


ReLU activation is defined by A(x)=max(0,x). This means that if the input x is positive, ReLU
returns x; if the input is negative, it returns 0.
• Value Range: [0,∞), meaning the function only outputs non-negative values.
• Nature: It is a non-linear activation function, allowing neural networks to learn complex
patterns, and its simple gradient makes backpropagation efficient.
• Advantage over other activations: ReLU is less computationally expensive than tanh and
sigmoid because it involves simpler mathematical operations. At any given time only some
of the neurons are activated, which makes the network sparse and therefore efficient to compute.

ReLU Activation Function

III.Softmax and Softplus Functions

1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms raw
output scores from a neural network into probabilities. It works by squashing the output values of
each class into the range of 0 to 1, while ensuring that the sum of all probabilities equals 1.
 Softmax is a non-linear activation function.
 The Softmax function ensures that each class is assigned a probability, helping to identify
which class the input belongs to.
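
A minimal softmax sketch in plain Python is shown below. The input scores are made-up values standing in for the raw outputs of a three-class network. Subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # guards against overflow in exp(); the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the largest raw score always receives the largest probability, so `predicted_class` is the index of the highest score.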

Softmax Activation Function

2. SoftPlus Function
Softplus function is defined mathematically as:
$$A(x) = \log(1 + e^{x})$$
This equation ensures that the output is always positive and differentiable at all points, which is an
advantage over the traditional ReLU function.
• Nature: The Softplus function is non-linear.
• Range: The function outputs values in the range (0,∞), similar to ReLU, but without the
hard zero threshold that ReLU has.
• Smoothness: Softplus is a smooth function, meaning it avoids the sharp corner that ReLU
has at x = 0, where ReLU is not differentiable, which can sometimes lead to problems
during optimization.
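
Softplus and ReLU can be compared side by side with a short sketch. It uses `math.log1p` from the standard library to compute log(1 + e^x); `softplus_gradient` relies on the fact that the derivative of softplus is the sigmoid function.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x):
    """Smooth approximation of ReLU: log(1 + e^x)."""
    return math.log1p(math.exp(x))


def softplus_gradient(x):
    """Derivative of softplus, which is the sigmoid of x."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, relu(x), softplus(x), softplus_gradient(x))
```

For large positive inputs the two functions are nearly identical, while for negative inputs softplus stays slightly above zero instead of being cut off.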

Softplus Activation Function

Impact of Activation Functions on Model Performance


The choice of activation function has a direct impact on the performance of a neural network in
several ways:
1. Convergence Speed: Functions like ReLU allow faster training by mitigating the vanishing
gradient problem, while Sigmoid and Tanh can slow down convergence in deep networks.
2. Gradient Flow: Activation functions like ReLU ensure better gradient flow, helping
deeper layers learn effectively. In contrast, Sigmoid can lead to small gradients, hindering
learning in deep layers.
3. Model Complexity: Activation functions like Softmax are used in the output layer to handle
multi-class problems, whereas functions like ReLU or Leaky ReLU are used in hidden
layers.

Perceptron:

The Perceptron is one of the simplest artificial neural network architectures, introduced by
Frank Rosenblatt in 1957. It is primarily used for binary classification.
At that time, traditional methods like Statistical Machine Learning and Conventional Programming
were commonly used for predictions. Despite being one of the simplest forms of artificial neural
networks, the Perceptron model proved to be highly effective in solving specific classification
problems, laying the groundwork for advancements in AI and machine learning.

What is Perceptron?
A Perceptron is a type of neural network that performs binary classification: it maps input
features to an output decision, usually classifying data into one of two categories, such as 0 or 1.
A Perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by Warren
McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in the
development of more advanced neural networks and machine learning algorithms.

Types of Perceptron
1. Single-Layer Perceptron: this type of perceptron is limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into distinct categories
by a straight line. While powerful in its simplicity, it struggles with more complex
problems where the relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptron: this type possesses enhanced processing capabilities because it
consists of two or more layers, making it adept at handling more complex patterns and
relationships within the data.

Basic Components of Perceptron


A Perceptron is composed of key components that work together to process information and make
predictions.
 Input Features: The perceptron takes multiple input features, each representing a
characteristic of the input data.
 Weights: Each input feature is assigned a weight that determines its influence on the output.
These weights are adjusted during training to find the optimal values.
 Summation Function: The perceptron calculates the weighted sum of its inputs,
combining them with their respective weights.
 Activation Function: The weighted sum is passed through the Heaviside step function,
comparing it to a threshold to produce a binary output (0 or 1).
 Output: The final output is determined by the activation function, often used for binary
classification tasks.
 Bias: The bias term helps the perceptron make adjustments independent of the input,
improving its flexibility in learning.
 Learning Algorithm: The perceptron adjusts its weights and bias using a learning
algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.

How does Perceptron work?


A weight is assigned to each input node of a perceptron, indicating the importance of that input in
determining the output. The Perceptron’s output is calculated from a weighted sum of the inputs,
which is then passed through an activation function to decide whether the Perceptron will fire.
The weighted sum is computed as:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

The step function compares this weighted sum to a threshold. If the weighted sum is larger than the
threshold value, the output is 1; otherwise, it’s 0. The most common activation function used in
Perceptrons is the Heaviside step function:

$$h(z) = \begin{cases} 1 & \text{if } z > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.

In a fully connected layer, also known as a dense layer, every neuron in the layer is connected to
every neuron in the previous layer.
The output of the fully connected layer is computed as:

$$f_{W,b}(X) = h(XW + b)$$

where X is the input, W is the weight for each input neuron, b is the bias, and h is the step
function.
During training, the Perceptron’s weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:

$$w_{i,j} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i$$

Where:
• $w_{i,j}$ is the weight between the ith input and jth output neuron.
• $x_i$ is the ith input value.
• $y_j$ is the actual value, and $\hat{y}_j$ is the predicted value, for the jth output neuron.
• $\eta$ is the learning rate, controlling how much the weights are adjusted.
This process enables the perceptron to learn from data and improve its prediction accuracy over
time.
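
The learning rule can be sketched in a few lines of Python. This is an illustrative implementation rather than a reference one: it trains a single-output perceptron on the logical AND function, which is linearly separable. The threshold of 0, the learning rate of 1.0, the zero initial weights, and the epoch count are all assumptions chosen to keep the arithmetic simple.

```python
def step(z, threshold=0.0):
    """Heaviside step: 1 if z is larger than the threshold, otherwise 0."""
    return 1 if z > threshold else 0


def predict(x, weights, bias):
    return step(sum(xi * wi for xi, wi in zip(x, weights)) + bias)


def train_perceptron(samples, learning_rate=1.0, epochs=10):
    """Perceptron learning rule: w_i <- w_i + eta * (y - y_hat) * x_i."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(x, weights, bias)
            weights = [wi + learning_rate * error * xi for wi, xi in zip(weights, x)]
            bias += learning_rate * error  # the bias acts like a weight on a constant input of 1
    return weights, bias


# Truth table of logical AND: (inputs, target).
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print(weights, bias, [predict(x, weights, bias) for x, _ in and_samples])
```

Replacing the targets with those of XOR (0, 1, 1, 0) would leave at least one sample misclassified no matter how many epochs are run, because XOR is not linearly separable.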

Example: Perceptron in Action


Let’s take a simple example of classifying whether a given fruit is an apple or not based on two
inputs: its weight (in grams) and its color (on a scale of 0 to 1, where 1 means red). The perceptron
receives these inputs, multiplies them by their weights, adds a bias, and applies the activation
function to decide whether the fruit is an apple or not.
 Input 1 (Weight): 150 grams
 Input 2 (Color): 0.9 (since the fruit is mostly red)
 Weights: [0.5, 1.0]
 Bias: 1.5
The perceptron’s weighted sum would be:
(150∗0.5)+(0.9∗1.0)+1.5=77.4
Let’s assume the activation function uses a threshold of 75. Since 77.4 > 75, the perceptron
classifies the fruit as an apple (output = 1).
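
The fruit example can be recomputed directly from the listed inputs, weights, bias, and threshold. The function name `perceptron_output` is an illustrative choice.

```python
def perceptron_output(inputs, weights, bias, threshold):
    """Return the weighted sum (including bias) and the resulting 0/1 decision."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    decision = 1 if weighted_sum > threshold else 0
    return weighted_sum, decision


# Inputs: weight in grams and colour score; values taken from the example.
weighted_sum, is_apple = perceptron_output([150, 0.9], [0.5, 1.0], 1.5, 75)
print(round(weighted_sum, 1), is_apple)  # 77.4 1
```

Because the weight in grams is far larger than the colour score, it dominates the sum. This is a concrete instance of the perceptron's sensitivity to input scaling.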

Limitations of Perceptron
The Perceptron was a significant breakthrough in the development of neural networks, proving
that simple networks could learn to classify patterns. However, the Perceptron model has certain
limitations that can make it unsuitable for some tasks:
 Limited to linearly separable problems
 Struggles with convergence when handling non-separable data
 Requires labelled data for training
 Sensitive to input scaling
 Lacks hidden layers for complex decision-making

Example:
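Let x = 0.2, w = 0.5, b = 1 and y = 1
Note: In regular practice, the weights, i.e., the w values, are randomly initialized to start with.
Setting w to 0 hampers the progress of the network. Therefore, in this example as well, we assume
a randomly chosen value of 0.5 for w.
Starting with Forward Propagation

$$z = w \cdot x + b = 0.5 \times 0.2 + 1 = 1.1$$

Now, activation function (the sigmoid, which reproduces the prediction quoted below):

$$y_{pred} = \sigma(z) = \frac{1}{1 + e^{-1.1}} = 0.750260105$$

y = 1 while y_pred = 0.750260105. Now, we need to calculate the loss of this model in order to
fix the value of w so that we get a more accurate prediction. Again for the purpose of simplicity,
we use the square error or squared deviation to calculate the loss here.

$$L = (y - y_{pred})^2 = (1 - 0.750260105)^2 \approx 0.06237$$

Now, backward propagation

Here, we have to make use of some calculus to generate partial derivatives.

By simple chain rule, we get:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Let’s calculate each individual derivative:

$$\frac{\partial L}{\partial y_{pred}} = -2 (y - y_{pred}), \qquad \frac{\partial y_{pred}}{\partial z} = y_{pred} (1 - y_{pred}), \qquad \frac{\partial z}{\partial w} = x$$

After we calculate the derivative with respect to w, we need to fix the value of w using:

$$w_{new} = w - \eta \, \frac{\partial L}{\partial w}$$

where η is the learning rate.

With the new updated weight, if we do a forward propagation, we get a prediction that is closer to y.

As you can see, the error has decreased by 0.000035026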
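
The example above can be checked numerically. The sketch below rests on three assumptions: a sigmoid activation (which reproduces y_pred = 0.750260105), a loss of (y - y_pred)², and a learning rate of 0.1. The learning rate is not stated in the text; 0.1 is chosen here because it approximately reproduces the stated decrease in error.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Forward propagation for one neuron: sigmoid(w*x + b)."""
    return sigmoid(w * x + b)


def squared_error(y, y_pred):
    return (y - y_pred) ** 2


def gradient_wrt_w(x, w, b, y):
    """Chain rule: dL/dw = dL/dy_pred * dy_pred/dz * dz/dw."""
    y_pred = forward(x, w, b)
    dL_dy_pred = -2.0 * (y - y_pred)
    dy_pred_dz = y_pred * (1.0 - y_pred)  # derivative of the sigmoid
    dz_dw = x
    return dL_dy_pred * dy_pred_dz * dz_dw


x, w, b, y = 0.2, 0.5, 1.0, 1.0
learning_rate = 0.1  # assumed value, not given in the text

y_pred = forward(x, w, b)
loss_before = squared_error(y, y_pred)

w_new = w - learning_rate * gradient_wrt_w(x, w, b, y)
loss_after = squared_error(y, forward(x, w_new, b))

print(y_pred, loss_before, w_new, loss_before - loss_after)
```

Only `w` is updated here, as in the text; a full training step would update `b` in the same way using the derivative of z with respect to b, which is 1.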

Feedforward Neural Network

A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs). The network consists of an input layer, one or more hidden layers, and an output
layer. Information flows in one direction—from input to output—hence the name "feedforward."

Structure of a Feedforward Neural Network


1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron
in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output layers.
These layers are responsible for learning the complex patterns in the data. Each neuron in
a hidden layer applies a weighted sum of inputs followed by a non-linear activation
function.
3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or
the number of outputs in a regression problem.
Each connection between neurons in these layers has an associated weight that is adjusted during
the training process to minimize the error in predictions.

Training a Feedforward Neural Network


Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the network
to update the weights. The gradient of the loss function with respect to each weight is
calculated, and the weights are adjusted using gradient descent.

Forward propagation is part of the inference phase, as it does not involve adjusting weights but
simply computes the network’s predictions.
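
The three training steps can be seen in the simplest possible case: a single linear neuron with no hidden layer, trained with Mean Squared Error. This is a minimal sketch, not a full feedforward network. The data points (generated from y = 2x + 1), the learning rate, and the epoch count are illustrative assumptions.

```python
def mse(targets, predictions):
    """Mean Squared Error over a batch."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def train_linear_neuron(xs, ys, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # 1. Forward propagation: compute predictions with the current parameters.
        predictions = [w * x + b for x in xs]
        # 2. Loss calculation: the MSE depends on these prediction errors.
        errors = [p - y for p, y in zip(predictions, ys)]
        # 3. Backpropagation: gradients of the MSE, then a gradient descent step.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train_linear_neuron(xs, ys)
final_loss = mse(ys, [w * x + b for x in xs])
print(w, b, final_loss)
```

In a network with hidden layers the structure of the loop stays the same; step 3 simply applies the chain rule through each layer in turn.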

Backpropagation Algorithm:

Backpropagation is the standard algorithm for training artificial neural networks, particularly
feed-forward networks. It works iteratively, minimizing the cost function by adjusting weights and
biases.
In each epoch, the model adapts these parameters, reducing the loss by following the error gradient.
Backpropagation computes the gradient using the chain rule from calculus, working backwards through
the layers of the network. An optimization algorithm such as gradient descent or stochastic gradient
descent then uses that gradient to update the parameters and minimize the cost function.

Why is Backpropagation Important?


Backpropagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to each
weight using the chain rule, making it possible to update weights efficiently.
2. Scalability: The backpropagation algorithm scales well to networks with multiple layers
and complex architectures, making deep learning feasible.
3. Automated Learning: With backpropagation, the learning process becomes automated,
and the model can adjust itself to optimize its performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
How Does the Forward Pass Work?
In the forward pass, the input data is fed into the input layer. These inputs, combined with their
respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)), the output from
h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted
inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which returns
the input if it’s positive and zero otherwise. This adds non-linearity, allowing the model to learn
complex relationships in the data. Finally, the outputs from the last hidden layer are passed to the
output layer, where an activation function, such as softmax, converts the weighted outputs into
probabilities for classification.
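The forward pass described above can be sketched as follows, with two hidden layers using ReLU and a softmax output layer. The layer sizes, weights, biases and the helper name weighted_sums are assumptions for illustration only.

```python
import math

def relu(z):
    return max(0.0, z)

def softmax(scores):
    # Subtracting the largest score keeps exp() numerically stable
    # and does not change the resulting probabilities.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sums(inputs, weights, biases):
    # Weighted inputs plus bias, computed before the activation is applied.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical parameters: 2 inputs -> h1 (2 units) -> h2 (2 units) -> 3 classes.
x = [0.5, 1.0]
w_h1, b_h1 = [[1.0, 0.5], [-1.0, 2.0]], [0.0, 0.5]
w_h2, b_h2 = [[0.3, -0.2], [0.6, 0.1]], [0.1, -0.1]
w_out, b_out = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]

h1 = [relu(s) for s in weighted_sums(x, w_h1, b_h1)]    # output of h1 ...
h2 = [relu(s) for s in weighted_sums(h1, w_h2, b_h2)]   # ... is the input to h2
probabilities = softmax(weighted_sums(h2, w_out, b_out))
print(h1, h2, probabilities)
```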

How Does the Backward Pass Work?


In the backward pass, the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method for
error calculation is the Mean Squared Error (MSE), given by:
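$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\text{Predicted Output}_i - \text{Actual Output}_i)^2$$
where n is the number of outputs being averaged; for a single output this is simply the squared
difference between the predicted and the actual value.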
Once the error is calculated, the network adjusts weights using gradients, which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted
to minimize the error in the next iteration. The backward pass continues layer by layer, ensuring
that the network learns and improves its performance. The activation function, through its
derivative, plays a crucial role in computing these gradients during backpropagation.
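To make the role of the activation derivative concrete, here is a minimal sketch of the backward pass for a single sigmoid neuron with a squared-error loss. The input, target, weight and bias are made-up values, and the finite-difference check is added only to confirm that the chain-rule gradient is right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(weight, bias, x, target):
    prediction = sigmoid(weight * x + bias)
    return (prediction - target) ** 2

# Hypothetical single training example and parameters.
x, target = 1.5, 1.0
weight, bias = 0.2, -0.1

# Forward pass.
z = weight * x + bias
prediction = sigmoid(z)

# Backward pass: chain rule, one factor per step of the forward pass.
d_loss_d_pred = 2 * (prediction - target)     # derivative of the squared error
d_pred_d_z = prediction * (1 - prediction)    # derivative of the sigmoid
grad_weight = d_loss_d_pred * d_pred_d_z * x  # dz/dweight = x
grad_bias = d_loss_d_pred * d_pred_d_z        # dz/dbias = 1

# Numerical check with a central finite difference.
eps = 1e-6
numeric_grad_weight = (loss_for(weight + eps, bias, x, target)
                       - loss_for(weight - eps, bias, x, target)) / (2 * eps)
print(grad_weight, numeric_grad_weight)
```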

Example:

There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output
Layer. The w1, w2, w3, …, w8 represent the respective weights. b1 and b2 are the biases for the
Hidden Layer and the Output Layer, respectively.

Let’s get started with the forward pass.


For h1,

Now we pass this weighted sum through the logistic function (sigmoid function) so as to squash
the weighted sum into the range between 0 and 1. The logistic function is the activation function
for our example neural network.

Now, outputh1 and outputh2 will be considered as inputs to the next layer.
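Since the figure and the numeric values of this example are not reproduced in these notes, the sketch below uses made-up inputs, weights and biases, and it assumes one particular wiring of w1 to w8 onto the edges. It is meant only to show the order of the computations in the forward pass.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# All values below, and which weight sits on which edge, are illustrative assumptions.
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4   # input layer -> hidden layer
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8   # hidden layer -> output layer
b1, b2 = 0.25, 0.35                   # hidden-layer bias, output-layer bias

# Hidden layer: weighted sum, then the logistic function.
sum_h1 = i1 * w1 + i2 * w2 + b1
sum_h2 = i1 * w3 + i2 * w4 + b1
output_h1 = sigmoid(sum_h1)
output_h2 = sigmoid(sum_h2)

# Output layer: the hidden outputs are now the inputs.
sum_o1 = output_h1 * w5 + output_h2 * w6 + b2
sum_o2 = output_h1 * w7 + output_h2 * w8 + b2
output_o1 = sigmoid(sum_o1)
output_o2 = sigmoid(sum_o2)
print(output_h1, output_h2, output_o1, output_o2)
```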
Computing the total error
We started off supposing the expected outputs to be 0.05 and 0.95 respectively for outputo1
and outputo2. Now we will compute the errors based on the outputs received until now and the
expected outputs.
We’ll use the following error formula,

The Backpropagation
The aim of backpropagation (backward pass) is to distribute the total error back to the network so
as to update the weights in order to minimize the cost function (loss). The weights are updated in
such a way that when the next forward pass uses the updated weights, the total error is
reduced by a certain margin (until a minimum is reached).
For weights in the output layer (w5, w6, w7, w8)

For w5,
Let’s compute how much w5 contributes to E1. Once it is clear how w5 is updated, it is easy to
generalize the same procedure to the rest of the weights. If we look
closely at the example neural network, we can see that E1 is affected by outputo1, outputo1 is
affected by sumo1, and sumo1 is affected by w5. It’s time to recall the Chain Rule.
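Written out, the dependency just described (w5 affects sumo1, which affects outputo1, which affects E1) gives the chain below. This is a reconstruction from the sentence above rather than a copy of the original figure.

```latex
\[
\frac{\partial E_1}{\partial w_5}
  = \frac{\partial E_1}{\partial \text{output}_{o1}}
    \cdot \frac{\partial \text{output}_{o1}}{\partial \text{sum}_{o1}}
    \cdot \frac{\partial \text{sum}_{o1}}{\partial w_5}
\]
```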
Let’s deal with each component of the above chain separately.
Once we have computed all the new weights, we replace all the old weights with them. Once the
weights are updated, one backpropagation cycle is finished. The forward pass is then run again
and the new total error is computed, and based on this new total error the weights are updated
again. This goes on until the loss value converges to a minimum. In this way a neural network
starts with random values for its weights and gradually converges to values that minimize the loss.
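One update of w5 can be sketched as below. Only the expected output 0.05 comes from the text; the hidden outputs, weights, bias and learning rate are made-up values, the wiring of w5 and w6 into sumo1 is assumed, and the error is assumed to have the form 0.5 * (target - output)^2.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error_e1(w5, w6, b2, out_h1, out_h2, target_o1):
    """E1 for output neuron o1, assuming E1 = 0.5 * (target - output)^2."""
    output_o1 = sigmoid(out_h1 * w5 + out_h2 * w6 + b2)
    return 0.5 * (target_o1 - output_o1) ** 2

out_h1, out_h2 = 0.59, 0.61   # hypothetical hidden-layer outputs
w5, w6, b2 = 0.5, 0.6, 0.35   # hypothetical weights and output-layer bias
target_o1 = 0.05              # expected output for o1, as in the text
learning_rate = 0.6           # hypothetical

# Forward pass for o1.
sum_o1 = out_h1 * w5 + out_h2 * w6 + b2
output_o1 = sigmoid(sum_o1)

# The three factors of the chain rule for w5.
dE1_doutput = output_o1 - target_o1           # from 0.5 * (target - output)^2
doutput_dsum = output_o1 * (1 - output_o1)    # derivative of the logistic function
dsum_dw5 = out_h1                             # sum_o1 is linear in w5
dE1_dw5 = dE1_doutput * doutput_dsum * dsum_dw5

# Gradient descent update for w5.
new_w5 = w5 - learning_rate * dE1_dw5

error_before = error_e1(w5, w6, b2, out_h1, out_h2, target_o1)
error_after = error_e1(new_w5, w6, b2, out_h1, out_h2, target_o1)
print(dE1_dw5, new_w5, error_before, error_after)
```

The same three factors, with the appropriate hidden output in the last one, give the gradients for w6, w7 and w8.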

Forward propagation vs backward propagation in neural network


Below is the table for a clear difference between forward and backward propagation in the neural
network.
| Aspect | Forward Propagation | Backward Propagation |
|---|---|---|
| Purpose | Compute the output of the neural network given inputs | Adjust the weights of the network to minimise error |
| Direction | Forward, from input to output | Backwards, from output to input |
| Calculation | Computes the output using current weights and biases | Updates weights and biases using calculated gradients |
| Information flow | Input data -> Output data | Error signal -> Gradient updates |
| Steps | 1. Input data is fed into the network. 2. Data is processed through hidden layers. 3. Output is generated. | 1. Error is calculated using a loss function. 2. Gradients of the loss function are calculated. 3. Weights and biases are updated using gradients. |
| Used in | Prediction and inference | Training the neural network |

Gradient Descent:
Consider the following neural network,

Equation 1: $y = w_1 X_1 + w_2 X_2 + w_3 X_3 + b$

Here w1, w2, w3 are the weights of their corresponding features X1, X2, X3 and b is a constant
called the bias, which gives the model extra flexibility. Using such an equation, the machine
tries to predict a value y that we need, such as the price of a house. The
machine then tries to improve its prediction by tweaking these weights. It does so by comparing the
predicted value y with the actual value of the example in our training set and using a function of
their difference. This function is called a loss function.

Equation 2
The machine tries to decrease this loss function (the error), i.e., it tries to bring the predicted
value close to the actual value.
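As a small illustration of a prediction and its loss, the sketch below assumes a squared-error loss for a single example (the exact form of Equation 2 is not reproduced in these notes). The feature values, weights, bias and actual price are made up.

```python
def predict(features, weights, bias):
    # Weighted sum of the features plus the bias.
    return sum(w * x for w, x in zip(weights, features)) + bias

def squared_error(predicted, actual):
    # Assumed loss: squared difference between prediction and actual value.
    return (predicted - actual) ** 2

features = [3.0, 2.0, 1.0]   # hypothetical X1, X2, X3
weights = [0.5, 1.0, -2.0]   # hypothetical w1, w2, w3
bias = 4.0                   # hypothetical b
actual_price = 6.0           # hypothetical value from the training set

predicted_price = predict(features, weights, bias)
loss = squared_error(predicted_price, actual_price)
print(predicted_price, loss)
```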

Gradient Descent
This method is the key to minimizing the loss function and achieving our target, which is to predict
close to the original value.

Gradient descent for MSE:

The diagram below shows the graph of the loss function. It has a parabolic (convex) shape with a
single global minimum, which we need to find in order to obtain the minimum value of the loss.
Where possible, prefer a loss function that is convex so that a proper minimum exists. The
predicted results depend on the weights through equation 1. If we substitute equation 1 into
equation 2, we obtain this graph, with the weights on the X-axis and the loss on the Y-axis.

Initially, the model assigns random weights to the features. Say it initializes the weight to a.
This generates a loss that is far from the minimum point L-min.
If we move the weight towards the positive x-axis, we can reduce the loss and reach the minimum
value. But how will the machine know this? To minimize the error by adjusting the weight, we
need to check how the error varies with the weight. To do this
we find the derivative of the error with respect to the weight. This derivative is called the
Gradient.

Gradient = dE/dw

Where E is the error and w is the weight.


Let’s see how this works. If the loss increases with an increase in weight, the gradient is
positive. This is the situation at point C. If the loss
decreases with an increase in weight, the gradient is negative. Point A
corresponds to this situation. From point A we need to move towards the positive x-axis, and
the gradient is negative. From point C we need to move towards the negative x-axis, but the gradient
is positive. So the negative of the gradient always shows the direction in which the
weights should be moved in order to reduce the loss. In this way the gradient tells
the model whether to increase or decrease the weights.
Once the model knows which way to move, it needs to decide by how much to move
the weights. This is decided by a parameter called the Learning Rate, denoted by eta. In the
diagram, the weights are moved from point A to point B, which are a distance dx apart.
dx = eta * dE/dw
So the distance to move is the product of the learning rate eta and the magnitude of the change
in error with a change in weight at that point.
We need to choose the learning rate carefully. If it is very large, the weights change by a
large amount and can overstep the optimal value. If it is very small, the model
takes tiny steps and needs many of them to reach the optimum. The weights are updated according to
the following formula.
w = w - alpha * dE/dw
where the w on the right-hand side is the previous weight and alpha is the learning rate (the same
quantity denoted by eta above).
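The effect of the learning rate can be seen on a simple convex loss. The sketch below assumes a made-up parabola E(w) = (w - 3)^2 with its minimum at w = 3, and applies the update rule above with three hypothetical learning rates.

```python
def loss(w):
    # Hypothetical convex loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # dE/dw for the loss above.
    return 2.0 * (w - 3.0)

def run_gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)   # w = w - alpha * dE/dw
    return w

w_good = run_gradient_descent(0.0, 0.1, 100)      # reasonable learning rate
w_slow = run_gradient_descent(0.0, 0.001, 100)    # too small: still far from 3
w_diverged = run_gradient_descent(0.0, 1.1, 100)  # too large: oversteps further each time
print(w_good, w_slow, w_diverged)
```

With the largest rate every step overshoots the minimum by more than the previous distance, so the weight moves away from the optimum instead of towards it.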
With each epoch, the model moves the weights according to the gradient to find the best weights.
So far this is loss optimization for a single example in our training dataset. Our dataset
contains thousands of such examples, so computing the gradient over all of them for every update
takes a long time. In practice, updating the weights using only one sample (or a small batch of
samples) at a time often works well enough for the whole dataset. Depending on how many samples
are used per update, we get different types of gradient descent: batch, stochastic and mini-batch.
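The difference between using the whole dataset and using one sample per update can be sketched as follows. The toy data (following y = 2x), the learning rate and the number of steps are assumptions, and the random seed is fixed only to make the run repeatable.

```python
import random

# Toy data following y = 2x (an assumption, so the ideal weight is 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def batch_step(w, learning_rate):
    # Batch gradient descent: average the gradient over every example.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - learning_rate * grad

def stochastic_step(w, learning_rate, rng):
    # Stochastic gradient descent: use the gradient of one random example.
    i = rng.randrange(len(xs))
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]
    return w - learning_rate * grad

rng = random.Random(0)
w_batch = 0.0
w_sgd = 0.0
for _ in range(500):
    w_batch = batch_step(w_batch, 0.01)
    w_sgd = stochastic_step(w_sgd, 0.01, rng)
print(w_batch, w_sgd)
```

Each stochastic step is cheaper because it looks at one example only. On real, noisy data its path towards the minimum is less smooth than the batch version, which is why mini-batches are a common compromise.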

The Maths
Using equations 1 and 2 above, we get

This equation shows the change in error with a change in the output prediction for E = MSE.
Now,

So, this is pretty clear from basic maths. Here ‘i’ can be any integer from 0 to the number of
features-1.
According to the problem, we need to find dE/dwi, i.e., the change in error with a change in
the weights.
Now, from chain rule, we can tell the following,

So, we know both the values from the above equations.


This is the final change in Error with the weights. Now, let’s look for updating the new weights.

These are the new weight values updated.


We also need to update the bias value. It is done in a similar manner.
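Because the equations of this derivation are not reproduced in these notes, the steps are sketched below under stated assumptions: y is the prediction from equation 1, t is a symbol introduced here for the actual value, and the error for a single example is taken to be E = (y - t)^2. A different constant in front of the error (for example 1/2, or 1/n for the mean over n examples) only rescales the gradient.

```latex
\begin{align*}
y &= \sum_{i} w_i X_i + b, \qquad E = (y - t)^2 \\
\frac{\partial E}{\partial y} &= 2\,(y - t) \\
\frac{\partial y}{\partial w_i} &= X_i \\
\frac{\partial E}{\partial w_i}
  &= \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial w_i}
   = 2\,(y - t)\,X_i \\
w_i &\leftarrow w_i - \alpha\,\frac{\partial E}{\partial w_i}, \qquad
b \leftarrow b - \alpha\,\frac{\partial E}{\partial b} = b - 2\alpha\,(y - t)
\end{align*}
```

The bias update has the same form as the weight update because the derivative of y with respect to b is 1.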
Backpropagation

In other words, in a problem like this the two classes can easily be separated by drawing a
straight line, which we can easily devise using equation 1.
Now, imagine doing so, for the following graph.

Will it be possible to classify the points using a normal linear model? The answer is no. In
practice most problems are non-linear, and it is hard to devise by hand the
feature crosses needed to separate such classes. This is where neural networks
are used. Neural networks are capable of coming up with a non-linear equation that is fit to serve
as a boundary between non-linear classes.
Neural networks achieve such non-linear equations through activation functions.
These activation functions are the source of non-linearity. They are used at every
layer in a neural network. In a neural network we stack such layers one after another, so
we obtain complex functions by cascading simpler ones, like f(f(x)).
The common types of activation function are:
1. ReLU: $f(x) = \max(0, x)$
2. Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
3. Tanh: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
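The three activation functions listed above can be written directly in Python using the standard math module. The sample inputs are arbitrary.

```python
import math

def relu(x):
    # Returns x when it is positive and 0 otherwise.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into the range between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into the range between -1 and 1.
    return math.tanh(x)

for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), sigmoid(value), tanh(value))
```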
Why is Backpropagation Required?
The minimum of the loss function of a neural network is not easy to locate, because it is not
a simple function like the one we saw for MSE.
As we see in the graph, the loss function may look something like this. It has
two minima, a local one and a global one. If we end up in the local one, we are left in
a suboptimal state. So the point where the weights are initialized matters. For example, if
the weights are initialized somewhere near x1, there is a high chance we will get stuck at the
local minimum, which does not happen with the convex MSE loss seen earlier.
Secondly, Neural networks are of different structures.

In a neural network the output Y depends on the weights of all the edges. The error
is obtained at the last output node, and we then need to change even early weights such as w12
and w13 accordingly. So we need to backpropagate the error all the way from the output node to
the input node.
So, let's say,

Wij is the weight of the edge from the output of the ith node to the input of the jth node. Here
X is the input to a node and Y is the output from that node. Except for the input node, for all
nodes,
Y=F(X)
where F is the activation function.
For Input node, Y=X

The hidden layer nodes use an activation function F1, while the output layer uses F2. F1
is usually ReLU and F2 is usually a Sigmoid.
To optimize the weights, we need to know dE/dWij for every Wij in the network.
For this, we also need to find dE/dXi and dE/dYi for every node in the network.
Forward Propagation
A neural network is said to use Forward Propagation because the input to a node
in layer k depends on the outputs of the nodes in layer k-1.
For example,

So, generalize,

We can use these formulas.


We will try to trace a few nodes.

Here, we can trace the paths or routes of the inputs and outputs of every node very clearly. Now,
we will see the cascading functions building.

In Y4 and Y5 we can see the cascading of the non-linear activation function that creates the
classifier equation. The more layers we stack, the more cascading occurs and the more complex
our classifier function becomes.
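The cascading can be traced in code. The figure is not reproduced in these notes, so the sketch below assumes a network with input node 1, hidden nodes 2 and 3, hidden nodes 4 and 5, and output node 6, which is consistent with the weights W12, W13 and W56 mentioned in the text. All weight values are made up, and biases are left out because this section does not mention them.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# W[(i, j)] is the weight of the edge from node i to node j (hypothetical values).
W = {(1, 2): 0.5, (1, 3): 0.3,
     (2, 4): 0.8, (3, 4): 0.2, (2, 5): 0.4, (3, 5): 0.9,
     (4, 6): 1.2, (5, 6): -0.7}
layers = [[1], [2, 3], [4, 5], [6]]   # assumed topology

def forward(x_input):
    X = {1: x_input}
    Y = {1: x_input}                  # for the input node, Y = X
    for k in range(1, len(layers)):
        # F1 (ReLU) in the hidden layers, F2 (sigmoid) in the output layer.
        F = sigmoid if k == len(layers) - 1 else relu
        for j in layers[k]:
            # The input to node j depends on the outputs of layer k-1.
            X[j] = sum(Y[i] * W[(i, j)] for i in layers[k - 1])
            Y[j] = F(X[j])
    return X, Y

X, Y = forward(2.0)
print(X, Y)
```

Here Y[4] is F1 applied to a combination of Y[2] and Y[3], which are themselves F1 applied to the input, so the cascading of activation functions is visible in the order of the loop.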
Maths for Backpropagation
The error generated at the output is backpropagated through the same nodes and the same edges
used in forward propagation, travelling from the output node back to the input edges.
The first step is the output node,

This is the derivative of the error with respect to the Y output at the final node.

From chain rule.


Now considering F2 as sigmoid function,

For the sigmoid, the derivative has the property f'(x) = f(x)(1 - f(x)).


Now, we need to find dE/dW56

So, here also we use the chain rule. We obtain the values:
We will try this for two more layers and try to generalize a formula.
We try to calculate dE/ dY5 so that we could move to the next level.

We obtain both dE/dY5 and dE/dY4. Now we go for the change in error for a change in input for
node 5 and node 4.

Once we obtain the change with the input we can easily calculate the change in error with the
change in weights of the edges incident on that input using the same method we used for W56. Here
I am directly writing the result.
These are the changes of error with a change in the weights of edges. Now, calculate for Y2 and
Y3. But, one thing to notice is, when we are going to calculate the change in error with a change
in Y2 and Y3 from backpropagation, they will be affected by both the edges from Y5 and Y4.
So, the change will be a sum of the effect of change in node 4 and node 5.
We can calculate the effects in a similar way we calculated dE/dY5

So, we can generalize it to:

where the ith node is in the Lth layer and the jth node is at the (L+1)th layer.
Once we have found the change in error with a change in weight for all the edges, we can update
the weights and start learning for the next epoch using the formula.

where alpha is the learning rate. This is how the backpropagation algorithm actually works.
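The whole procedure can be put together in a short sketch. It assumes the same topology as the text's W12, W13 and W56 suggest (input node 1, hidden nodes 2 to 5, output node 6), ReLU in the hidden layers, a sigmoid at the output, no biases, and an error of the form E = 0.5 * (Y6 - target)^2. The weights, the input, the target and alpha are made-up values.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

layers = [[1], [2, 3], [4, 5], [6]]   # assumed topology

def forward(W, x_input):
    X, Y = {1: x_input}, {1: x_input}
    for k in range(1, len(layers)):
        F = sigmoid if k == len(layers) - 1 else relu
        for j in layers[k]:
            X[j] = sum(Y[i] * W[(i, j)] for i in layers[k - 1])
            Y[j] = F(X[j])
    return X, Y

def error(W, x_input, target):
    _, Y = forward(W, x_input)
    return 0.5 * (Y[6] - target) ** 2   # assumed error function

def backward(W, x_input, target):
    X, Y = forward(W, x_input)
    dE_dY = {6: Y[6] - target}
    dE_dX = {6: dE_dY[6] * Y[6] * (1.0 - Y[6])}   # sigmoid derivative at the output
    dE_dW = {}
    for k in range(len(layers) - 2, -1, -1):      # walk back from the last hidden layer
        for i in layers[k]:
            for j in layers[k + 1]:
                dE_dW[(i, j)] = dE_dX[j] * Y[i]   # gradient for the edge i -> j
            # A node's output affects the error through every edge leaving it.
            dE_dY[i] = sum(dE_dX[j] * W[(i, j)] for j in layers[k + 1])
            dE_dX[i] = dE_dY[i] * (relu_derivative(X[i]) if k > 0 else 1.0)
    return dE_dW

W = {(1, 2): 0.5, (1, 3): 0.3, (2, 4): 0.8, (3, 4): 0.2,
     (2, 5): 0.4, (3, 5): 0.9, (4, 6): 1.2, (5, 6): -0.7}
x_input, target, alpha = 2.0, 1.0, 0.1

gradients = backward(W, x_input, target)
new_W = {edge: w - alpha * gradients[edge] for edge, w in W.items()}
print(error(W, x_input, target), error(new_W, x_input, target))
```

The sum inside the loop is the step described above for Y2 and Y3: the change in error with respect to a node's output adds up the effects coming back from nodes 4 and 5.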

https://www.kaggle.com/code/andls555/explain-forward-propagation

Difference between SLP and MLP:

| Aspect | Single-Layer Perceptron (SLP) | Multi-Layer Perceptron (MLP) |
|---|---|---|
| Number of Layers | 1 (Input + Output layer) | 3 or more (Input + Hidden layers + Output layer) |
| Decision Boundaries | Can only model linear decision boundaries | Can model non-linear decision boundaries |
| Activation Function | Step function or linear activation | Non-linear activation functions (e.g., ReLU, Sigmoid, Tanh) |
| Learning Capacity | Limited to linear separability | Can learn complex, non-linear relationships |
| Training Algorithm | Simple perceptron learning rule or gradient descent | Backpropagation and gradient descent |
| Use Cases | Binary classification (linear problems) | Complex classification, regression, and pattern recognition tasks |
| Model Complexity | Simple and fast to train, limited in function approximation | More complex, capable of approximating complex functions |
| Expressive Power | Limited to linearly separable problems | Highly expressive, can approximate any continuous function (Universal Approximation Theorem) |
| Overfitting Risk | Low, but struggles with non-linear data | Higher risk of overfitting (needs regularization and careful tuning) |
