Unit III
This article explores Artificial Neural Networks (ANN) in machine learning, focusing on how
CNNs and RNNs process unstructured data like images, text, and speech. You’ll learn about neural
networks in AI, their types, and their role in machine learning. Researchers use Artificial Neural
Network (ANN) algorithms, inspired by how the brain functions, to model complicated patterns and
solve prediction problems. Neurons within interconnected units collaborate to identify patterns,
learn from data, and generate predictions. Artificial neural networks (ANNs) are commonly
employed in tasks such as image recognition, language processing, and decision-making.
Artificial Neural Networks contain artificial neurons which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in a
system. A layer can have as few as a dozen units or as many as millions, depending on how complex
the network must be to learn the hidden patterns in the dataset. Commonly, an Artificial
Neural Network has an input layer, an output layer, and one or more hidden layers. The input layer
receives data from the outside world which the neural network needs to analyze or learn about.
Then this data passes through one or multiple hidden layers that transform the input into data that
is valuable for the output layer. Finally, the output layer provides an output in the form of a
response of the Artificial Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to another. Each of
these connections has weights that determine the influence of one unit on another unit. As the data
transfers from one unit to another, the neural network learns more and more about the data which
eventually results in an output from the output layer.
The structures and operations of human neurons serve as the basis for artificial neural networks. It
is also known as neural networks or neural nets. The input layer of an artificial neural network is
the first layer, and it receives input from external sources and releases it to the hidden layer, which
is the second layer. In the hidden layer, each neuron receives input from the previous layer neurons,
computes the weighted sum, and sends it to the neurons in the next layer. These connections are
weighted, meaning the effect of each input from the previous layer is scaled up or down by the
weight assigned to it; these weights are adjusted during the training process to improve model
performance.
The “backpropagation” algorithm is the method by which neural networks learn: it turns an ANN
into a learning system by learning from its mistakes. The optimization approach uses a “gradient
descent” technique to reduce the prediction error. This technique is a cornerstone of supervised
learning, as it iteratively adjusts weights to minimize errors. To find the optimum values for the
weights, we make small adjustments to the weights and examine their impact on the prediction error.
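As a rough sketch of this idea of probing how a small weight change affects the error, the Python snippet below uses a made-up one-input linear neuron and a single made-up training example; none of these values come from the text above.

```python
# Minimal sketch: probe how a small change in one weight affects the error.
# The one-input linear neuron, the data point, and the step size are
# illustrative assumptions, not part of the original text.

def predict(w, x, b=0.0):
    return w * x + b

def squared_error(w, x, target):
    return (predict(w, x) - target) ** 2

x, target = 2.0, 2.0      # one training example (assumed values)
w = 0.5                   # current weight
eps = 1e-4                # small adjustment to the weight

# Estimate the gradient numerically: change in error per change in weight.
grad = (squared_error(w + eps, x, target) - squared_error(w - eps, x, target)) / (2 * eps)

# Move the weight a small step against the gradient (gradient descent).
learning_rate = 0.1
w_new = w - learning_rate * grad

print(f"error before: {squared_error(w, x, target):.4f}")   # 1.0000
print(f"error after : {squared_error(w_new, x, target):.4f}")  # smaller
```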
An artificial neuron or neural node is a mathematical model. In most cases, it computes the
weighted average of its input and then applies a bias to it. Post that, it passes this resultant term
through an activation function. This activation function is a nonlinear function such as the sigmoid
function that accepts a linear input and gives a nonlinear output.
1. Inputs:
Neurons receive one or more inputs, which are typically the features of the data fed
into the network (e.g., pixel values of an image, words in text, etc.).
Each input has an associated weight, which determines its importance in the
computation. Weights are learned during training.
2. Summation (Weighted Sum):
The neuron multiplies each input by its weight and adds the results together, giving
the weighted sum: (w1 × x1) + (w2 × x2) + … + (wn × xn).
3. Bias:
A bias is added to the weighted sum. This is a constant value that helps the neuron
make adjustments to the output, ensuring that it can represent more complex
functions.
The equation becomes: (w1 × x1) + (w2 × x2) + … + (wn × xn) + b.
4. Activation Function:
The result of the weighted sum plus the bias is passed through an activation
function. This function determines whether the neuron should be activated (i.e.,
produce an output) and introduces non-linearity into the network, enabling it to
learn complex patterns.
Common activation functions include the sigmoid, tanh, and ReLU functions (described
later in this unit).
5. Output:
The final output of the neuron is the result of the activation function applied to the
total input (weighted sum + bias).
This output can then be passed to the next layer of neurons (in a multi-layer
network) or used as the final prediction (in simpler networks).
Example:
Suppose a neuron receives the inputs 0.5, 1.0 and 0.2 with weights 0.3, 0.7 and 0.9 respectively,
and a bias of 0.1.
Total Input = (0.5×0.3) + (1.0×0.7) + (0.2×0.9) + 0.1 = 0.15 + 0.7 + 0.18 + 0.1 = 1.13
This total input (1.13) is then passed through the activation function, and the result is passed
on to the next layer or used as the final result.
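A minimal Python sketch of the worked example above; it assumes the first number in each product is the input and the second is its weight (the text does not label them), and uses the sigmoid as one possible activation.

```python
import math

# Single artificial neuron following the worked example.
# Assumption: inputs are [0.5, 1.0, 0.2] and weights are [0.3, 0.7, 0.9];
# the original text does not label which factor is which.
inputs  = [0.5, 1.0, 0.2]
weights = [0.3, 0.7, 0.9]
bias = 0.1

# Weighted sum plus bias
total_input = sum(x * w for x, w in zip(inputs, weights)) + bias

# Sigmoid activation (one common choice of activation function)
output = 1 / (1 + math.exp(-total_input))

print(total_input)        # 1.13
print(round(output, 3))   # ~0.756
```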
Weights:
Weights are numerical values associated with the connections between neurons. They determine
the strength of these connections and, in turn, the influence that one neuron's output has on another
neuron's input. Think of weights as the coefficients that adjust the impact of incoming data. They
can increase or decrease the importance of specific information.
During the training phase of a neural network, these weights are adjusted iteratively to minimize
the difference between the network's predictions and the actual outcomes. This process is akin to
fine-tuning the network's ability to make accurate predictions.
Let's consider a practical example to illustrate the role of weights. Suppose you're building a neural
network to recognize handwritten digits. Each pixel in an image of a digit can be considered an
input to the network. The weights associated with each pixel determine how much importance the
network places on that pixel when making a decision about which digit is represented in the image.
As the network learns from a dataset of labelled digits, it adjusts these weights to give more
significance to pixels that are highly correlated with the correct digit and less significance to pixels
that are less relevant. Over time, the network learns to recognize patterns in the data and make
accurate predictions.
In essence, weights are the neural network's way of learning from data. They capture the
relationships between input features and the target output, allowing the network to generalize and
make predictions on new, unseen data.
Bias:
The total number of biases = Total number of hidden neurons + the number of output neurons
In simple words, neural network bias can be defined as the constant which is added to the product
of features and weights. It is used to offset the result. It helps the models to shift the activation
function towards the positive or negative side.
On replacing the variable ‘x’ in the sigmoid activation function with the equation of a line, we get:
y = 1 / (1 + e^-(wx + b))
In the above equation, ‘w’ is the weight, ‘x’ is the feature vector, and ‘b’ is the bias. On
substituting the value of ‘b’ equal to 0, we get the graph of the above equation as shown in the
figure below:
If we vary the values of the weight ‘w’, keeping bias ‘b’=0, we will get the following graph:
While changing the values of ‘w’, there is no way we can shift the origin of the activation function,
i.e., the sigmoid function. On changing the values of ‘w’, only the steepness of the curve will
change.
In the absence of bias, the model can only learn functions that pass through the origin, which does
not match most real-world scenarios. With the introduction of bias, the model becomes more
flexible.
Example: Imagine a neuron that processes the brightness of an image pixel. Without a bias, this
neuron might only activate when the pixel's brightness is exactly at a certain threshold. However,
by introducing a bias, you allow the neuron to activate even when the brightness is slightly below
or above the threshold.
This flexibility is crucial because real-world data is rarely perfectly aligned with specific
thresholds. Biases enable neurons to activate in response to various input conditions, making
neural networks more robust and capable of handling complex patterns.
There is only one way to shift the origin and that is to include bias ‘b’. On keeping the value of
weight ‘w’ fixed and varying the value of bias ‘b’, we will get the graph below:
From the graph, it can be concluded that the bias is required for shifting the origin of the curve to
the left or right.
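A small illustrative snippet (values chosen arbitrarily, not taken from the text) showing that, with the weight fixed, changing the bias shifts the point where the sigmoid crosses 0.5, i.e. it moves the curve left or right rather than changing its steepness.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# With w fixed, sigmoid(w*x + b) crosses 0.5 at x = -b/w, so varying the
# bias b slides the curve along the x-axis.
w = 1.0
for b in (-2.0, 0.0, 2.0):
    crossing = -b / w
    print(f"b = {b:+.1f}: output at x=0 is {sigmoid(w * 0 + b):.3f}, "
          f"curve crosses 0.5 at x = {crossing:+.1f}")
```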
If we substitute the value of ‘w’ = -1 and ‘b’ = 0 in the equation above, the graph of the activation
function will look as shown below.
It can be seen that the output is close to 1 for values of ‘x’ less than 0 and close to 0 for the
values of ‘x’ greater than 0.
Now, how should we design our equation so that the output of the activation function is close to 1
for all values of ‘x’ less than 5? We achieve this by introducing the bias term ‘b’ with a value
that shifts the transition point of the curve from x = 0 to x = 5.
To understand the concept of neural network bias, let’s begin by discussing single layer neural
networks.
A given neural network computes the function Y=f(X), where X and Y are feature vector and
output vector, respectively, with independent components. If the given neural network has weight
‘W’ then it can also be represented as Y=f(X,W). If the dimensionality of both X and Y is equal
to 1, the function can be plotted in a two-dimensional plane as below:
Such a neural network can approximate any linear function of the form y=mx + c. When c=0, then
y=f(x)=mx and the neural network can approximate only the functions passing through origin. If
a function includes the constant term c, it can approximate any of the linear functions in the plane.
Consider the conditions mentioned in the previous section for the neural network. From there, we
can infer that if there is any error during the prediction by the function, bias ‘b’ can be added to
the output values for obtaining the true values. Thus, the neural network would compute the
function y=f(x) + b which includes all the predictions by the neural network shifted by the constant
‘b’.
Now, if we add one more input to the neural network layer, the function is defined as y=f(x1,x2).
Since both x1 and x2 are independent components, they should have independent biases b1 and
b2, respectively. Thus, the neural networks can be represented as y=f(x1, x2) + b1 + b2.
1. Social Media:
Artificial Neural Networks are used heavily in Social Media. For example, let’s take
the ‘People you may know’ feature on Facebook that suggests people that you might know
in real life so that you can send them friend requests. Well, this magical effect is achieved
by using Artificial Neural Networks that analyze your profile, your interests, your current
friends, and also their friends and various other factors to calculate the people you might
potentially know. Another common application of Machine Learning in social media
is facial recognition. This is done by finding around 100 reference points on the person’s
face and then matching them with those already available in the database using
convolutional neural networks.
3. Healthcare:
Artificial Neural Networks are used in Oncology to train algorithms that can identify
cancerous tissue at the microscopic level at the same accuracy as trained physicians.
Various rare diseases may manifest in physical characteristics and can be identified in their
premature stages by using Facial Analysis on the patient photos. So the full-scale
implementation of Artificial Neural Networks in the healthcare environment can only
enhance the diagnostic abilities of medical experts and ultimately lead to the overall
improvement in the quality of medical care all over the world.
4. Personal Assistants:
You have probably heard of Siri, Alexa, Cortana, etc., depending on the phone you use. These
personal assistants are an example of speech recognition that uses Natural Language Processing
to interact with the users and formulate a response accordingly. Natural Language Processing
relies on artificial neural networks to handle many of the tasks of these personal assistants, such
as managing language syntax, semantics, correct speech, and the ongoing conversation.
ANNs offer many key benefits that make them particularly well-suited to specific problems and
situations:
ANNs can learn and model non-linear and complicated interactions, which is critical since
many of the relationships between inputs and outputs in real life are non-linear and
complex.
Artificial Neural Networks in machine learning can generalize: after learning from the
original inputs and their associations, the model can infer relationships in unseen
data, allowing it to generalize and make predictions on new data.
ANN does not impose any constraints on the input variables, unlike many other prediction
approaches (like how they should be distributed). Furthermore, numerous studies have
demonstrated that ANN algorithms can better simulate heteroscedasticity, or data with high
volatility and non-constant variance, because of their capacity to discover latent
correlations in the data without imposing any fixed associations. This is particularly
helpful in financial time series forecasting (for example, stock prices), where data
volatility is significant.
While building a neural network, one key decision is selecting the Activation Function for both
the hidden layer and the output layer. Activation functions decide whether a neuron should be
activated.
Neural networks consist of neurons that operate using weights, biases, and activation functions.
In the learning process, these weights and biases are updated based on the error produced at the
output; this process is known as backpropagation. Activation functions enable backpropagation
by providing the gradients that are essential for updating the weights and biases.
Without non-linearity, even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions empower neural networks to model highly complex data
distributions and solve advanced deep learning tasks. Adding non-linear activation functions
introduces flexibility and enables the network to learn more complex and abstract patterns from data.
The Linear Activation Function resembles a straight line, defined by y = x. No matter how many
layers the neural network contains, if they all use linear activation functions, the output is a
linear combination of the input.
The range of the output spans from −∞ to +∞.
The linear activation function is typically used in just one place, i.e. the output layer.
Using linear activation across all layers limits the network’s ability to learn complex
patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
The Linear Activation Function, or Identity Function, returns the input as the output.
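The short sketch below, using made-up random weights, illustrates the point above: stacking two layers with the identity (linear) activation is equivalent to a single linear layer.

```python
import numpy as np

# Illustrative sketch (weights are random, not from the text): two stacked
# layers with a linear activation collapse into one linear layer, which is
# why purely linear networks cannot model complex patterns.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
x = rng.normal(size=3)

two_layers = (x @ W1) @ W2        # layer 1 then layer 2, no non-linearity
one_layer  = x @ (W1 @ W2)        # a single combined linear layer

print(np.allclose(two_layers, one_layer))  # True
```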
1. Sigmoid Function
Sigmoid Activation Function is characterized by ‘S’ shape. It is mathematically defined as
A = 1 / (1 + e^(-x))
This formula ensures a smooth and continuous output that is essential for gradient-based
optimization methods.
It allows neural networks to handle and model complex patterns that linear equations
cannot.
The output ranges between 0 and 1, hence useful for binary classification.
The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity
means that small changes in input x can cause significant changes in output y, which is
critical during the training process.
1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms raw
output scores from a neural network into probabilities. It works by squashing the output values of
each class into the range of 0 to 1, while ensuring that the sum of all probabilities equals 1.
Softmax is a non-linear activation function.
The Softmax function ensures that each class is assigned a probability, helping to identify
which class the input belongs to.
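A minimal sketch of the Softmax computation; the raw class scores are made-up example values, and the maximum score is subtracted only for numerical stability.

```python
import numpy as np

# Softmax: squashes raw class scores into probabilities that sum to 1.
def softmax(scores):
    shifted = scores - np.max(scores)   # subtract max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

raw_scores = np.array([2.0, 1.0, 0.1])  # one score per class (example values)
probs = softmax(raw_scores)

print(probs)         # e.g. [0.659 0.242 0.099]
print(probs.sum())   # 1.0 – probabilities over all classes sum to one
```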
2. SoftPlus Function
Softplus function is defined mathematically as:
A(x) = log(1 + e^x)
This equation ensures that the output is always positive and differentiable at all points, which is an
advantage over the traditional ReLU function.
Nature: The Softplus function is non-linear.
Range: The function outputs values in the range (0,∞), similar to ReLU, but without the
hard zero threshold that ReLU has.
Smoothness: Softplus is a smooth, continuous function, meaning it avoids the sharp
discontinuities of ReLU, which can sometimes lead to problems during optimization.
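A small comparison sketch (input values chosen arbitrarily) of Softplus against ReLU near zero, showing that Softplus stays smooth and strictly positive while ReLU has a hard corner at 0.

```python
import numpy as np

# Compare Softplus with ReLU on a few sample inputs around zero.
def softplus(x):
    return np.log1p(np.exp(x))     # log(1 + e^x)

def relu(x):
    return np.maximum(0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:+.1f}  relu = {relu(x):.3f}  softplus = {softplus(x):.3f}")
```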
Perceptron:
The Perceptron is one of the simplest artificial neural network architectures, introduced by
Frank Rosenblatt in 1957. It is primarily used for binary classification.
At that time, traditional methods like Statistical Machine Learning and Conventional Programming
were commonly used for predictions. Despite being one of the simplest forms of artificial neural
networks, the Perceptron model proved to be highly effective in solving specific classification
problems, laying the groundwork for advancements in AI and machine learning.
What is Perceptron?
Perceptron is a type of neural network that performs binary classification that maps input features
to an output decision, usually classifying data into one of two categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by
McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in the
development of more advanced neural networks and machine learning algorithms.
Types of Perceptron
1. Single-Layer Perceptron: a type of perceptron limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into distinct categories
by a straight line. While powerful in its simplicity, it struggles with more complex
problems where the relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptrons possess enhanced processing capabilities, as they consist of two or
more layers and are adept at handling more complex patterns and relationships within the data.
The step function compares the weighted sum to a threshold. If the input is larger than the
threshold value, the output is 1; otherwise, it is 0. The most common activation function
used in Perceptrons is the Heaviside step function:
h(z) = 1 if z ≥ 0, and h(z) = 0 if z < 0.
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.
In a fully connected layer, also known as a dense layer, all neurons in one layer are connected to
every neuron in the previous layer.
The output of the fully connected layer is computed as:
f(X) = h(XW + b)
where X is the input, W is the weight matrix connecting the input neurons to the output neurons,
b is the bias, and h is the step function.
During training, the Perceptron’s weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:
w(i,j) = w(i,j) + η (y_j − ŷ_j) x_i
Where:
w(i,j) is the weight between the ith input and the jth output neuron,
x_i is the ith input value,
y_j is the actual value and ŷ_j is the predicted value of the jth output,
η is the learning rate, controlling how much the weights are adjusted.
This process enables the perceptron to learn from data and improve its prediction accuracy over
time.
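A minimal sketch of the Perceptron learning rule on a toy linearly separable problem (logical AND); the data set, learning rate, and epoch count are illustrative choices, not from the text.

```python
import numpy as np

# Perceptron learning rule on the logical AND problem (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])          # AND of the two inputs

w = np.zeros(2)
b = 0.0
eta = 0.1                           # learning rate

def step(z):                        # Heaviside step activation
    return 1 if z >= 0 else 0

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = step(xi @ w + b)
        # Update rule: w <- w + eta * (target - prediction) * input
        w += eta * (target - pred) * xi
        b += eta * (target - pred)

print(w, b)                              # learned weights and bias
print([step(xi @ w + b) for xi in X])    # [0, 0, 0, 1]
```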
Limitations of Perceptron
The Perceptron was a significant breakthrough in the development of neural networks, proving
that simple networks could learn to classify patterns. However, the Perceptron model has certain
limitations that can make it unsuitable for some tasks:
Limited to linearly separable problems
Struggles with convergence when handling non-separable data
Requires labelled data for training
Sensitive to input scaling
Lacks hidden layers for complex decision-making
Example:
Let x = 0.2, w = 0.5, b = 1 and y = 1
Note: In regular practice, the weights (i.e., the w values) are randomly initialized to start with;
setting w to 0 hampers the progress of the network. Therefore, in this example as well, we assume
a randomly chosen value of w = 0.5.
Starting with Forward Propagation
A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs). The network consists of an input layer, one or more hidden layers, and an output
layer. Information flows in one direction—from input to output—hence the name "feedforward."
Forward propagation is part of the inference phase, as it does not involve adjusting weights but
simply computes the network’s predictions.
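A minimal sketch of one forward pass through a small feedforward network; the 3-2-1 layer sizes, weights, and input vector are made-up example values.

```python
import numpy as np

# One forward pass through a tiny 3-2-1 feedforward network.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x  = np.array([0.5, 1.0, 0.2])          # input layer (3 features, example values)
W1 = np.array([[0.1, 0.4],              # input -> hidden weights (3x2, assumed)
               [0.2, 0.5],
               [0.3, 0.6]])
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.7], [0.8]])           # hidden -> output weights (2x1, assumed)
b2 = np.array([0.2])

hidden = sigmoid(x @ W1 + b1)           # hidden layer activations
output = sigmoid(hidden @ W2 + b2)      # network prediction

print(hidden, output)
```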
Backpropagation Algorithm:
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural
networks, particularly feed-forward networks. It works iteratively, minimizing the cost function by
adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic gradient
descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to
effectively navigate complex layers in the neural network to minimize the cost function.
Example:
There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output
Layer. The w1, w2, w3, …, w8 represent the respective weights. b1 and b2 are the biases for the Hidden
Layer and Output Layer, respectively.
Now we pass this weighted sum through the logistic function (sigmoid function) so as to squash
the weighted sum into the range (0, 1). The logistic function is the activation function for our
example neural network.
Now, outputh1 and outputh2 will be considered as inputs to the next layer.
Computing the total error
We started off supposing the expected outputs to be 0.05 and 0.95 respectively for outputo1
and outputo2. Now we will compute the errors based on the outputs received until now and the
expected outputs.
We’ll use the following squared-error formula, summed over the two output units:
E_total = Σ ½ (target − output)²
The Backpropagation
The aim of backpropagation (backward pass) is to distribute the total error back to the network so
as to update the weights in order to minimize the cost function (loss). The weights are updated in
such a way that when the next forward pass utilizes the updated weights, the total error will be
reduced by a certain margin (until the minima is reached).
For weights in the output layer (w5, w6, w7, w8)
For w5,
Let’s compute how much contribution w5 has on E1. If we become clear on how w5 is updated,
then it would be really easy for us to generalize the same to the rest of the weights. If we look
closely at the example neural network, we can see that E1 is affected by outputo1, outputo1 is
affected by sumo1, and sumo1 is affected by w5. It’s time to recall the Chain Rule.
Let’s deal with each component of the above chain separately.
Once we’ve computed all the new weights, we need to update all the old weights with these new
weights. Once the weights are updated, one backpropagation cycle is finished. Now the forward
pass is done and the total new error is computed. And based on this newly computed total error the
weights are again updated. This goes on until the loss value converges to minima. This way a
neural network starts with random values for its weights and finally converges to optimum values.
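A sketch of one forward pass followed by one backpropagation update for the 2-2-2 network described above; only the expected outputs (0.05 and 0.95) come from the text, while the inputs, initial weights, biases, and learning rate are assumed values.

```python
import numpy as np

# One forward + backward pass for a 2-2-2 network with sigmoid activations.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x      = np.array([0.05, 0.10])               # assumed inputs
target = np.array([0.05, 0.95])               # expected outputs (from the text)
W1 = np.array([[0.15, 0.25], [0.20, 0.30]])   # w1..w4 (assumed)
W2 = np.array([[0.40, 0.50], [0.45, 0.55]])   # w5..w8 (assumed)
b1, b2 = 0.35, 0.60                           # biases (assumed)
eta = 0.5                                     # learning rate

# Forward pass
h   = sigmoid(x @ W1 + b1)                    # output_h1, output_h2
out = sigmoid(h @ W2 + b2)                    # output_o1, output_o2
E_total = np.sum(0.5 * (target - out) ** 2)

# Backward pass (chain rule): gradients of E with respect to W2 and W1
delta_out = (out - target) * out * (1 - out)  # dE/dsum_o
dW2 = np.outer(h, delta_out)
delta_h = (delta_out @ W2.T) * h * (1 - h)    # error pushed back to the hidden layer
dW1 = np.outer(x, delta_h)

# Update the weights; repeating this cycle drives the loss toward a minimum.
W2 -= eta * dW2
W1 -= eta * dW1
print(E_total)
```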
Comparison of the forward pass and the backward pass:
Calculation: the forward pass computes the output using the current weights and biases; the
backward pass updates the weights and biases using the calculated gradients.
Information flow: in the forward pass, input data -> output data; in the backward pass, error
signal -> gradient updates.
Gradient Descent:
Consider the following neural network,
Equation 1: y = w1·X1 + w2·X2 + w3·X3 + b
Here w1, w2, w3 are the weights of their corresponding features X1, X2, X3 and b is a constant
called the bias. Its importance is that it gives flexibility. So, using such an equation the machine
tries to predict a value y which may be a value we need like the price of the house. Now, the
machine tries to perfect its prediction by tweaking these weights. It does so, by comparing the
predicted value y with the actual value of the example in our training set and using a function of
their differences. This function is called a loss function.
Equation 2: E = (y_predicted − y_actual)²
(This is the squared error for one example; averaged over all examples it becomes the mean squared
error, MSE.)
The machine tries to decrease this loss function, or error, i.e., it tries to get the predicted
value close to the actual value.
Gradient Descent
This method is the key to minimizing the loss function and achieving our target, which is to predict
close to the original value.
The diagram below shows the loss function graph. It is basically a parabolic, or convex, shape with
a single global minimum, which we need to find in order to reach the minimum loss value. Always try
to use a loss function that is convex in shape in order to get a proper minimum. Now, the predicted
results depend on the weights in the equation. If we substitute Equation 1 into Equation 2 we obtain
this graph, with the weights on the X-axis and the loss on the Y-axis.
Initially, the model assigns random weights to the features. So, say it initializes the weight=a. So,
it generates a loss which is far from the minimum point L-min.
Now, if we move the weights more towards the positive x-axis we can optimize the loss function
and achieve minimum value. But, how will the machine know? We need to optimize weight to
minimize error, so, obviously, we need to check how the error varies with the weights. To do this
we need to find the derivative of the Error with respect to the weight. This derivative is called
Gradient.
Gradient = dE/dw
The Maths
If we use Equations 1 and 2 above, we get
dE/dy_predicted = 2 (y_predicted − y_actual)
This equation shows the change in error with a change in the output prediction for E = MSE.
Now,
dy_predicted/dw_i = X_i
So, this is pretty clear from basic maths. Here ‘i’ can be any integer from 0 to the number of
features − 1.
According to the problem, we need to find dE/dw_i, i.e. the change in error with the change in
the weights.
Now, from the chain rule, we can tell the following:
dE/dw_i = (dE/dy_predicted) × (dy_predicted/dw_i) = 2 (y_predicted − y_actual) · X_i
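A short sketch of gradient descent on the linear model of Equation 1 with the squared-error loss, using the gradient derived above; the synthetic data set and learning rate are example choices, not from the text.

```python
import numpy as np

# Gradient descent for y = w1*X1 + w2*X2 + w3*X3 + b with squared-error loss.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b                      # synthetic targets

w = np.zeros(3)
b = 0.0
alpha = 0.1                                  # learning rate

for epoch in range(200):
    y_pred = X @ w + b
    error = y_pred - y
    # Gradients averaged over examples: dE/dw_i = mean(2 * error * X_i)
    grad_w = 2 * X.T @ error / len(y)
    grad_b = 2 * error.mean()
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)   # should approach the true values [2.0, -1.0, 0.5] and 0.3
```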
In other words, in a problem like this the two classes can easily be separated by drawing a
straight line, which we can devise using Equation 1.
Now, imagine doing so for the following graph.
Will it be possible to classify the points using a normal linear model? The answer is no. In real
problems the classes are often not linearly separable, so it is hard to hand-craft the feature
crosses needed to separate them. This is where neural networks are used. Neural networks are
capable of coming up with a non-linear equation that can serve as a boundary between non-linear
classes.
The way the neural network achieves such non-linear equations is through activation functions.
These activation functions are the units of non-linearity. They are used at every
layer in a neural network. Now, in neural networks, we stack such layers one over another.
Finally, we obtain complex functions through cascading, like f(f(x)).
The common types of activation function are:
1. ReLU: f(x) = max(0, x)
2. Sigmoid: f(x) = 1 / (1 + e^(-x))
3. Tanh: f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
Why is Backpropagation Required?
The minimum of the loss function of the neural network is not very easy to locate because it is not
an easy function like the one we saw for MSE.
Now, as we see in the graph, the loss function may look something like this. It has two minima, a
local one and a global one, so if we end up in the local one we end up in a suboptimal state. Here,
the point where the weights are initialized matters: for example, if the weights are initialized
somewhere near x1, there is a high chance we will get stuck at the local minimum, which is not the
case with the simple MSE loss.
Secondly, Neural networks are of different structures.
So, in neural nets the result Y-output is dependent on all the weights of all the edges. So, the error
is obtained at the last output node and then we need to change w12 and w13 accordingly. So, we
need to backpropagate the error all the way to the input node from the output node.
So, let's say,
Wij is the weight of the edge from the output of the ith node to the input of the jth node. Now, here
the x is the input to every node. y is the output from every node. Except for the input node, for all
nodes,
Y=F(X)
where F is the activation function.
For Input node, Y=X
Now, we can see, the hidden layer nodes have a function F1 but in the output layer, it is F2. The F1
is usually ReLU and F2 is usually a Sigmoid.
So, for optimization of the weights, we need to know dE/dWij for every Wij in the network.
For this, we also need to find dE/dXi and dE/dYi for every node in the network.
Forward Propagation
We know the Neural network is said to use Forward Propagation. It is because the input to a node
in layer k is dependent on the output of a node at layer k-1.
For example, the input to node 4 is X4 = W24·Y2 + W34·Y3, the weighted sum of the outputs of the
nodes feeding into it.
So, generalizing, Xj = Σi Wij·Yi, where the sum runs over all nodes i in the previous layer that
connect to node j.
Here, we can trace the paths or routes of the inputs and outputs of every node very clearly. Now,
we will see the cascading functions building.
In the Y4 and Y5, we can see the cascading of the non-linear activation function, to create the
classifier equation. The more we stack up the layers, the more cascading occurs, the more our
classifier function becomes complex.
Maths for Backpropagation
The error generated is backpropagated through the same nodes and the same edges through which
forward propagation takes place, travelling from the output node back to the input edges.
The first step is at the output node: we compute dE/dY at the final node, i.e. the derivative of
the error with respect to the output Y of the final node.
So, here also we use the chain rule. We obtain the values:
We will try this for two more layers and try to generalize a formula.
We try to calculate dE/dY5 so that we can move to the next level.
We obtain both dE/dY5 and dE/dY4. Now we go for the change in error for a change in input for
node 5 and node 4.
Once we obtain the change with respect to the input, we can easily calculate the change in error
with respect to the weights of the edges incident on that input, using the same method we used for
W56. Here
I am directly writing the result.
These are the changes of error with a change in the weights of edges. Now, calculate for Y2 and
Y3. But, one thing to notice is, when we are going to calculate the change in error with a change
in Y2 and Y3 from backpropagation, they will be affected by both the edges from Y5 and Y4.
So, the change will be a sum of the effect of change in node 4 and node 5.
We can calculate the effects in a similar way we calculated dE/dY5
In general, dE/dYi = Σj Wij·(dE/dXj), where the ith node is in the Lth layer and the jth node is
in the (L+1)th layer.
Now, once we find the change in error with a change in weight for all the edges, we can update the
weights and start learning for the next epoch using the formula
Wij = Wij − α·(dE/dWij)
where α (alpha) is the learning rate. This is how the backpropagation algorithm actually works.
https://www.kaggle.com/code/andls555/explain-forward-propagation