
Unit 1 (1)

This document provides an overview of deep learning, including its fundamentals, types of neural networks, and optimization algorithms. It discusses the architecture of multilayer perceptrons, feed-forward networks, backpropagation, gradient descent, and the vanishing gradient problem. Additionally, it covers various optimization techniques and their significance in training deep learning models.

Uploaded by

pm139581
Copyright
© © All Rights Reserved

Deep Learning

Unit 1
Fundamentals of
Deep Learning
AI Vs ML Vs DL
• AI (1956, John McCarthy): a machine that mimics human behavior.
• ML (1959, Arthur Samuel): a machine that learns. Training time is lower; testing time is higher.
• DL (2000, Igor Aizenberg): algorithms inspired by the structure and functions of the human brain. Training time is higher; testing time is lower.


What is Deep Learning ?
• A type of machine learning based on artificial
neural networks in which multiple layers of
processing are used to extract progressively
higher level features from data.
• Deep learning is part of a broader family of
machine learning methods based on artificial
neural networks with representation learning.
Deep Learning: Why? What? Where?
• Why: huge amounts of data; complex problems.
• What: handles huge amounts of structured and unstructured data; performs complex operations to solve problems; feature extraction.
• Where: medical field; self-driving cars; translation.
Multilayer Perceptron
• Perceptron
• The perceptron is one of the first concepts encountered when learning
Machine Learning and Deep Learning technologies.
• Consists of: a set of weights, input values or
scores, and a threshold.
• Perceptron is a building block of an Artificial
Neural Network.
• In the mid-20th century (1957), Frank Rosenblatt invented
the Perceptron.
• A perceptron performs simple calculations on its
inputs to detect features in the input data.
• Perceptron - a linear Machine Learning algorithm
used for supervised learning of binary
classifiers.
• The algorithm lets a neuron learn from training
examples, processing them one at a time during training.
• Basic Components of Perceptron:
• Input Nodes or Input Layer:
• Primary component of Perceptron, contains a real
numerical value.
• Weight and Bias:
• Weight - represents the strength of the connection
between units.
• The larger the magnitude of a weight, the more the
associated input neuron influences the output.
• Biases - scalar values added to the weighted input to ensure
that at least a few nodes per layer are activated
regardless of signal strength.
• Activation Functions:
• Functions that govern the artificial neuron’s
behavior are called activation functions.
• Passing the input forward through the network, layer by layer, is known as forward
propagation.
• Activation functions transform the combination of
inputs, weights and biases.
• An artificial neuron that passes a nonzero value on to
another artificial neuron is said to be activated.
Types of Activation functions:
• Sign function.
• Step function.
• Sigmoid function.
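The three activation functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the choice of 0 as the step threshold and the value returned by sign at exactly 0 are assumptions, since the slides do not specify them.

```python
import math

def sign(z):
    """Sign function: -1 for negative input, 0 at zero, +1 for positive input."""
    return (z > 0) - (z < 0)

def step(z, threshold=0.0):
    """Step function: 1 once the input reaches the threshold (assumed 0), else 0."""
    return 1 if z >= threshold else 0

def sigmoid(z):
    """Sigmoid function: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate each function on a few sample inputs.
samples = [-2.0, 0.0, 2.0]
sign_values = [sign(z) for z in samples]
step_values = [step(z) for z in samples]
sigmoid_values = [sigmoid(z) for z in samples]
```

Sign and step produce hard, discrete outputs, while sigmoid produces a smooth value that can be differentiated, which matters later for gradient-based training.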
Perceptron Algorithm
• 1. Multiply each input value by
its weight.
• 2. Add the products together to create the weighted
sum.
• 3. Apply the activation function 'g' to the
weighted sum to obtain the output.
• In the basic perceptron the activation function is a step
function, represented by 'g'.
• The perceptron model works in two steps.
• Step-1
• Compute the weighted sum.
• ∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
• Add a bias 'b' to improve the model's performance.
• ∑ wi*xi + b
• Step-2
• Apply the activation function; the output is either
binary or a continuous value.
• Y = g(∑ wi*xi + b)
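The two steps above (weighted sum plus bias, then activation) can be sketched as follows. The weights and bias are hand-picked, hypothetical values that happen to make the perceptron behave like a logical AND gate; they are not taken from the slides.

```python
def weighted_sum(weights, inputs, bias):
    """Step 1: sum of w_i * x_i over all inputs, plus the bias b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def step(z):
    """Step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron_output(weights, inputs, bias):
    """Step 2: apply the activation function to the weighted sum."""
    return step(weighted_sum(weights, inputs, bias))

# Hypothetical parameters chosen by hand so the unit computes AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

truth_table = {
    (x1, x2): perceptron_output(and_weights, [x1, x2], and_bias)
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

Only the input (1, 1) pushes the weighted sum above zero here, so the bias effectively acts as the threshold.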
• An MLP adds one or more hidden layers between the input and output layers.
• An MLP is typically trained with the backpropagation
algorithm.
• Two Execution Stages:
• Forward Stage:
• Activations are computed starting from the input layer
and ending at the output
layer.
• Backward Stage:
• Weight and bias values are adjusted to reduce the
error between the output and the target.
• Advantages of Multi-Layer Perceptron:
• Can be used to solve complex non-linear problems.
• Works well with both small and large input data.
• Helps to obtain quick predictions after the training.
• Helps to obtain the same accuracy ratio with large
as well as small data.
• Disadvantages of Multi-Layer Perceptron:
• Computations are difficult and time-consuming.
• Difficult to tell how much each independent
variable affects the dependent variable.
• Model functioning depends on the quality of the
training.
Feed-Forward Neural Network
• A feed-forward neural network (FFN) is an artificial neural
network in which the connections between
nodes do not form a cycle.
• Information is only processed in one direction.
• Although data may pass through multiple hidden nodes,
it always moves in one direction and never
backwards.
• A single input layer.
• One or many hidden layers, fully connected
• A single output layer.
• Input layer:
• The number of neurons in an input layer is typically
equal to the number of input features to the network.
• Input layers are followed by one or more hidden
layers.
• Input layers in classical feed-forward neural
networks are fully connected to the next hidden
layer.
• Hidden layer:
• One or more hidden layers in a feed-forward neural
network.
• Weight values on the connections between the
layers are how neural networks encode the learned
information extracted from the raw training data.
• Hidden layers are the key to allowing neural
networks to model nonlinear functions.
• Output layer:
• The answer or prediction of the model comes from the
output layer.
• Depending on the setup of the neural network, the
final output may be a real-valued output
(regression) or a set of probabilities (classification).
• Controlled by the activation function.
• Typically uses either a softmax or sigmoid
activation function for classification.
• Connections between layers :
• Each neuron in one layer connects to every neuron in the next layer.
• Weights change progressively as algorithm finds the
best solution.
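A forward pass through a small fully connected network can be sketched as below: three input features, one hidden layer with sigmoid activations, and a softmax output layer that returns class probabilities. The layer sizes and all weight values are hypothetical, chosen only to make the example concrete.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtracted for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Fully connected layer: weights[j][i] links input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Data flows input -> hidden -> output and never backwards."""
    hidden = [sigmoid(z) for z in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical parameters: 3 inputs, 2 hidden neurons, 2 output classes.
x = [0.5, -1.0, 2.0]
hidden_w = [[0.1, 0.2, -0.1], [0.4, -0.3, 0.2]]
hidden_b = [0.0, 0.1]
out_w = [[0.3, -0.2], [-0.5, 0.6]]
out_b = [0.0, 0.0]

probs = forward(x, hidden_w, hidden_b, out_w, out_b)
```

For a regression output, the softmax in the last line of `forward` would be replaced by the raw (real-valued) output of the final layer.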
Backpropagation Learning
• Backpropagation is an important part of reducing error
in a neural network model.
• Backpropagation describes how error information flows backward through a
feed-forward neural network.
• Algorithm intuition :
• Backpropagation learning is similar to the perceptron
learning algorithm.
• Compute the input example’s output with a forward
pass through the network.
• If the output matches the label, we don’t do anything.
• If the output does not match the label, we need to
adjust the weights on the connections in the neural
network.
[Figure: General neural network training pseudo code]
• With the perceptron learning algorithm, it's easy
because there is only one weight per input to
influence the output value.
• With feed-forward multilayer networks,
many weights connect each input to
the output, so it becomes more difficult.
• Each weight contributes to more than one output.
• Backpropagation minimizes the error between
the label (or "actual") output associated with the
training input and the value generated from the
network output.
[Figure: Backpropagation algorithm for updating weights, pseudo code]
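As a rough illustration of the weight-update idea, the sketch below runs one backpropagation step on a tiny network with two inputs, two sigmoid hidden neurons and one sigmoid output. It is not the pseudo code from the original slide; the squared-error loss, the omission of biases, the learning rate and all initial weights are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass; returns hidden activations and the output (biases omitted)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def loss(x, target, w_hidden, w_out):
    """Assumed loss: half the squared difference between output and label."""
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * (output - target) ** 2

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient step: push the output error back through the layers."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output neuron (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signal at each hidden neuron, weighted by its outgoing connection.
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w_out[j] - lr * delta_out * hidden[j]
                 for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out

# Hypothetical training example and starting weights.
x, target = [1.0, 0.5], 1.0
w_hidden = [[0.2, -0.4], [0.7, 0.1]]
w_out = [0.3, -0.6]

loss_before = loss(x, target, w_hidden, w_out)
w_hidden_new, w_out_new = backprop_step(x, target, w_hidden, w_out)
loss_after = loss(x, target, w_hidden_new, w_out_new)
```

After the step, `loss_after` is lower than `loss_before`, which is the sense in which the update reduces the error between the label and the network output.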
Gradient Descent
(Steepest Descent)
• Introduced by Augustin-Louis Cauchy in the mid-19th
century (1847).
• Gradient descent is one of the most
commonly used iterative optimization algorithms
for training machine learning
and deep learning models.
• It helps in finding a local minimum of a function.

• Cost-function Vs Loss Function :


• The loss function calculates the error per
observation, whilst the cost function calculates the
error over the whole dataset.
• In GD, we can imagine the quality of our network's
predictions as a landscape.
• Hills represent locations (parameter values or
weights) that give a lot of prediction error; valleys
represent locations with less error.
• Choose one point on that landscape at which to
place our initial weight.
• Select the initial weight randomly.
• Move the weight downhill, to areas of lower error.
• GD measures the actual slope of the hills with
regard to each weight, so it knows which direction is down.
• GD measures the slope and takes the weight one
step toward the bottom of the valley.
• The slope is found by taking the derivative of the loss function, which produces
the gradient.
• In convex optimization, we look for the point at
which the derivative is equal to 0.
• This point is known as a stationary point of the function;
for a convex function it is the minimum point.
• The process of measuring loss and changing the weight
by one step in the direction of less error is repeated
until the weight arrives at a point beyond which it
cannot go lower.
• Learning Rate:
• Defined as the step size taken to reach the
minimum or lowest point.
• Typically a small value that is evaluated and
updated based on the behavior of the cost
function.
• Learning rate is high:
• Result: larger steps, but with a risk of
overshooting the minimum.
• Learning rate is low:
• Result: smaller steps, which give more
precision but slower convergence.
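The effect of the learning rate can be seen on a one-variable example. The sketch below assumes the function f(x) = x**2 (gradient 2x), a starting point of 5.0 and 20 steps; these are illustrative choices, not values from the slides.

```python
def descend(lr, start=5.0, steps=20):
    """Run gradient descent on the assumed function f(x) = x**2."""
    x = start
    for _ in range(steps):
        x -= lr * 2 * x  # the gradient of x**2 is 2x
    return x

x_low = descend(lr=0.01)       # small steps: precise but slow
x_moderate = descend(lr=0.1)   # larger steps: much closer to the minimum at 0
x_too_high = descend(lr=1.1)   # overshoots on every step and moves away from 0
```

With the low rate the weight is still far from the minimum after 20 steps, with the moderate rate it is close, and with the too-high rate each step overshoots so far that the distance from the minimum grows.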
Gradient Descent Procedure
• Suppose we have a function f(x), where x is a tuple
of several variables, x = (x_1, x_2, …, x_n).
• Suppose the gradient of f(x) is given by ∇f(x).
• We want to find the values of the variables (x_1,
x_2, …, x_n) that give us the minimum of the
function.
• At any iteration t, we'll denote the value of the
tuple x by x[t].
• So x[t][1] is the value of x_1 at iteration t, x[t][2] is
the value of x_2 at iteration t, etc.
Notation
• t = Iteration number
• T = Total iterations
• n = Total variables in the domain of f (also called
the dimensionality of x)
• j = Iterator for variable number, e.g., x_j represents
the jth variable
• 𝜂 = Learning rate
• ∇f(x[t]) = Value of the gradient vector of f at
iteration t
Training Method
Choose a random initial point x_initial and set
x[0] = x_initial
For iterations t = 1..T:
    Update x[t] = x[t-1] - 𝜂∇f(x[t-1])
• The learning rate 𝜂 is a user-defined value.
• Its value typically lies in the range [0,1].
Two iterations of the algorithm, T=2 and 𝜂=0.1
(the gradients below are consistent with f(x, y) = x² + 2y², for which ∇f = (2x, 4y))
1. Initial t=0
x[0] = (4,3) # This is just a randomly chosen point
2. At t=1
x[1] = x[0] - 𝜂∇f(x[0])
x[1] = (4,3) - 0.1*(8,12)
x[1] = (3.2,1.8)
3. At t=2
x[2] = x[1] - 𝜂∇f(x[1])
x[2] = (3.2,1.8) - 0.1*(6.4,7.2)
x[2] = (2.56,1.08)
The procedure will eventually end up at the point
where the function is minimum, i.e., (0,0).
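The worked iterations can be reproduced in code. The slides do not state the function being minimized; the sketch assumes f(x, y) = x**2 + 2*y**2, whose gradient (2x, 4y) matches the values (8, 12) and (6.4, 7.2) used in the example.

```python
def grad_f(point):
    """Gradient of the assumed function f(x, y) = x**2 + 2*y**2."""
    x, y = point
    return (2 * x, 4 * y)

def gradient_descent(start, lr, iterations):
    """Apply x[t] = x[t-1] - lr * grad_f(x[t-1]) and keep every iterate."""
    history = [start]
    point = start
    for _ in range(iterations):
        gradient = grad_f(point)
        point = tuple(p - lr * g for p, g in zip(point, gradient))
        history.append(point)
    return history

# Same setup as the worked example: start at (4, 3), learning rate 0.1, T = 2.
history = gradient_descent((4.0, 3.0), lr=0.1, iterations=2)

# Many more iterations move the point toward the minimum at (0, 0).
long_run = gradient_descent((4.0, 3.0), lr=0.1, iterations=200)[-1]
```

`history[1]` and `history[2]` correspond to x[1] and x[2] in the example (up to floating-point rounding).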
Types of Gradient Descent
• Batch Gradient Descent.
• Stochastic Gradient Descent.
• Mini-Batch Gradient Descent.
Batch Gradient Descent
• Also known as vanilla gradient descent.
• Calculates the error for each example within the
training dataset.
• The model is not updated until every training
sample has been evaluated.
• One such full pass is referred to as a cycle or a
training epoch.
• Advantages of Batch gradient descent:
• Produces less noise than the other gradient
descent variants.
• Produces stable gradient descent convergence.
• Computationally efficient, as all resources are used
to process all training samples together.
Stochastic gradient descent
• Stochastic gradient descent (SGD) is a type of
gradient descent that updates the parameters using one training example
per iteration.
• As it requires only one training example at a time,
it is easier to fit in allocated memory.
• Because of its frequent updates, its gradient is treated as
noisy.
• Advantages :
• It is easier to fit in the available memory.
• Each update is faster to compute than in batch gradient
descent.
• It is more efficient for large datasets.
MiniBatch Gradient Descent:
• Combination of both batch gradient descent and
stochastic gradient descent.
• Divides the training dataset into small batches.
• Performs an update on each of those batches separately.
• This balances the two: higher computational
efficiency than SGD and a less noisy
gradient.
• Advantages :
• It is easier to fit in allocated memory.
• It is computationally efficient.
• It produces stable gradient descent convergence.
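The three variants differ only in how many examples feed each update, so one training loop with a `batch_size` parameter can illustrate all of them. The sketch below is a simplification under stated assumptions: a one-parameter model y_hat = w * x, a mean-squared-error loss, noiseless hypothetical data on the line y = 3x, and no shuffling between epochs.

```python
def gradient(w, batch):
    """Gradient of mean squared error for y_hat = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.01, epochs=50):
    """Return the learned weight and the number of parameter updates made."""
    w, updates = 0.0, 0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)
            updates += 1
    return w, updates

# Hypothetical dataset: 8 points lying exactly on y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_batch, n_batch = train(data, batch_size=len(data))  # batch: 1 update per epoch
w_sgd, n_sgd = train(data, batch_size=1)              # SGD: 1 update per example
w_mini, n_mini = train(data, batch_size=4)            # mini-batch: in between
```

All three runs recover a weight close to 3, but batch gradient descent makes one update per epoch, SGD makes one per example, and mini-batch sits between the two.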
VANISHING GRADIENT PROBLEM
• A phenomenon that occurs during the training of
deep neural networks, where the gradients that
are used to update the network become extremely
small or "vanish"
as they are backpropagated from the output
layers to the earlier layers.
• The backpropagation algorithm calculates gradients by
propagating the error from the output layer to the
input layer.
• Consequences of the vanishing gradient problem include slow convergence,
the network getting stuck in poor local minima, and impaired
learning of deep representations.
How do you overcome the vanishing gradient
problem?
• Skip connections
• Provides direct connections between layers.
• Allowing the gradients to bypass multiple layers
during backpropagation.
• Residual neural networks (ResNets)
• ResNets use skip connections to learn the
residual mapping,
• Enabling easier gradient flow & efficient training
of deep neural networks.
• Rectified linear unit (ReLU)
• ReLU avoids the saturation of the sigmoid
and hyperbolic tangent functions, which can cause the
gradients to vanish.
• Long Short Term Memory
• A type of recurrent neural network for
sequential data or time series data.
• The vanishing gradient problem occurs when the
gradients become too small; gradients that become too large cause the related exploding gradient problem.
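A back-of-the-envelope sketch shows why saturating activations make gradients vanish. During backpropagation the gradient reaching an early layer is multiplied by one activation derivative per layer; the sigmoid derivative is never larger than 0.25, while the ReLU derivative is 1 for positive inputs. The depth of 20, the pre-activation value of 0.5 and the choice to treat every weight as 1 are simplifying assumptions.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def backpropagated_factor(derivative, pre_activations):
    """Product of activation derivatives along one path (weights taken as 1)."""
    factor = 1.0
    for z in pre_activations:
        factor *= derivative(z)
    return factor

depth = 20
pre_activations = [0.5] * depth  # hypothetical value at every layer

sigmoid_factor = backpropagated_factor(sigmoid_derivative, pre_activations)
relu_factor = backpropagated_factor(relu_derivative, pre_activations)
```

Under these assumptions the sigmoid path shrinks the gradient by many orders of magnitude across 20 layers, while the ReLU path leaves it unchanged.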
Optimization Algorithms
• Optimization Algorithms :
• Training a model in machine learning involves
finding the best set of values for the parameter
vector of the model.
• This is an optimization problem in which we minimize the loss
function with respect to the parameters of the
prediction function.
• The best set of values for the parameter vector
is the one with the lowest loss.
• Divide the Algorithms :
• First – order.
• Second – order.
• First-order optimization algorithms calculate
the Jacobian matrix.
• Jacobian is a matrix of partial derivatives of
loss function values with respect to each
parameter.
• To calculate partial derivatives, all other
variables are momentarily treated as
constants.
• Algorithm takes one step in the direction
specified by the Jacobian.
First-order methods
• Taking one step at a time to reach an objective,
first-order methods calculate a gradient (Jacobian)
at each step to determine which direction to go in
next.
• Each iteration, or step, we are trying to find the
next best possible direction to go, as defined by
objective function.
• Consider optimization algorithms to be a "search."
• Finding a path toward minimal error.
• Gradient descent is a member of path-finding class
of algorithms.
The Jacobian Matrix
• Consider first a function that maps u real inputs to
a single real output: $f : \mathbb{R}^u \to \mathbb{R}$.
• For an input vector x of length u, the
Jacobian vector of size 1 × u can be defined
as $J = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_u} \right]$.
• Consider another function that maps u real inputs
to v real outputs: $f : \mathbb{R}^u \to \mathbb{R}^v$.
• For the same input vector x of length u, the
Jacobian is now a v × u matrix with entries $J_{ij} = \frac{\partial f_i}{\partial x_j}$.
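The shape of the Jacobian can be checked numerically. The sketch below approximates each partial derivative with a central difference; the example function (two inputs, three outputs) and the step size `eps` are hypothetical choices for illustration.

```python
def jacobian(func, x, eps=1e-6):
    """Approximate the v x u Jacobian of func at x using central differences."""
    u = len(x)
    v = len(func(x))
    result = [[0.0] * u for _ in range(v)]
    for j in range(u):
        forward = list(x)
        backward = list(x)
        forward[j] += eps
        backward[j] -= eps
        f_plus, f_minus = func(forward), func(backward)
        for i in range(v):
            # Partial derivative of output i with respect to input j.
            result[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return result

def example(x):
    """Hypothetical function with u = 2 inputs and v = 3 outputs."""
    return [x[0] * x[1], x[0] ** 2, 3.0 * x[1]]

J = jacobian(example, [2.0, 5.0])
# Analytic Jacobian at (2, 5), for comparison: [[5, 2], [4, 0], [0, 3]]
```

The result has v = 3 rows and u = 2 columns, one row per output and one column per input.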

• Second-order algorithms calculate the
derivative of the Jacobian (i.e., the derivative
of a matrix of derivatives) by approximating
the Hessian.
• Second order methods take into account
interdependencies between parameters
when choosing how much to modify each
parameter.
• Second-order methods can take "better"
steps; however, each step will take longer to
calculate.
• Hessian Matrix
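• We have a function f of n variables, $f : \mathbb{R}^n \to \mathbb{R}$.
• The Hessian of f is the n × n matrix of second-order partial derivatives, $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.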
Second-order methods
• The Hessian is a matrix of second-order partial derivatives,
analogous to "tracking acceleration rather than
speed."
• The Hessian's job is to describe the curvature at
each point, i.e., how the Jacobian changes.
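To see what curvature information buys, the sketch below compares one Newton step (which uses the Hessian) with one plain gradient step. It assumes the quadratic f(x, y) = x**2 + 2*y**2 as the objective; the function, the starting point and the learning rate are illustrative assumptions.

```python
def gradient(point):
    """Gradient of the assumed function f(x, y) = x**2 + 2*y**2."""
    x, y = point
    return [2 * x, 4 * y]

def hessian(point):
    """Hessian of the same function; constant because f is quadratic."""
    return [[2.0, 0.0], [0.0, 4.0]]

def solve_2x2(a, b):
    """Solve the 2 x 2 linear system a * s = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def newton_step(point):
    """Second-order step: subtract (Hessian inverse) times gradient."""
    step = solve_2x2(hessian(point), gradient(point))
    return [p - s for p, s in zip(point, step)]

def gradient_step(point, lr=0.1):
    """First-order step: subtract the learning rate times the gradient."""
    return [p - lr * g for p, g in zip(point, gradient(point))]

start = [4.0, 3.0]
after_newton = newton_step(start)
after_gradient = gradient_step(start)
```

On this quadratic the Newton step lands on the minimum in a single step, while the gradient step only moves part of the way, at the price of forming and solving with the Hessian.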
• Second-order methods include:
• Limited-memory BFGS (L-BFGS).
• Conjugate gradient
• Hessian-free
• L-BFGS is an optimization algorithm and a so-called
quasi-Newton method.
• It's a variation of the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) algorithm, and it limits how much
gradient is stored in memory.
• The algorithm does not compute the full Hessian
matrix, which would be more computationally expensive.
• Instead, L-BFGS stores only a few vectors that
represent a local approximation of the Hessian.
• L-BFGS performs faster because it uses
approximated second-order information.
• L-BFGS and conjugate gradient in practice can be
faster and more stable than SGD methods.
• Conjugate gradient guides the direction of the line
search process based on conjugacy information.
• Conjugate gradient methods focus on minimizing
the conjugate L2 norm.
• L2-norm is also known as least squares. It is
basically minimizing the sum of the square of the
differences between the target value and the
estimated values.
• Conjugate gradient is very similar to gradient
descent in that it performs line search.
• The major difference is that conjugate gradient
requires each successive step in the line search
process to be conjugate to the previous ones with respect to direction.
• Hessian-free
• Hessian-free optimization is related to Newton's
method, but it better minimizes the quadratic
function.
• It is a powerful optimization method adapted to
neural networks by James Martens in 2010.
• We find the minimum of the quadratic function
with an iterative method called conjugate gradient.
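A minimal sketch of the conjugate gradient iteration on a quadratic function follows. It minimizes 0.5 * x^T A x - b^T x, which is the same as solving A x = b; the 2 x 2 matrix `a`, the vector `b` and the starting point are hypothetical, and this is the textbook linear conjugate gradient rather than the full Hessian-free training procedure.

```python
def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(a, b, x0, tol=1e-10):
    """Minimize 0.5 * x^T a x - b^T x for a symmetric positive definite a."""
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(a, x))]  # residual = -gradient
    d = list(r)                                         # first search direction
    directions = []
    for _ in range(len(b)):
        rr = dot(r, r)
        if rr < tol:
            break
        ad = matvec(a, d)
        alpha = rr / dot(d, ad)  # exact line search along direction d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, ad)]
        directions.append(d)
        beta = dot(r, r) / rr    # makes the next direction conjugate to d
        d = [ri + beta * di for ri, di in zip(r, d)]
    return x, directions

# Hypothetical symmetric positive definite system.
a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution, directions = conjugate_gradient(a, b, x0=[2.0, 1.0])
conjugacy = dot(directions[0], matvec(a, directions[1]))
```

For a problem with two variables the method finishes in two line searches, and the two search directions satisfy the conjugacy condition (`conjugacy` is zero up to rounding).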
Hyper Parameters
• Hyperparameters are the variables that
determine the network structure.
• E.g.: number of hidden layers.
• They also include the variables that determine how the network
is trained.
• E.g.: learning rate.
• Hyperparameters are set before training (before
optimizing the weights and biases).
Hyper parameters
• Layer size.
• Magnitude (momentum, learning rate).
• Regularization (dropout, drop connect, L1,
L2)
• Activations (activation function families)
• Weight initialization strategy.
• Loss functions
• Settings for epochs during training (mini-batch
size)
• Normalization scheme for input data
(vectorization).
Layer Size
• Layer size : Number of neurons in a layer.
• Input and output layers are easy to figure out.
• Deciding neuron counts for hidden layer is a
challenge.
• Neurons come with a cost.
• Connection schema between layers can vary.
• Weights on the connections, are the parameters we
must train.
• More parameters increase the amount of effort
needed to train the network.
• This leads to long training times, or to models that struggle to
converge.
Magnitude – Hyper parameter
• The magnitude group involves the gradient, step size, and
momentum.
• Learning rate defines how quickly a network
updates its parameters.
• A low learning rate slows down the learning process
but converges smoothly.
• A larger learning rate speeds up the learning but
may not converge.
• Momentum helps to determine the direction of the next
step using knowledge of the previous steps.
• Training can be sped up by increasing momentum.
• Momentum is a factor between 0.0 and 1.0,
applied to the change rate of the weights.
• Typically, the value for momentum is between 0.9
and 0.99.
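The momentum update can be sketched as a velocity term that accumulates past steps. The example assumes the classical momentum rule, the test function f(x) = x**2 and a deliberately small learning rate of 0.01; none of these specifics come from the slides.

```python
def momentum_descent(grad, start, lr=0.01, momentum=0.9, steps=50):
    """Gradient descent with a velocity that remembers previous steps."""
    x, velocity = start, 0.0
    for _ in range(steps):
        # Keep a fraction of the previous step, then add the new gradient step.
        velocity = momentum * velocity - lr * grad(x)
        x += velocity
    return x

def grad(x):
    """Gradient of the assumed function f(x) = x**2."""
    return 2 * x

x_plain = momentum_descent(grad, start=5.0, momentum=0.0)     # no momentum
x_momentum = momentum_descent(grad, start=5.0, momentum=0.9)  # typical value
```

With the same learning rate and number of steps, the run with momentum 0.9 ends noticeably closer to the minimum at 0 than the run without momentum.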
• Adaptive Gradient Algorithm (Adagrad) is an
algorithm for gradient-based optimization.
• AdaGrad is a technique to help find the "right"
learning rate.
• AdaGrad's learning rate is monotonically decreasing; it never
increases.
• AdaGrad divides the learning rate by the square root of the sum of squares of
the history of gradients.
• AdaGrad speeds training in the beginning and slows it
appropriately toward convergence.
• RMSprop (Root Mean Square Propagation) is a very
effective, but unpublished, adaptive learning
rate method.
• AdaDelta is a variant of AdaGrad that keeps only the
most recent history.
• Adam (adaptive moment estimation).
• Derives learning rates from estimates of first and
second moments of the gradients.
• First moment: exponentially decaying average of the gradients.
• Second moment: exponentially decaying average of the squared gradients.
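Two of the adaptive methods above can be sketched for a single parameter. The first function tracks AdaGrad's effective learning rate, and the second runs Adam with bias-corrected moment estimates. The gradient sequence, the test function f(x) = x**2 and the hyperparameter values (`lr`, `beta1`, `beta2`, `eps`) are assumptions for illustration, using commonly quoted defaults for the betas.

```python
import math

def adagrad_effective_rates(gradients, lr=0.1, eps=1e-8):
    """Effective rate lr / sqrt(sum of squared past gradients) after each step."""
    total, rates = 0.0, []
    for g in gradients:
        total += g * g
        rates.append(lr / (math.sqrt(total) + eps))
    return rates

def adam_minimize(grad, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam for one parameter: step size derived from two moment estimates."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first moment: average gradient
        v = beta2 * v + (1 - beta2) * g * g  # second moment: average squared gradient
        m_hat = m / (1 - beta1 ** t)         # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical gradient history and test function f(x) = x**2.
rates = adagrad_effective_rates([1.0, 0.5, 2.0, 0.1])
x_final = adam_minimize(lambda x: 2 * x, x=5.0)
```

The AdaGrad rates only ever shrink, because the sum of squared gradients can only grow, which is the monotonic behavior noted above.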
Regularization
• Regularization is a measure taken against
overfitting.
• Overfitting : when a model describes the training
set but cannot generalize well over new inputs.
• Overfitted models have little predictive capacity for
data that they haven't seen.
• Geoffrey Hinton described the best way to build a
neural network model:
• Cause it to overfit, and then regularize it to death.
• Regularization modifies the gradient so that it
doesn't step in directions that lead it to overfit.
• Regularization includes :
• Dropout
• Drop Connect
• L1 penalty
• L2 penalty
• Dropout :
• Dropout randomly drops neurons during training
so that they do not contribute to the forward pass
and backpropagation.
• Dropout is a mechanism used to improve the
training of neural networks by randomly omitting hidden
units.
• It also speeds training.
• DropConnect :
• DropConnect does the same thing as Dropout, but
instead of choosing a hidden unit, it mutes the
connection between two neurons.
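The difference between the two techniques is what gets zeroed: Dropout silences whole units, DropConnect silences individual connections. The sketch below uses the common "inverted dropout" variant, which rescales the surviving activations; that rescaling, the drop probability of 0.5 and the seeded random generator are assumptions not stated in the slides, and the DropConnect helper applies no rescaling.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each unit with probability drop_prob; rescale the survivors."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def drop_connect(weights, drop_prob, rng):
    """Zero individual connections (weights) instead of whole units."""
    keep = 1.0 - drop_prob
    return [[w if rng.random() < keep else 0.0 for w in row] for row in weights]

rng = random.Random(0)  # seeded so the example is repeatable

hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
zeroed_units = sum(1 for a in dropped if a == 0.0)

weights = [[0.2, -0.4, 0.1], [0.7, 0.1, -0.3]]
masked_weights = drop_connect(weights, drop_prob=0.5, rng=rng)
```

Both masks are applied only during training; at prediction time every unit and connection is used.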
Penalty Methods :
• Regularization :
• Regularization is a way to avoid overfitting by
penalizing high-valued regression coefficients.
• Regression coefficients are used to predict the
value of an unknown variable using a known
variable.
• It reduces parameters and shrinks (simplifies) the
model.
• Regularization adds penalties to more complex models.
• Model with the lowest “overfitting” score is usually
the best choice for predictive power.
• Regularization works by biasing data towards
particular values (such as small values near zero).
• L1 regularization adds an L1 penalty equal to
the sum of the absolute values of the coefficients.
• In other words, it limits the size of the coefficients.
• L1 can yield sparse models (i.e. models with few
coefficients);
• Some coefficients can become zero and
eliminated.
• Lasso regression uses this method.
• L2 regularization adds an L2 penalty equal to the
sum of the squares of the coefficients.
• L2 will not yield sparse models and all coefficients
are shrunk by the same factor (none are
eliminated).
• Ridge regression and SVMs use this method.
• Elastic nets combine L1 & L2 methods, but do add
a hyperparameter.
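The penalty terms can be written out directly. In the sketch below, `lam` is the regularization strength and `l1_ratio` is the extra hyperparameter that elastic net introduces to mix the two penalties; both names, the coefficient values and the data loss of 1.0 are hypothetical.

```python
def l1_penalty(coefficients, lam):
    """L1 penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefficients)

def l2_penalty(coefficients, lam):
    """L2 penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c * c for c in coefficients)

def elastic_net_penalty(coefficients, lam, l1_ratio):
    """Mix of both penalties; l1_ratio = 1 is pure L1, 0 is pure L2."""
    return (l1_ratio * l1_penalty(coefficients, lam)
            + (1.0 - l1_ratio) * l2_penalty(coefficients, lam))

def regularized_loss(data_loss, penalty):
    """Training objective: fit to the data plus the complexity penalty."""
    return data_loss + penalty

# Hypothetical coefficients; one is already exactly zero (a sparse model).
coefficients = [0.5, -2.0, 0.0, 3.0]
l1 = l1_penalty(coefficients, lam=0.1)
l2 = l2_penalty(coefficients, lam=0.1)
mixed = elastic_net_penalty(coefficients, lam=0.1, l1_ratio=0.5)
total = regularized_loss(data_loss=1.0, penalty=l2)
```

Because the L2 term squares each coefficient, the large coefficient 3.0 dominates it, whereas the L1 term grows only linearly with coefficient size.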
Mini-batching
• Batch size affects training.
• Using a very small batch size can lead to slower
convergence of the model.
• Both too small and too large a batch size can affect
training badly.
• A batch size of 32 or 64 is often a
good starting point.
