Unit 1
Fundamentals of Deep Learning
AI vs ML vs DL
[Figure: timeline showing AI (1956), ML (1959), and DL (2000)]
Second-order methods
• The Hessian of f is the matrix of its second-order
partial derivatives: H(f)_ij = ∂²f / (∂x_i ∂x_j).
• Using the Hessian is analogous to "tracking acceleration
rather than speed."
• The Hessian's job is to describe the curvature at each
point of the Jacobian.
• Second-order methods include:
• Limited-memory BFGS (L-BFGS).
• Conjugate gradient
• Hessian-free
• L-BFGS is an optimization algorithm and a so-called
quasi-Newton method.
• It's a variation of the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) algorithm that limits how much gradient
history is stored in memory.
• The algorithm does not compute the full Hessian
matrix, which would be more computationally expensive.
• Instead, L-BFGS stores only a few vectors that
represent a local approximation of the Hessian.
• L-BFGS performs faster because it uses
approximated second-order information.
• L-BFGS and conjugate gradient in practice can be
faster and more stable than SGD methods.
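As a rough illustration (not part of the original slides), the sketch below minimizes a simple quadratic with L-BFGS through SciPy's minimize interface; the objective, starting point, and the maxcor history size are illustrative choices.

import numpy as np
from scipy.optimize import minimize

def f(w):
    # simple convex objective: (w0 - 3)^2 + 10 * (w1 + 1)^2
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad_f(w):
    # analytic gradient, so L-BFGS does not fall back to finite differences
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

w0 = np.zeros(2)                           # starting point
res = minimize(f, w0, jac=grad_f, method="L-BFGS-B",
               options={"maxcor": 10})     # keep only 10 correction pairs in memory
print(res.x)                               # approximately [3.0, -1.0]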
• Conjugate gradient guides the direction of the line
search process based on conjugacy information.
• Conjugate gradient methods focus on minimizing
the conjugate L2 norm.
• L2-norm is also known as least squares. It is
basically minimizing the sum of the square of the
differences between the target value and the
estimated values.
• Conjugate gradient is very similar to gradient
descent in that it performs line search.
• The major difference is that conjugate gradient requires
each successive step in the line search process to be
conjugate to the directions of the previous steps (see
the sketch below).
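A minimal sketch (plain NumPy, not from the slides) of linear conjugate gradient minimizing the quadratic f(x) = 1/2 x^T A x - b^T x: each new search direction is kept A-conjugate to the previous ones.

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=100):
    x = x0.copy()
    r = b - A @ x          # residual = negative gradient of the quadratic
    d = r.copy()           # first search direction = steepest descent
    for _ in range(max_iter):
        alpha = (r @ r) / (d @ A @ d)     # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # enforces A-conjugacy of directions
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))  # matches np.linalg.solve(A, b)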
• Hessian-free
• Hessian-free optimization is related to Newton's
method, but it better minimizes the quadratic
function.
• It is a powerful optimization method adapted to
neural networks by James Martens in 2010.
• We find the minimum of the quadratic function
with an iterative method called conjugate gradient.
Hyper Parameters
• Hyperparameters are the variables that determine the
network structure.
• E.g., number of hidden layers.
• They are also the variables that determine how the
network is trained.
• E.g., learning rate.
• Hyperparameters are set before training (before
optimizing the weights and bias).
Hyper parameters
• Layer size.
• Magnitude (momentum, learning rate).
• Regularization (dropout, drop connect, L1,
L2)
• Activations (activation function families)
• Weight initialization strategy.
• Loss functions
• Settings for epochs during training (mini-
batch size)
• Normalization scheme for input data
(Vectorization).
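As a small illustration (hypothetical values, not from the slides), the hyperparameters listed above might be gathered into a single configuration before training begins:

config = {
    "hidden_layer_sizes": [256, 128],      # layer size
    "learning_rate": 0.01,                 # magnitude
    "momentum": 0.9,                       # magnitude
    "dropout_rate": 0.5,                   # regularization
    "l2_penalty": 1e-4,                    # regularization
    "activation": "relu",                  # activation function family
    "weight_init": "xavier",               # weight initialization strategy
    "loss": "cross_entropy",               # loss function
    "epochs": 20,                          # training epochs
    "mini_batch_size": 32,                 # mini-batch size
    "input_normalization": "standardize",  # normalization scheme for input data
}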
Layer Size
• Layer size: the number of neurons in a layer.
• Input and output layers are easy to figure out.
• Deciding neuron counts for the hidden layers is a
challenge.
• Neurons come with a cost.
• Connection schema between layers can vary.
• The weights on the connections are the parameters we
must train.
• More parameters increase the amount of effort needed
to train the network.
• With long training times, models can struggle to find
convergence.
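A tiny worked example (illustrative layer sizes) of why layer size drives the parameter count, and hence the training effort:

def dense_layer_params(n_in, n_out):
    # one weight per input-output connection, plus one bias per output neuron
    return n_in * n_out + n_out

# e.g. a 784-dimensional input feeding a 512-neuron hidden layer
print(dense_layer_params(784, 512))    # 401,920 trainable parameters in one layer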
Magnitude – Hyper parameter
• The magnitude group involves the gradient, step size,
and momentum.
• Learning rate defines how quickly a network
updates its parameters.
• Low learning rate slows down the learning process
but converges smoothly.
• Larger learning rate speeds up the learning but
may not converge.
• Momentum helps to know the direction of the next
step with the knowledge of the previous steps.
• We can speed up training by increasing momentum.
• Momentum is a factor between 0.0 and 1.0, applied to
the rate of change of the weights.
• Typically, the value for momentum between 0.9
and 0.99.
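A minimal sketch (plain NumPy, illustrative values) of a single SGD-with-momentum update, where the momentum factor blends the previous step into the new one:

import numpy as np

def momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    # velocity remembers the direction of previous steps;
    # momentum in [0.0, 1.0] controls how much of it is kept
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w = np.array([0.5, -0.3])
v = np.zeros_like(w)
grad = np.array([0.2, -0.1])           # gradient from the current mini-batch
w, v = momentum_step(w, grad, v)
print(w, v)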
• Adaptive Gradient Algorithm (Adagrad) is an
algorithm for gradient-based optimization.
• AdaGrad is a technique to help find the “right”
learning rate.
• AdaGrad is monotonically decreasing and never
increases the learning rate.
• AdaGrad scales the learning rate by the square root of
the sum of squares of the history of gradients.
• AdaGrad speeds training in the beginning and slows it
appropriately toward convergence.
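A minimal sketch (plain NumPy, illustrative values) of the AdaGrad update: the learning rate is divided by the square root of the accumulated squared gradients, so the effective step size only ever shrinks:

import numpy as np

def adagrad_step(w, grad, cache, learning_rate=0.1, eps=1e-8):
    cache = cache + grad ** 2                  # history of squared gradients
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for grad in [np.array([0.5, -1.0]), np.array([0.4, -0.8])]:
    w, cache = adagrad_step(w, grad, cache)
print(w)                                       # steps shrink as the cache grows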
• RMSprop (Root Mean Square Propagation) is a very
effective, but currently unpublished adaptive learning
rate method.
• AdaDelta is a variant of AdaGrad that keeps only the
most recent history.
• Adam (adaptive moment estimation).
• Derives learning rates from estimates of first and
second moments of the gradients.
• First moment: an exponentially decaying average of the
gradients.
• Second moment: an exponentially decaying average of the
squared gradients.
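A minimal sketch (plain NumPy, standard default coefficients) of one Adam update, combining bias-corrected first- and second-moment estimates:

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, np.array([0.5, -1.0]), m, v, t=1)
print(w)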
Regularization
• Regularization is a measure taken against
overfitting.
• Overfitting: when a model describes the training set
well but cannot generalize over new inputs.
• Overfitted models have no predictive capacity for
data that they haven’t seen.
• Geoffrey Hinton described the best way to build a
neural network model:
• Cause it to overfit, and then regularize it to death.
• Regularization modifies the gradient so that it
doesn’t step in directions that lead to overfitting.
• Regularization includes :
• Dropout
• Drop Connect
• L1 penalty
• L2 penalty
• Dropout :
• Dropout is driven by randomly dropping a neuron
so that it will not contribute to the forward pass
and back propagation.
• Dropout is a mechanism used to improve the
training of neural networks by omitting a hidden
unit.
• It also speeds training.
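A minimal sketch (plain NumPy, an "inverted dropout" variant) of dropping hidden units at random during training and rescaling the survivors so the expected activation is unchanged:

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                    # no units are dropped at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob     # dropped units do not contribute

hidden = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(hidden, keep_prob=0.5))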
• DropConnect :
• DropConnect does the same thing as Dropout, but
instead of choosing a hidden unit, it mutes the
connection between two neurons.
Penalty Methods
• Regularization :
• Regularization is a way to avoid overfitting by
penalizing high-valued regression coefficients.
• Regression coefficients are used to predict the
value of an unknown variable using a known
variable.
• It reduces parameters and shrinks (simplifies) the
model.
• Regularization adds penalties to more complex models.
• The model with the lowest “overfitting” score is usually
the best choice for predictive power.
• Regularization works by biasing data towards
particular values (such as small values near zero).
• L1 regularization adds an L1 penalty equal to
the absolute value of the magnitude of coefficients.
• In other words, it limits the size of the coefficients.
• L1 can yield sparse models (i.e., models with few
coefficients); some coefficients can become zero and be
eliminated.
• Lasso regression uses this method.
• L2 regularization adds an L2 penalty equal to the
square of the magnitude of coefficients.
• L2 will not yield sparse models and all coefficients
are shrunk by the same factor (none are
eliminated).
• Ridge regression and SVMs use this method.
• Elastic nets combine the L1 & L2 methods, but add an
additional hyperparameter.
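A minimal sketch (plain NumPy; lambda_l1 and lambda_l2 are hypothetical penalty strengths) of adding L1 and L2 penalty terms to a base loss:

import numpy as np

def penalized_loss(base_loss, weights, lambda_l1=0.0, lambda_l2=0.0):
    l1 = lambda_l1 * np.sum(np.abs(weights))   # pushes some weights to exactly zero
    l2 = lambda_l2 * np.sum(weights ** 2)      # shrinks all weights toward zero
    return base_loss + l1 + l2

w = np.array([0.8, -0.05, 0.0, 2.1])
print(penalized_loss(base_loss=1.25, weights=w, lambda_l1=0.01, lambda_l2=0.001))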
Mini-batching
• Batch size always seems to affect training.
• Using a very small batch size can lead to slower
convergence of the model.
• Too small or a too large batch size can both affect
training badly.
• A batch size of 32 or 64 almost always seems like a
good option.
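A minimal sketch (plain NumPy, random toy data) of shuffling a training set and iterating over it in mini-batches of 32 for one epoch:

import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

X = np.random.randn(100, 4)
y = np.random.randint(0, 2, size=100)
for xb, yb in minibatches(X, y, batch_size=32):
    pass                                       # one gradient update per mini-batch
print(xb.shape, yb.shape)                      # the last batch may be smaller than 32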