Unit 1
The term Artificial Intelligence (AI) refers to techniques that enable a computer to emulate
human intelligence. Machine Learning (ML) is the set of AI algorithms that let a computer learn
from past data in order to make predictions about future data. An Artificial Neural Network
(ANN) is one such algorithm; it mimics the structure and behavior of the human brain and central
nervous system. An ANN is built from many artificial neurons arranged in layers. A simple
ANN is composed of input, hidden, and output layers, which accept, process, and predict the data,
respectively. Adding more hidden layers makes the ANN deeper; such a network is called a
deep neural network.
The AND logical function is a two-variable function, AND (x1, x2), with binary inputs and output.
ŷ = ϴ(w1*x1 + w2*x2 + b)
This time, we have three parameters: w1, w2, and b. For the values w1 = 1, w2 = 1, b = -1.5 we get:
AND(1, 1) = 1
AND(1, 0) = 0
AND(0, 1) = 0
AND(0, 0) = 0
OR logical function
OR (x1, x2) is also a two-variable function, and its output is one-dimensional (i.e., one number)
with two possible states (0 or 1). Therefore, we will use a perceptron with the same architecture as
the one before.
For the values w1 = 1, w2 = 1, b = -0.5, the result is:
OR(1, 1) = 1
OR(1, 0) = 1
OR(0, 1) = 1
OR(0, 0) = 0
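Both truth tables can be checked with a few lines of Python. This is a minimal sketch that assumes ϴ is the step function returning 1 when its argument is non-negative and 0 otherwise; the function names step and perceptron are illustrative and do not come from the notes.

```python
def step(z):
    """Step activation (assumed convention): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(x1, x2, w1, w2, b):
    """Single perceptron: y_hat = step(w1*x1 + w2*x2 + b)."""
    return step(w1 * x1 + w2 * x2 + b)


def AND(x1, x2):
    # Weights and bias quoted in the text for the AND gate.
    return perceptron(x1, x2, w1=1, w2=1, b=-1.5)


def OR(x1, x2):
    # Weights and bias quoted in the text for the OR gate.
    return perceptron(x1, x2, w1=1, w2=1, b=-0.5)


if __name__ == "__main__":
    for x1, x2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        print(f"AND({x1}, {x2}) = {AND(x1, x2)}   OR({x1}, {x2}) = {OR(x1, x2)}")
```

Only the bias differs between the two gates: it shifts the decision line so that one active input is enough for OR but not for AND.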
human error. Therefore, an algorithm which automates this process has been developed. The
The XOR function, whose truth table is given above, can also be represented using the same
fundamental functions, as shown below.
But there exists no combination of w1, w2 and b that realizes the XOR function. Minsky
and Papert showed in their famous book Perceptrons that a single perceptron cannot
represent a simple function that is not linearly separable, such as XOR, whereas a Multi-Layer
Perceptron (MLP) can. Although this finding discouraged people from applying AI to real-world
applications, the introduction of the Back-Propagation Algorithm (BPA) in 1970 quickened its
growth again.
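The claim can be illustrated, though not proved, with a small search. The sketch below tries every (w1, w2, b) on a coarse grid and finds none that reproduces XOR, then builds XOR from two hidden perceptrons. The grid, the NAND weights (-1, -1, 1.5) and the step convention (1 when the argument is non-negative) are assumptions chosen for this illustration.

```python
from itertools import product

XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    """Step activation (assumed convention): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def perceptron(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)


def realizes_xor(w1, w2, b):
    """True if a single perceptron with these parameters matches the XOR table."""
    return all(perceptron(x1, x2, w1, w2, b) == out
               for (x1, x2), out in XOR_TABLE.items())


def search_single_perceptron(values):
    """Return the first (w1, w2, b) on the grid that realizes XOR, or None."""
    for w1, w2, b in product(values, repeat=3):
        if realizes_xor(w1, w2, b):
            return (w1, w2, b)
    return None


def xor_mlp(x1, x2):
    """Two-layer network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))."""
    h_or = perceptron(x1, x2, 1, 1, -0.5)      # OR weights from the text
    h_nand = perceptron(x1, x2, -1, -1, 1.5)   # NAND weights: an assumption
    return perceptron(h_or, h_nand, 1, 1, -1.5)  # AND weights from the text


GRID = [i * 0.25 for i in range(-12, 13)]  # -3.0 to 3.0 in steps of 0.25

if __name__ == "__main__":
    print("single perceptron found:", search_single_perceptron(GRID))
    for (x1, x2) in XOR_TABLE:
        print(f"XOR({x1}, {x2}) = {xor_mlp(x1, x2)}")
```

The hidden layer is what makes the difference: each hidden perceptron draws one line, and the output perceptron combines the two half-planes.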
This optimal set of weights is obtained with the help of the back-propagation algorithm. MLP
training consists of two phases, called Feedforward and Back-Propagation. The first phase feeds
the weighted features through the MLP neurons in the forward direction. The prediction obtained
from these input features is then compared with the expected output, called the ground truth value,
using the loss function. The error obtained at this stage is then propagated back toward the input
layer using the Back-Propagation algorithm. While propagating back, BPA updates the weights of
the neurons in proportion to their contribution to this error. The Feedforward process is then
carried out again with the updated weights. This cycle is repeated until training converges, that
is, until the error reaches an acceptably small value.
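The feedforward and back-propagation cycle can be sketched in NumPy on the XOR data. Everything specific here is an assumption made for illustration: one hidden layer of four sigmoid neurons, a mean squared error loss, a learning rate of 0.5, a fixed number of epochs instead of a convergence test, and the function names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_xor_mlp(hidden_units=4, learning_rate=0.5, epochs=5000, seed=0):
    """Train a 2-hidden_units-1 MLP on XOR; return the loss history and weights."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)  # ground truth values

    W1 = rng.normal(0.0, 1.0, size=(2, hidden_units))
    b1 = np.zeros((1, hidden_units))
    W2 = rng.normal(0.0, 1.0, size=(hidden_units, 1))
    b2 = np.zeros((1, 1))

    losses = []
    for _ in range(epochs):
        # Feedforward: push the inputs through the network.
        h = sigmoid(X @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        losses.append(float(np.mean((y_hat - y) ** 2)))  # loss vs. ground truth

        # Back-propagation: chain rule from the output back toward the input.
        d_out = 2.0 * (y_hat - y) / len(X) * y_hat * (1.0 - y_hat)
        grad_W2 = h.T @ d_out
        grad_b2 = d_out.sum(axis=0, keepdims=True)
        d_hidden = (d_out @ W2.T) * h * (1.0 - h)
        grad_W1 = X.T @ d_hidden
        grad_b1 = d_hidden.sum(axis=0, keepdims=True)

        # Weight update; the next feedforward pass uses the new weights.
        W2 -= learning_rate * grad_W2
        b2 -= learning_rate * grad_b2
        W1 -= learning_rate * grad_W1
        b1 -= learning_rate * grad_b1

    return losses, (W1, b1, W2, b2)


losses, params = train_xor_mlp()

if __name__ == "__main__":
    print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

How low the final loss gets depends on the random initialization, so the sketch reports the loss rather than promising a perfect XOR fit.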
Gradient Descent Algorithm
The back-propagation algorithm is implemented together with the gradient descent algorithm.
Gradient descent updates each parameter of a node by subtracting the gradient of the loss function
with respect to that parameter, scaled by the learning rate.
The Gradient Descent Algorithm consists of three modules:
➢ Forward Propagation
➢ Backward Propagation
➢ Parameter Update
The gradient ∇𝑊𝑘 is very large on steep slopes and very small on gentle slopes. The updates
therefore become tiny on gentle slopes, so the algorithm spends many iterations crossing such
regions, which leads to slower convergence.
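The three modules can be made concrete on the smallest possible model, a line y = w*x + b fitted with a mean squared error loss. This is a sketch under those assumptions; the data, the learning rate and the function names are illustrative only.

```python
import numpy as np


def forward(x, w, b):
    """Forward propagation: compute the predictions."""
    return w * x + b


def backward(x, y, y_hat):
    """Backward propagation: gradients of the MSE loss w.r.t. w and b."""
    error = y_hat - y
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    return grad_w, grad_b


def update(w, b, grad_w, grad_b, learning_rate):
    """Parameter update: step against the gradient."""
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def fit(x, y, learning_rate=0.1, steps=500):
    w, b = 0.0, 0.0
    for _ in range(steps):
        y_hat = forward(x, w, b)
        grad_w, grad_b = backward(x, y, y_hat)
        w, b = update(w, b, grad_w, grad_b, learning_rate)
    return w, b


# Synthetic, noise-free data generated from y = 3x + 2.
x = np.linspace(-1.0, 1.0, 20)
y = 3.0 * x + 2.0
w, b = fit(x, y)

if __name__ == "__main__":
    print(f"w = {w:.4f}, b = {b:.4f}")
```

In this example the loss surface is gentler along w than along b, so w approaches its target more slowly, which is the slow-convergence effect described above.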
Learning Rate
The learning rate is arguably the most important hyperparameter. In general, the optimal learning
rate is about half of the maximum learning rate (i.e., the learning rate above which the training
algorithm diverges).
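A toy case shows where a maximum learning rate comes from. For the loss f(w) = w², each gradient descent step multiplies w by (1 - 2η), so training diverges for η above 1.0 and converges fastest at η = 0.5, half of that maximum. The quadratic loss is an assumption chosen because it can be analysed by hand; real networks follow the rule of thumb only approximately.

```python
def gradient_descent_1d(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w              # derivative of w**2
        w -= learning_rate * grad   # w becomes w * (1 - 2 * learning_rate)
    return w


if __name__ == "__main__":
    for lr in (0.1, 0.5, 0.9, 1.1):
        print(f"learning rate {lr}: final w = {gradient_descent_1d(lr):.6g}")
```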
Optimizers
Choosing a better optimizer than plain old Mini-batch Gradient Descent (and tuning its
hyperparameters) is also quite important.
Batch size
The batch size can also have a significant impact on your model’s performance and the training
time. In general, the optimal batch size will be lower than 32. A small batch size ensures that each
training iteration is very fast, and although a large batch size will give a more precise estimate of
the gradients, in practice this does not matter much since the optimization landscape is quite
complex and the direction of the true gradient does not point precisely in the direction of the
optimum.
A few Python libraries that can be used to optimize hyperparameters:
• Hyperopt: a popular Python library for optimizing over all sorts of complex search spaces
(including real values such as the learning rate, or discrete values such as the number of layers).
• Hyperas, kopt or Talos: libraries for optimizing the hyperparameters of Keras models (the first
two are based on Hyperopt).
• Scikit-Optimize (skopt): a general-purpose optimization library. The BayesSearchCV class
performs Bayesian optimization using an interface similar to GridSearchCV.
• Spearmint: a Bayesian optimization library.
• Sklearn-Deap: a hyperparameter optimization library based on evolutionary algorithms, also
with a GridSearchCV-like interface.
Solutions:
The following techniques are used in practice to avoid the vanishing and exploding gradient
problems.
Djork-Arné Clevert et al. proposed a new activation function called the exponential linear unit
(ELU) that outperformed all the ReLU variants in their experiments: training time was reduced,
and the neural network performed better on the test set.
The main drawback of the ELU activation function is that it is slower to compute than the ReLU
and its variants (due to the use of the exponential function), but during training this is compensated
by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU
network.
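The formula itself is not reproduced in these notes. As a sketch, the standard definition is ELU(z) = z for z > 0 and α(exp(z) - 1) otherwise, where α is a hyperparameter assumed here to be 1.0. The exponential on the negative side is the extra cost mentioned above.

```python
import numpy as np


def elu(z, alpha=1.0):
    """Exponential linear unit: z if z > 0, else alpha * (exp(z) - 1)."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps expm1 from overflowing on large positive inputs,
    # whose result is discarded by np.where anyway.
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))


def relu(z):
    """ReLU, for comparison: no exponential is needed."""
    return np.maximum(np.asarray(z, dtype=float), 0.0)


if __name__ == "__main__":
    z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print("ELU :", elu(z))
    print("ReLU:", relu(z))
```

Unlike ReLU, ELU returns non-zero values and has a non-zero gradient for negative inputs.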
Batch Normalization
Although using He initialization along with ELU (or any variant of ReLU) can significantly
reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t
guarantee that they won’t come back during training.
In 2015, Sergey Ioffe and Christian Szegedy proposed a technique called Batch
Normalization (BN) to address the vanishing/exploding gradients problems. The technique
consists of adding an operation in the model just before or after the activation function of each
hidden layer, simply zero-centering and normalizing each input, then scaling and shifting the result
using two new parameter vectors per layer: one for scaling, the other for shifting. This operation
lets the model learn the optimal scale and mean of each of the layer’s inputs. It does so by
evaluating the mean and standard deviation of each input over the current mini-batch.
Batch Normalization algorithm
• μB is the vector of input means, evaluated over the whole mini-batch B (it contains one mean per
input).
• σB is the vector of input standard deviations, also evaluated over the whole mini-batch (it
contains one standard deviation per input).
• mB is the number of instances in the mini-batch.
• x̂(i) is the vector of zero-centered and normalized inputs for instance i.
• γ is the output scale parameter vector for the layer (it contains one scale parameter per input).
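The training-time operation described by these symbols can be sketched in NumPy. Two names below are assumptions that the list above does not define: beta, the shift parameter vector (the text only says one vector is "for shifting"), and eps, a small constant added to the variance to avoid division by zero.

```python
import numpy as np


def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch Normalization over one mini-batch X of shape (m_B, n_inputs)."""
    mu_B = X.mean(axis=0)                            # one mean per input
    var_B = X.var(axis=0)                            # sigma_B squared, one per input
    X_hat = (X - mu_B) / np.sqrt(var_B + eps)        # zero-centered and normalized
    Z = gamma * X_hat + beta                         # scaled and shifted output
    return Z, mu_B, np.sqrt(var_B)


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 3))     # mini-batch with m_B = 64
gamma = np.array([1.0, 2.0, 0.5])
beta = np.array([0.0, -1.0, 4.0])
Z, mu_B, sigma_B = batch_norm_train(X, gamma, beta)

if __name__ == "__main__":
    print("output mean per input:", Z.mean(axis=0))
    print("output std per input: ", Z.std(axis=0))
```

After the operation, each input has a mean close to beta and a standard deviation close to gamma, whatever the statistics of the raw mini-batch were.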
Gradient Clipping
Another popular technique to lessen the exploding gradients problem is to simply clip the gradients
during backpropagation so that they never exceed some threshold. This is called Gradient Clipping.
In Keras, implementing Gradient Clipping is just a matter of setting the clipvalue or clipnorm
argument when creating an optimizer. For example, setting clipvalue to 1.0 will clip every
component of the gradient vector to a value between –1.0 and 1.0. Note that this may change the
orientation of the gradient vector: for example, if the original gradient vector is
[0.9, 100.0], it points mostly in the direction of the second axis, but once you clip it by value, you
get [0.9, 1.0], which points roughly along the diagonal between the two axes. In practice, however,
this approach works well. If you want to ensure that Gradient Clipping does not change the direction
of the gradient vector, you should clip by norm by setting clipnorm instead of clipvalue.
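The difference between the two modes can be reproduced with the [0.9, 100.0] example. The NumPy sketch below only illustrates the arithmetic and is not the Keras implementation; the helper names and the threshold of 1.0 are assumptions.

```python
import numpy as np


def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)


def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

if __name__ == "__main__":
    print("clipped by value:", by_value, "cosine:", cosine(grad, by_value))
    print("clipped by norm: ", by_norm, "cosine:", cosine(grad, by_norm))
```

Clipping by value turns the vector toward the diagonal, while clipping by norm shortens it without rotating it.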
Note that model_A and model_B_on_A now share some layers. When you train model_B_on_A,
it will also affect model_A. If you want to avoid that, you need to clone model_A before you reuse
its layers. To do this, you must clone model_A’s architecture and then copy its weights.
Now you can train model_B_on_A to perform task B, but since the new output layer
was initialized randomly, it will make large errors, at least during the first few epochs, so there will
be large error gradients that may wreck the reused weights.
To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the
new layer some time to learn reasonable weights. To do this, simply set every reused layer’s
trainable attribute to False and compile the model.
Next, we can train the model for a few epochs, then unfreeze the reused layers (which requires
compiling the model again) and continue training to fine-tune the reused layers for task B.
After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again
to avoid damaging the reused weights.
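The whole sequence (clone, reuse, freeze, train, unfreeze, fine-tune) can be sketched with tf.keras as follows. The layer sizes, the random data, the losses and the learning rates are placeholders invented for this sketch, and model_A here is an untrained stand-in for a model that would already have been trained on task A.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for a model already trained on task A.
model_A = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(16, activation="elu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
weights_A_before = [w.copy() for w in model_A.get_weights()]

# Clone the architecture, then copy the weights, so model_A stays untouched.
model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Reuse every layer except the output layer and add a new output for task B.
model_B_on_A = tf.keras.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Freeze the reused layers; compiling is required for the change to apply.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2))

# Placeholder task-B data.
rng = np.random.default_rng(0)
X_B = rng.normal(size=(32, 8)).astype("float32")
y_B = (X_B[:, :1] > 0).astype("float32")
model_B_on_A.fit(X_B, y_B, epochs=2, verbose=0)

# Unfreeze, recompile with a lower learning rate, and fine-tune.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3))
model_B_on_A.fit(X_B, y_B, epochs=2, verbose=0)
```

Because the reused layers come from the clone, the weights of model_A are the same after fine-tuning as before.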
Unsupervised Pretraining
If you must tackle a task for which little labeled data is available, and no model trained on a
similar task is available either, you can use unsupervised pretraining. If you can gather
plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest
layer and then going up, using an unsupervised feature detector algorithm such as Restricted
Boltzmann Machines (RBMs) or autoencoders.
In unsupervised pretraining, each layer is trained on the output of the previously trained layers (all
layers except the one being trained are frozen). Once all layers have been trained this way, you can
add the output layer for your task, and fine-tune the final network using supervised learning (i.e.,
with the labeled training examples). At this point, you can unfreeze all the pretrained layers, or just
some of the upper ones.
Figure: Unsupervised pretraining
The ⊘ symbol represents element-wise division, and ϵ is a smoothing term to avoid division
by zero, typically set to 10⁻¹⁰.
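The update rule these symbols belong to is missing from the notes. Assuming it is an AdaGrad-style update, in which s accumulates the squared gradients and the step is η g ⊘ √(s + ϵ), one step can be sketched as follows. The function name, the learning rate and the example loss are assumptions.

```python
import numpy as np


def adagrad_step(theta, grad, s, learning_rate=0.5, eps=1e-10):
    """One AdaGrad-style update (assumed form; the notes omit the equations)."""
    s = s + grad * grad                                       # accumulate squared gradients
    theta = theta - learning_rate * grad / np.sqrt(s + eps)   # element-wise division
    return theta, s


# Example loss f(theta) = theta_1**2 + 10 * theta_2**2, whose gradient is
# ten times steeper along the second axis.
theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([2.0, 20.0]) * theta
theta, s = adagrad_step(theta, grad, s)

if __name__ == "__main__":
    print("theta after one step:", theta)
```

The element-wise division rescales each parameter by its own gradient history, so both parameters move by about the same amount on the first step even though one gradient is ten times larger. The smoothing term keeps the division finite when a parameter has not yet received any gradient.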
RMSProp
• Early Stopping
• Batch Normalization
• ℓ1 and ℓ2 Regularization
• Dropout
• Max-Norm Regularization
Early Stopping
A very different way to regularize iterative learning algorithms such as Gradient Descent
is to stop training as soon as the validation error reaches a minimum. This is called early
stopping.
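Because the validation error is noisy, one cannot know that a minimum was reached at the moment it happens. A common workaround, assumed here rather than taken from the notes, is to keep the best epoch seen so far and stop once the error has failed to improve for a given number of epochs (the patience). The function name and the sample error values are illustrative.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops at the first epoch that is `patience` or more epochs past
    the best one without improving on it; if that never happens, it stops at
    the last epoch.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_errors) - 1


if __name__ == "__main__":
    errors = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]
    best, stop = early_stopping_epoch(errors, patience=2)
    print(f"best epoch: {best}, stopped at epoch: {stop}")
```

In a real training loop the model weights from the best epoch would be saved and restored after stopping.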
ℓ1 and ℓ2 Regularization
➢ L1 Regularization, the penalty used in lasso regression, adds the sum of the absolute values
of the weights as a penalty term to the loss function. Because it penalizes the absolute
value of each weight, it tends to drive some weights to exactly zero, which yields a sparse
solution.
➢ L2 Regularization, the penalty used in ridge regression, adds the sum of the squared
weights as the penalty term to the loss function. L2 regularization returns a non-sparse
solution, since the weights will be non-zero (although some may be close to 0). A major
snag to consider when using L2 regularization is that it is not robust to outliers: the squared
terms blow up the differences in the error of the outliers, and the regularization
then attempts to fix this by penalizing the weights.
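The two penalty terms can be computed directly. This is a minimal sketch in which the weight values, the regularization factors and the function names are illustrative; the penalty is simply added to the data loss.

```python
import numpy as np


def l1_penalty(weights, factor):
    """L1 penalty: factor * sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))


def l2_penalty(weights, factor):
    """L2 penalty: factor * sum of squared weight values."""
    return factor * np.sum(np.square(weights))


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss = data loss + L1 penalty + L2 penalty."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


weights = np.array([0.5, -1.0, 0.0, 2.0])

if __name__ == "__main__":
    print("L1 penalty:", l1_penalty(weights, 0.01))
    print("L2 penalty:", l2_penalty(weights, 0.01))
    print("total loss:", regularized_loss(1.0, weights, l2=0.01))
```

The largest weight (2.0) contributes 4.0 of the 5.25 squared sum but only 2.0 of the 3.5 absolute sum, which is why L2 punishes large weights more strongly than L1.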
In Keras, ℓ2 regularization can be applied to a layer’s connection weights, here with a
regularization factor of 0.01, by passing kernel_regularizer=keras.regularizers.l2(0.01) when
creating the layer.
Dropout
Dropout is one of the most popular regularization techniques for deep neural networks. It is a
fairly simple algorithm: at every training step, every neuron (including the input neurons, but
always excluding the output neurons) has a probability p of being temporarily “dropped out,”
meaning it will be entirely ignored during this training step, but it may be active during the next
step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set to
50%. After training, neurons don’t get dropped anymore.
To implement dropout using Keras, you can use the keras.layers.Dropout layer. During training, it
randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep
probability. After training, it does nothing at all; it just passes the inputs to the next layer. For
example, adding a keras.layers.Dropout(rate=0.2) layer before every Dense layer applies dropout
regularization with a dropout rate of 0.2.
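What such a layer does to its inputs can be sketched in NumPy. This illustrates the arithmetic only and is not the Keras implementation; the function name and the explicit random generator argument are assumptions.

```python
import numpy as np


def dropout(inputs, rate, training, rng):
    """Drop each input with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return inputs                            # after training: pass through
    keep_prob = 1.0 - rate
    mask = rng.random(inputs.shape) < keep_prob  # True where the input is kept
    return inputs * mask / keep_prob             # divide by the keep probability


rng = np.random.default_rng(42)
x = np.ones(10_000)
train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)

if __name__ == "__main__":
    print("fraction dropped:", np.mean(train_out == 0.0))
    print("mean during training:", train_out.mean())
    print("mean after training: ", test_out.mean())
```

Dividing by the keep probability keeps the expected value of each input unchanged, so the next layer sees signals of the same average size during and after training.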
Max-Norm Regularization
Another regularization technique that is quite popular for neural networks is called max-norm
regularization: for each neuron, it constrains the weights w of the incoming connections such that
∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥ · ∥₂ is the ℓ2 norm. Max-norm
regularization does not add a regularization loss term to the overall loss function. Instead, it is
typically implemented by computing ∥w∥₂ after each training step and clipping w if needed
(w ← w · r / ∥w∥₂).
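The clipping step can be sketched in NumPy. The sketch assumes a weight matrix of shape (inputs, neurons), so that each column holds the incoming weights of one neuron; the function name and the example values are illustrative.

```python
import numpy as np


def max_norm_clip(W, r):
    """Rescale each column of W whose l2 norm exceeds r so that its norm equals r."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)         # one norm per neuron
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))    # 1.0 where norm <= r
    return W * scale


# Two neurons: incoming weight norms are 5.0 and 0.5.
W = np.array([[3.0, 0.3],
              [4.0, 0.4]])
W_clipped = max_norm_clip(W, r=1.0)

if __name__ == "__main__":
    print(W_clipped)
    print("norms after clipping:", np.linalg.norm(W_clipped, axis=0))
```

Only the first neuron is rescaled; the second already satisfies the constraint and is left unchanged. In Keras the same constraint is available as keras.constraints.max_norm, passed to a layer through its kernel_constraint argument.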