Deep Learning Algorithms and Architectures
May 1, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2912200
ABSTRACT Deep learning (DL) is playing an increasingly important role in our lives. It has already made a huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning, classification, and pattern recognition systems are not scalable for large-sized data sets. In many cases, depending on the problem complexity, DL can also overcome the limitations of earlier shallow networks that prevented efficient training and abstractions of hierarchical representations of multi-dimensional training data. A deep neural network (DNN) uses multiple (deep) layers of units with highly optimized algorithms and architectures. This paper reviews several optimization methods that improve the accuracy of training and reduce training time. We delve into the math behind the training algorithms used in recent deep networks. We describe current shortcomings, enhancements, and implementations. The review also covers different types of deep architectures, such as deep convolution networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and others.
INDEX TERMS Machine learning algorithm, optimization, artificial intelligence, deep neural network
architectures, convolution neural network, backpropagation, supervised and unsupervised learning.
DNNs are implemented in the following popular ways:
1. Sparse Autoencoders
2. Convolution Neural Networks (CNNs or ConvNets)
3. Restricted Boltzmann Machines (RBMs)
4. Long Short-Term Memory (LSTM)

Autoencoders are neural networks that learn features or encodings from a given dataset in order to perform dimensionality reduction. The Sparse Autoencoder is a variation of the Autoencoder in which some of the units output a value close to zero or are inactive and do not fire. A deep CNN uses multiple layers of unit collections that interact with the input (pixel values in the case of an image) and result in the desired feature extraction. CNNs find applications in image recognition, recommender systems and NLP. RBMs are used to learn the probability distribution within a data set.

All these networks use backpropagation for training. Backpropagation uses gradient descent for error reduction, adjusting the weights based on the partial derivative of the error with respect to each weight.

Neural network models can also be divided into the following two distinct categories:
1. Discriminative
2. Generative

A discriminative model is a bottom-up approach in which data flows from the input layer via the hidden layers to the output layer. Discriminative models are used in supervised training for problems like classification and regression. Generative models, on the other hand, are top-down, and data flows in the opposite direction. They are used in unsupervised pre-training and probabilistic distribution problems. If the input x and the corresponding label y are given, a discriminative model learns the probability distribution p(y|x), i.e., the probability of y given x directly, whereas a generative model learns the joint probability p(x, y), from which p(y|x) can be predicted [20]. In general,
whenever labeled data is available, discriminative approaches are undertaken as they provide effective training, and when labeled data is not available, a generative approach can be taken [21].

Training can be broadly categorized into three types:
1. Supervised
2. Unsupervised
3. Semi-supervised

Supervised learning uses labeled data to train the network, whereas in unsupervised learning there is no labeled data set, and thus no learning based on feedback. In unsupervised learning, neural networks are pre-trained using generative models such as RBMs and can later be fine-tuned using standard supervised learning algorithms. The network is then used on a test data set to determine patterns or classifications. Big data has pushed the envelope even further for deep learning with its sheer volume and variety of data. Contrary to our intuitive inclination, there is no clear consensus on whether supervised learning is better than unsupervised learning. Both have their merits and use cases. Reference [22] demonstrated enhanced results with unsupervised learning using unstructured video sequences for camera motion estimation and monocular depth. Modified neural networks such as the Deep Belief Network (DBN), as described by Chen and Lin [23], use both labeled and unlabeled data with supervised and unsupervised learning, respectively, to improve performance. Developing a way to automatically extract meaningful features from a labeled and unlabeled high-dimensional data space is challenging. Yann LeCun et al. assert that one way we could achieve this would be to utilize and integrate both unsupervised and supervised learning [24]. Complementing unsupervised learning (with unlabeled data) with supervised learning (with labeled data) is referred to as semi-supervised learning.

DNNs and training algorithms have to overcome two major challenges: premature convergence and overfitting. Premature convergence occurs when the weights and biases of the DNN settle into a state that is only optimal at a local level and misses out on the global minima of the entire multi-dimensional space. Overfitting, on the other hand, describes a state in which DNNs become so highly tailored to a given training data set, at a fine-grained level, that they become unfit, rigid and less adaptable for any other test data set.

Along with different types of training, algorithms and architectures, we also have different machine learning frameworks (Table 1) and libraries that have made training models easier. These frameworks make complex mathematical functions, training algorithms and statistical modeling available without having to write them on your own. Some provide distributed and parallel processing capabilities, and convenient development and deployment features. Figure 3 shows a graph with various deep learning libraries along with their Github stars from 2015-2018. Github is the largest hosting service provider of source code in the world [25]. Github stars are indicative of how popular a project is on Github. TensorFlow is the most popular DL library.

TABLE 1. Popular deep learning frameworks and libraries.

III. DNN ARCHITECTURES
A deep neural network consists of several layers of nodes. Different architectures have been developed to solve problems in different domains or use-cases. E.g., CNN is used most of the time in computer vision and image recognition, and RNN is commonly used in time series problems/forecasting. On the other hand, there is no clear winner for general problems like classification, as the choice of architecture can depend on multiple factors. Nonetheless, [27] evaluated 179 classifiers and concluded that parallel random forest (parRF_t), which is essentially a parallel implementation of a variation of decision trees, performed the best. Below are four of the most common architectures of deep neural networks.
1. Convolution Neural Network
2. Autoencoder
3. Restricted Boltzmann Machine (RBM)
4. Long Short-Term Memory (LSTM)

A. CONVOLUTION NEURAL NETWORK
CNN is based on the human visual cortex and is the neural network of choice for computer vision (image recognition) and video recognition. It is also used in other areas such as NLP, drug discovery, etc. As shown in Figure 4, a CNN consists of a series of convolution and sub-sampling layers followed by a fully connected layer and a normalizing (e.g., softmax function) layer. Figure 4 illustrates the well-known 7-layered LeNet-5 CNN architecture devised by LeCun et al. [28] for digit recognition. The series of multiple convolution layers performs progressively more refined feature extraction at every layer, moving from the input to the output layers. Fully connected layers that perform classification follow the convolution layers. Sub-sampling or pooling layers are often inserted between the convolution layers.

A CNN takes a 2D n × n pixelated image as input. Each layer consists of groups of 2D neurons called filters or kernels. Unlike other neural networks, neurons in each feature extraction layer of a CNN are not connected to all neurons in the adjacent layers. Instead, they are only connected to spatially mapped, fixed-sized and partially overlapping neurons in the previous layer's input image or feature map. This region in the input is called the local receptive field. The lowered number of connections reduces training time and the chances of overfitting. All neurons in a filter are connected to the same number of neurons in the previous input layer (or feature map) and are constrained to have the same sequence of weights and biases. These factors speed up the learning and reduce the memory requirements for the network. Thus, each neuron in a specific filter looks for the same pattern but in different parts of the input image. Sub-sampling layers reduce the size of the network. In addition, along with local receptive fields and shared weights (within the same filter), sub-sampling effectively reduces the network's susceptibility to shifts, scale and distortions of images [29]. Max/mean pooling or local averaging filters are often used to achieve sub-sampling. The final layers of a CNN are responsible for the actual classification, where the neurons between the layers are fully connected. A deep CNN can be implemented with multiple series of weight-sharing convolution layers and sub-sampling layers. The deep nature of the CNN results in high-quality representations while maintaining locality, reduced parameters and invariance to minor variations in the input image [30].

In most cases, backpropagation alone is used for training all the parameters (weights and biases) in a CNN; the cost function with respect to an individual training example (x, y) is the squared-error cost given later in Equation (23).
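To make the layer structure described above concrete, the following is a minimal sketch of a LeNet-style CNN. It assumes PyTorch as the framework and uses illustrative layer sizes; it is not the exact LeNet-5 configuration of [28].

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Minimal LeNet-style CNN: convolution + pooling blocks followed by
    fully connected classification layers (illustrative sizes only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # local receptive fields, shared weights
            nn.ReLU(),
            nn.MaxPool2d(2),                  # sub-sampling / pooling layer
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),       # fully connected layers perform classification
            nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of 2D n x n (28 x 28) single-channel images.
logits = SmallConvNet()(torch.randn(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])
```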
2. Inception:
   a. Deep CNN developed by Google
3. ResNet:
   a. Very deep residual network developed by Microsoft. It won 1st place in the ILSVRC 2015 competition on the ImageNet dataset.
4. VGG:
   a. Very deep CNN developed for large-scale image recognition
5. DCGAN:
   a. Deep convolutional generative adversarial networks proposed by [33]. They are used in unsupervised learning of hierarchies of feature representations in input objects.

B. AUTOENCODER
An autoencoder is a neural network that uses an unsupervised algorithm and learns the representation of the input data set for dimensionality reduction and for recreating the original data set. The learning algorithm is based on the implementation of backpropagation.

Autoencoders extend the idea of principal component analysis (PCA). As shown in Figure 5, PCA transforms multi-dimensional data into a linear representation; Figure 5 demonstrates how 2D input data can be reduced to a linear vector using PCA. Autoencoders, on the other hand, can go further and produce nonlinear representations. PCA determines a set of linear variables in the directions with the largest variance. The p-dimensional input data points are represented as m orthogonal directions, such that m ≤ p, which constitutes a lower-dimensional space. The original data points are projected onto the principal directions, thus omitting information in the corresponding orthogonal directions. PCA focuses more on the variances rather than covariances and correlations, and it looks for the linear function with the most variance [34]. The goal is to determine the direction with the least mean square error, which would then have the least reconstruction error.

Autoencoders use encoder and decoder blocks of non-linear hidden layers to generalize PCA, performing dimensionality reduction and eventual reconstruction of the original data. They use greedy layer-by-layer unsupervised pre-training and fine-tuning with backpropagation [35]. Despite using backpropagation, which is mostly used in supervised training, autoencoders are considered unsupervised DNNs because they regenerate the input x(i) itself instead of a different set of target values y(i), i.e., y(i) = x(i). Hinton et al. were able to achieve a near perfect reconstruction of 784-pixel images using an autoencoder, proving that it is far better than PCA [36].

While performing dimensionality reduction, autoencoders come up with interesting representations of the input vector in the hidden layer. This is often attributed to the smaller number of nodes in the hidden layer, or in every second layer of the two-layer blocks. But even if there is a higher number of nodes in the hidden layer, a sparsity constraint can be enforced on the hidden units to retain interesting lower-dimensional representations of the inputs. To achieve sparsity, some nodes are restricted from firing, i.e., their output is set to a value close to zero.

FIGURE 6. Training stages in autoencoder [36].

Figure 6 shows single-layer feature detector blocks of RBMs used in pre-training, which is followed by fine-tuning with backpropagation.
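As a concrete illustration of the encoder/decoder blocks and the reconstruction objective y(i) = x(i), here is a minimal PyTorch sketch added for clarity; the 784-128-32 layer sizes and the training settings are arbitrary choices, not the architecture of [36].

```python
import torch
import torch.nn as nn

# Encoder compresses the 784-dimensional input to a small code; the decoder
# reconstructs the input, so the target is the input itself (y = x).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()  # reconstruction (least-mean-square) error

x = torch.rand(64, 784)          # stand-in for a mini-batch of 784-pixel images
for step in range(100):
    x_hat = decoder(encoder(x))  # forward pass: encode then decode
    loss = loss_fn(x_hat, x)     # compare reconstruction against the input itself
    optimizer.zero_grad()
    loss.backward()              # fine-tuning with backpropagation
    optimizer.step()
```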
The goal of the learning algorithm is to find the optimal values for the weight vectors to solve a class of problems in a domain.

Some of the well-known training algorithms are:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Momentum
4. Levenberg-Marquardt algorithm
5. Backpropagation through time

A. GRADIENT DESCENT
Gradient descent (GD) is the underlying idea in most machine learning and deep learning algorithms. It is based on the concept of Newton's method for finding the roots (or zero values) of a 2D function. To achieve this, we randomly pick a point on the curve and slide to the right or left along the x-axis, based on the negative or positive value of the derivative (slope) of the function at the chosen point, until the value on the y-axis, i.e., the function f(x), becomes zero. The same idea is used in gradient descent, where we traverse or descend along a certain path in a multi-dimensional weight space as long as the cost function keeps decreasing, and stop once the error rate ceases to decrease. Newton's method is prone to getting stuck in local minima if the derivative of the function at the current point is zero. Likewise, this risk is also present when using gradient descent on a non-convex function. In fact, the impact is amplified in the multi-dimensional (each dimension represents a weight variable) and multi-layer landscape of a DNN, and it can result in a sub-optimal set of weights.
The cost function is one half of the square of the difference between the desired output and the current output, as shown below:

C = 1/2 (yexpected − yactual)²    (23)

The backpropagation methodology uses gradient descent. In backpropagation, the chain rule and partial derivatives are employed to determine the error delta for any change in the value of each weight. The individual weights are then adjusted to reduce the cost function after every learning iteration over the training data set, resulting in a final multi-dimensional (multi-weight) landscape of weight values [6]. We process all the samples in the training dataset before applying the updates to the weights. This process is repeated until the objective (a.k.a. cost function) does not reduce any further.

FIGURE 10. Error calculation in multilayer neural network [6].

Figure 10 shows the error derivatives in relation to the outputs in each hidden layer, which is the weighted summation of the error derivatives in relation to the inputs of the units in the layer above. E.g., once ∂E/∂zk is calculated, the partial error derivative with respect to wjk is equal to yj ∂E/∂zk.
Figure 10 shows the error derivatives in relation to outputs ∂ 2 ∂
x = 2(y − ŷ (p))T W y − ŷ (p)
in each hidden layer, which is the weighted summation of (29)
∂p ∂p
the error derivates in relation to the inputs in the unit in the
∂ ŷ (p)
above layer. E.g., when ∂E/∂zk calculated, the partial error = 2(y − ŷ (p))T W (30)
derivative with respect to wjk to is equal to yj ∂E/∂zk . ∂p
T
= 2(yŷ) W J (31)
hgd = αJT W y − ŷ
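A minimal sketch of mini-batch SGD on a linear least-squares problem (a NumPy illustration with arbitrary batch size and learning rate) shows the update being applied after every mini-batch rather than after a full pass over the data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))               # training set
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(idx)   # gradient on the mini-batch only
        w -= lr * grad                                  # update after every mini-batch
print(np.round(w, 2))                        # close to true_w
```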
C. MOMENTUM
In standard SGD, the learning rate is used as a fixed multiplier of the gradient to compute the step size, or update, to the weights. This can cause the update to overshoot a potential minimum if the gradient is too steep, or delay convergence if the gradient is noisy. Using the concept of momentum from physics, the momentum algorithm introduces a velocity variable v that is configured as an exponentially decaying average of the gradient [48]. This helps prevent costly descent in the wrong direction. In the equations below, α ∈ [0, 1) is the momentum parameter, ε is the learning rate, and g is the gradient.

Velocity update: v ← αv − εg    (24)

Actual update: θ ← θ + v    (25)
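Equations (24) and (25) translate directly into code; the following NumPy sketch applies them to a toy quadratic cost (the momentum parameter and learning rate are illustrative values).

```python
import numpy as np

def momentum_step(theta, velocity, grad, alpha=0.9, epsilon=0.01):
    """One momentum update: v <- alpha*v - epsilon*g, theta <- theta + v (Eqs. 24-25)."""
    velocity = alpha * velocity - epsilon * grad
    theta = theta + velocity
    return theta, velocity

theta, v = np.array([5.0, -4.0]), np.zeros(2)
for step in range(100):
    g = 2.0 * theta                      # gradient of the toy cost C = ||theta||^2
    theta, v = momentum_step(theta, v, g)
print(theta)                             # approaches the minimum at the origin
```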
D. LEVENBERG-MARQUARDT ALGORITHM
The Levenberg-Marquardt algorithm (LMA) is primarily used in solving non-linear least squares problems such as curve fitting. In least squares problems, we try to fit the given data points with a function such that the sum of the squares of the errors between the actual data points and the points on the function is minimized. LMA uses a combination of gradient descent and the Gauss-Newton method. Gradient descent is employed to reduce the sum of the squared errors by updating the parameters of the function in the direction of steepest descent, while the Gauss-Newton method minimizes the error by assuming the function to be locally quadratic and finding the minimum of that quadratic [49].

If the fitting function is denoted by ŷ(t; p) and the m data points by (ti, yi), then the squared error can be written as [49]:

χ²(p) = Σ_{i=1..m} [(y(ti) − ŷ(ti; p)) / σyi]²    (26)
      = (y − ŷ(p))^T W (y − ŷ(p))    (27)
      = y^T W y − 2 y^T W ŷ + ŷ^T W ŷ    (28)

where σyi, the measurement error for y(ti), is the inverse of the weighting matrix element Wii.

The gradient of the squared error function with respect to the n parameters can be written as [49]:

∂χ²/∂p = 2 (y − ŷ(p))^T W ∂(y − ŷ(p))/∂p    (29)
       = −2 (y − ŷ(p))^T W [∂ŷ(p)/∂p]    (30)
       = −2 (y − ŷ)^T W J    (31)

hgd = α J^T W (y − ŷ)    (32)

where J is the m × n Jacobian matrix used in place of [∂ŷ/∂p], and hgd is the update in the direction of steepest gradient descent.

The equation for the Gauss-Newton method update (hgn) is as follows [49]:

[J^T W J] hgn = J^T W (y − ŷ)    (33)
The Levenberg-Marquardt update (hlm) is generated by combining the gradient descent and Gauss-Newton methods, resulting in the equation below [49]:

[J^T W J + λ diag(J^T W J)] hlm = J^T W (y − ŷ)    (34)
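The LM update of Equation (34) can be sketched for a toy curve-fitting problem as follows; this NumPy illustration assumes W = I and uses a simple λ-adaptation rule, so it is only a sketch of the method in [49], not a reference implementation.

```python
import numpy as np

# Toy curve fit y ~ p0 * exp(p1 * t) using the LM update of Eq. (34), with W = I
# and a simple lambda-adaptation rule; all settings are illustrative.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 50)
p_true = np.array([2.0, -1.5])
y = p_true[0] * np.exp(p_true[1] * t) + 0.01 * rng.normal(size=t.size)

def model(p):
    return p[0] * np.exp(p[1] * t)

def jacobian(p):
    # Columns are d y_hat / d p0 and d y_hat / d p1.
    return np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])

p, lam = np.array([1.0, 0.0]), 1e-2
for it in range(50):
    r = y - model(p)                       # residuals (y - y_hat)
    J = jacobian(p)
    A = J.T @ J
    h = np.linalg.solve(A + lam * np.diag(np.diag(A)), J.T @ r)  # Eq. (34)
    if np.sum((y - model(p + h)) ** 2) < np.sum(r ** 2):
        p, lam = p + h, lam * 0.5          # accept step, move toward Gauss-Newton
    else:
        lam *= 2.0                         # reject step, move toward gradient descent
print(np.round(p, 3))                      # close to [2.0, -1.5]
```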
TABLE 3. Deep learning algorithm comparison table.

B. HYPERPARAMETER OPTIMIZATION
The learning rate and the regularization parameters constitute the most commonly used hyperparameters in a DNN. The learning rate determines the rate at which the weights are updated. The purpose of regularization is to prevent overfitting, and the regularization parameter affects its degree of influence on the loss function. CNNs have additional hyperparameters, i.e., the number of filters, the filter shapes, the number of dropouts and the max-pooling shapes at each convolution layer, and the number of nodes in the fully connected layer. These parameters are very important for training and modeling a DNN. Coming up with an optimal set of parameter values is a challenging feat. Exhaustively iterating through each combination of hyperparameter values is computationally very expensive. For example, if training and evaluating a DNN with the full dataset takes ten minutes, then seven hyperparameters, each with eight potential values, will take 8⁷ × 10 min, i.e., 20,971,520 minutes or almost 40 years to exhaustively train and evaluate the network on all combinations of the hyperparameter values. Hyperparameters can instead be optimized with different metaheuristics. Metaheuristics are nature-inspired guiding principles that can help in traversing the search space more intelligently and much faster than the exhaustive method.
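To illustrate the combinatorial cost and one common cheaper alternative, the sketch below enumerates the size of such a grid and then runs a small random search. train_and_evaluate() is a hypothetical placeholder for actually training and scoring the DNN, and the hyperparameter names and candidate values are assumptions made only for illustration.

```python
import math
import random

# Seven hyperparameters with eight candidate values each: an exhaustive grid
# has 8**7 = 2,097,152 combinations (the ~40-year figure above). Random search
# with a fixed budget is one common cheaper alternative.
space = {
    "learning_rate": [10.0**-k for k in range(1, 9)],
    "reg_lambda":    [10.0**-k for k in range(1, 9)],
    "num_filters":   [8, 16, 24, 32, 48, 64, 96, 128],
    "filter_size":   [1, 2, 3, 4, 5, 6, 7, 8],
    "dropout_rate":  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
    "batch_size":    [8, 16, 32, 64, 128, 256, 512, 1024],
    "dense_units":   [32, 64, 96, 128, 192, 256, 384, 512],
}
print(math.prod(len(v) for v in space.values()))      # 2097152 grid points

def train_and_evaluate(config):
    return random.random()                            # placeholder validation score

candidates = [{name: random.choice(values) for name, values in space.items()}
              for _ in range(50)]                     # budget of 50 random configurations
best = max(candidates, key=train_and_evaluate)
```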
Particle Swarm Optimization (PSO) is another type of metaheuristic that can be used for hyperparameter optimization. PSO is modeled on how birds fly around in search of food or during migration. The velocity and location of the birds (or particles) are adjusted to steer the swarm towards better solutions in the vast search space. Escalante et al. used PSO for hyperparameter optimization to build a competitive model that ranked among the top relative to other comparable methods [52].
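A minimal PSO sketch (a NumPy illustration, not the model-selection procedure of [52]) shows the velocity and position updates steering the swarm on a toy cost surface; in hyperparameter optimization, each position would encode a hyperparameter setting and the cost would be a validation score.

```python
import numpy as np

rng = np.random.default_rng(3)
cost = lambda x: np.sum((x - np.array([2.0, -1.0])) ** 2, axis=-1)   # toy 2-D cost

n, dim = 20, 2
pos = rng.uniform(-5, 5, size=(n, dim))   # particle locations
vel = np.zeros((n, dim))                  # particle velocities
pbest = pos.copy()
gbest = pbest[np.argmin(cost(pbest))]

for it in range(100):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    # inertia + pull toward each particle's best + pull toward the swarm's best
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel                                   # steer the swarm
    better = cost(pos) < cost(pbest)
    pbest[better] = pos[better]                       # personal best positions
    gbest = pbest[np.argmin(cost(pbest))]             # best solution found so far
print(np.round(gbest, 3))                             # near [2, -1]
```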
Genetic algorithm (GA) is a metaheuristic that is commonly used to solve combinatorial optimization problems. It mimics the selection and crossover processes of species reproduction and how they contribute to evolution and the improvement of a species' prospects of survival. Figure 14a shows a high-level diagram of the GA. Figure 14b illustrates the crossover process, where parts of the respective genetic sequences of both parents are merged to form the new genetic sequence of the children. The goal is to find a population member (a sequence of numbers resembling DNA nucleotides) that meets the fitness requirement. Each population member represents a potential solution. Population members are selected based on different methods, e.g., elite, roulette, rank and tournament.

FIGURE 14. (a) Genetic algorithm [53]. (b) Crossover in genetic algorithm.

The elite method ranks population members by fitness and only uses high-fitness members for the crossover process. The mutation process then makes random changes to the number sequence, and the entire process continues until a desired fitness or a maximum number of iterations is reached. References [53], [54] propose parallelization and hybridization of GA to achieve better and faster results. Parallelization provides both speedup and better results, as we can periodically exchange population members between the distributed and parallel runs of the genetic algorithm on different sets of population members. Hybridization is the process of mixing the primary algorithm (GA in this case) with other operations, like local search. Shrestha and Mahmood [53] incorporated the 2-Opt local search method into GA to improve the search for the optimal solution. Reference [55] postulates that correctly performed exchanges (e.g., in GA) breed innovation and result in creative solutions to hard problems, just as in real life, where collaboration and exchanges take place between individuals, organizations and societies. In addition to GA, other variations of evolution-based metaheuristics have also been used to evolve and optimize deep learning architectures and hyperparameters. E.g., [56] proposed the CoDeepNEAT framework, based on the deep neuroevolution technique, for finding an optimized architecture to match the task at hand.
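Returning to the basic loop of selection, crossover and mutation described above, the following is a toy GA sketch (an added illustration with arbitrary population size, mutation rate and fitness function), using elite selection and single-point crossover.

```python
import random

# Toy genetic algorithm: members are number sequences ("DNA"), fitness counts
# matches against a target pattern, selection is elitist, and crossover splices
# the parents' sequences; all sizes and rates are illustrative.
random.seed(4)
target = [random.randint(0, 9) for _ in range(20)]
fitness = lambda member: sum(a == b for a, b in zip(member, target))

population = [[random.randint(0, 9) for _ in range(20)] for _ in range(50)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    elite = population[:10]                              # elite selection
    children = []
    while len(children) < 40:
        p1, p2 = random.sample(elite, 2)
        cut = random.randint(1, 19)
        child = p1[:cut] + p2[cut:]                      # single-point crossover
        if random.random() < 0.3:                        # mutation: random change
            child[random.randint(0, 19)] = random.randint(0, 9)
        children.append(child)
    population = elite + children
    if fitness(population[0]) == len(target):            # desired fitness reached
        break
print(generation, fitness(max(population, key=fitness)))
```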
D. BATCH NORMALIZATION
As the network gets trained, with variations to the weights and parameters, the distribution of the actual data inputs at each layer of the DNN changes too, often making them all too large or too small and thus making them difficult to train on, especially with activation functions that implement saturating nonlinearities, e.g., the sigmoid and tanh functions. Ioffe and Szegedy [59] proposed the idea of batch normalization in 2015. It has made a huge difference in improving the training time and accuracy of DNNs. It updates the inputs to have unit variance and zero mean at each mini-batch.
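The per-mini-batch normalization can be sketched in a few lines of NumPy (an added illustration; gamma and beta are the learnable scale and shift parameters of [59], shown here untrained).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean and unit
    variance, then apply the learnable scale (gamma) and shift (beta).
    x has shape (batch_size, features)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 10 + 100          # badly scaled layer inputs
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```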
E. SUPERVISED PRETRAINING
Supervised pretraining constitutes breaking down complex problems into smaller parts, training the simpler models, and later combining them to solve the larger model. Greedy algorithms are commonly used in the supervised pre-training of DNNs.
F. DROPOUT
There are a few commonly used methods to lower the risk of overfitting. In the dropout technique, we randomly choose units and nullify their weights and outputs so that they do not influence the forward pass or the backpropagation. Figure 16 shows a fully connected DNN on the left and a DNN with dropout on the right. The other methods include the use of regularization and simply enlarging the training dataset using label-preserving techniques. Dropout works better than regularization to reduce the risk of overfitting, and it also speeds up the training process. Reference [60] proposed the dropout technique and demonstrated significant improvement on supervised-learning-based DNNs for computer vision, computational biology, speech recognition and document classification problems.
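A sketch of the dropout mask follows; it uses the common "inverted dropout" scaling so that expected activations are unchanged, which is a popular variant of the technique in [60] rather than its exact test-time rescaling.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Randomly zero units during training so they take no part in the forward
    pass or backpropagation; inverted scaling keeps the expected activation
    unchanged. At test time the layer is left untouched."""
    if not training or drop_prob == 0.0:
        return activations
    mask = (np.random.rand(*activations.shape) >= drop_prob)
    return activations * mask / (1.0 - drop_prob)

h = np.random.randn(4, 8)          # hidden-layer outputs for a mini-batch of 4
print(dropout(h, drop_prob=0.5))   # roughly half of the units are nullified
```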
TABLE 4. DL algorithm shortcomings & resolution techniques.

The weight update for the RBM can be written as:

Δwij(t + 1) = cΔwij(t) + α(⟨vi hj⟩data − ⟨vi hj⟩model)    (37)

Equations (38) and (39) give the probability distributions for the hidden and visible units [23]:

p(hj = 1 | v; W) = σ(Σ_{i=1..I} wij vi + aj)    (38)

p(vi = 1 | h; W) = σ(Σ_{j=1..J} wij hj + bi)    (39)
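Equations (38) and (39) translate directly into a Gibbs-sampling step; the NumPy sketch below uses random, untrained weights purely for illustration.

```python
import numpy as np

# Direct translation of Eqs. (38)-(39): conditional probabilities of hidden and
# visible units in an RBM, followed by one Gibbs-sampling step. W, a, b and the
# unit counts are random placeholders, not a trained model.
rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

I, J = 6, 3                        # visible and hidden unit counts
W = 0.1 * rng.normal(size=(I, J))  # weights w_ij
a, b = np.zeros(J), np.zeros(I)    # hidden and visible biases

v = rng.integers(0, 2, size=I)     # a binary visible vector
p_h = sigmoid(v @ W + a)           # Eq. (38): p(h_j = 1 | v; W)
h = (rng.random(J) < p_h).astype(int)
p_v = sigmoid(W @ h + b)           # Eq. (39): p(v_i = 1 | h; W)
v_recon = (rng.random(I) < p_v).astype(int)
print(p_h.round(2), p_v.round(2))
```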
D. BIG DATA
Big data provides a tremendous opportunity and challenge for deep learning. Big data is known for the 4 Vs (volume, velocity, veracity, variety). Unlike shallow networks, DNNs can handle the huge volume and variety of data and significantly improve the training process and the ability to fit more complex models. On the flip side, the sheer velocity of data that is generated in real time can be daunting to process. Najafabadi et al. [47] raise similar challenges in learning from real-time streaming data, such as credit card usage monitored for fraud detection. They propose using parallel and distributed processing with thousands of CPU cores. In addition, we should also use cloud providers that support auto-scaling based on usage and workload. Not all data represent the same quality. In the case of computer vision, images from constrained sources, e.g., studios, are much easier to recognize than the ones from unconstrained sources like surveillance cameras. Reference [64] proposes a method to utilize multiple images of the unconstrained source to enhance the recognition process.

Deep learning can help mine and extract useful patterns from big data and build models for inference, prediction and business decision making. There are massive volumes of structured and unstructured data and media files getting generated today, making information retrieval very challenging. Deep learning can help with semantic indexing to enable information to be more readily accessible in search engines [14], [65]. This involves building models that provide relationships between documents and the keywords they contain, to make information retrieval more effective.

E. GENERATIVE TOP DOWN CONNECTION (GENERATIVE MODEL)
Much of the training is usually implemented with a bottom-up approach, where discriminatory or recognition models are developed using backpropagation. A bottom-up model is one that takes the vector representation of input objects and computes higher-level feature representations at subsequent layers, with a final discrimination or recognition pattern at the output layer. One of the shortcomings of backpropagation is that it requires labeled data to train. Geoffrey Hinton proposed a novel way of overcoming this limitation in 2007 [66]. He proposed a multi-layer DNN that used generative top-down connections, as opposed to bottom-up connections, to mimic the way we generate visual imagery in our dreams without the actual sensory input. In a top-down generative connection, the high-level data representations, or the outputs of the network, are used to generate the low-level raw vector representations of the original inputs, one layer at a time. The layers of feature representations learned with this approach can then be further perfected either in generative models such as auto-encoders or even in standard recognition models [66].

In the generative model in Figure 18, since the correct upstream cause of the events in each layer is known, a comparison between the actual cause and the prediction made by the approximate inference procedure can be made, and the recognition weights rij can be adjusted to increase the probability of correct prediction.
FIGURE 20. Pretraining of stacked & altered RBM to create a DBM [67].

set, and L(x, y) is the loss function, with x representing the input and y representing the output, i.e., the reconstructed
• Kernel spectral clustering (KSC) is used as the core model
• A regularization term is introduced and labels (from
Since unlabeled data is more abundantly available relative to labeled data, it would be beneficial to make the most of it with unsupervised or, in this case, semi-supervised learning.
J. VERY DEEP CONVOLUTIONAL NETWORKS FOR NATURAL LANGUAGE PROCESSING
Deep CNNs have mostly been used in computer vision, where they are very effective. Conneau et al. [74] applied them for the first time to NLP with up to 29 convolution layers. The goal is to analyze and extract layers of hierarchical representations from words and sentences at the syntactic, semantic and contextual level. One of the major reasons for the lack of earlier deep CNNs for NLP is that deeper networks tend to cause saturation and degradation of accuracy. This is in addition to the processing overhead of more layers. He et al. [62] state that the degradation is not caused by overfitting but by the fact that deeper systems are difficult to optimize. Reference [62] addressed this issue with shortcut connections between the convolution blocks to let the gradients propagate more freely, and they, along with [74], were able to validate the benefits of the shortcuts with 10/101/152 layers and 49 layers, respectively. Conneau et al.'s [74] architecture consists of a series of convolution blocks separated by pooling layers that halve the resolution, followed by k-max pooling and classification at the end.
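The shortcut connections mentioned above can be sketched as a simple residual block in PyTorch (a simplified illustration in the spirit of [62]; it is not the exact block used in [62] or [74]).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Convolution block with a shortcut (identity) connection: the input is
    added back to the block's output, giving gradients a direct path during
    backpropagation."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # shortcut connection

x = torch.randn(2, 64, 16, 16)
y = nn.Sequential(*[ResidualBlock(64) for _ in range(4)])(x)
print(y.shape)   # torch.Size([2, 64, 16, 16])
```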
K. METAHEURISTICS
Metaheuristics can be used to train neural networks to overcome the limitations of backpropagation-based learning. When implementing a metaheuristic as the training algorithm, each weight of the neural network is represented by a dimension in the multi-dimensional solution search space of the problem we are trying to solve. The goal is to come as near as possible to the optimal values of the weights, i.e., a location in the search space that represents the global best solution. Particle Swarm Optimization (PSO) is a type of metaheuristic inspired by the movement of birds in the sky; it consists of particles, or candidate solutions, that move about in a search space to reach a near-optimal solution. In their paper [75], N. Krpan and D. Jakobovic ran parallel implementations using backpropagation and PSO. Their results demonstrate that while parallelization improves the efficacy of both algorithms, parallel backpropagation is efficient only on large networks, whereas parallel PSO has a wider influence on various sizes of problems.

Similarly, Dong and Zhou [76] complemented PSO with a supervised learning control module to guide the search for the global minima of an optimization problem. The supervised learning module provided real-time feedback with back diffusion (BD) to retain diversity, and social attractor renewal to overcome stagnation [76]. Metaheuristics provide high-level guidance inspired by nature and apply it to solve mathematical problems. In a similar way, [77] proposes incorporating the concepts of an intelligent teacher and privileged information, which is essentially extra information available during training but not during evaluation or testing, into the DNN training process.

L. GENETIC ALGORITHM
Genetic Algorithm (GA) is a metaheuristic that can be effectively used in training DNNs. GA mimics the evolutionary processes of selection, crossover and mutation. Each population member represents a possible solution with a set of weights. Unlike PSO, which includes only one operator for adjusting the solution, evolutionary algorithms like GA include various steps, i.e., selection, crossover and mutation methods [52]. Population members undergo several iterations of selection and crossover, based on known strategies, to achieve better solutions in the next iteration or generation. GA has undergone decades of improvement and refinement since it was first proposed in 1976 [78]. There are several ways to perform selection, e.g., elite, roulette, rank and tournament [79]. There are about a dozen ways to perform crossover catalogued by Larrañaga et al. alone [80]. Selection methodologies represent exploration of the solution space, and crossovers represent exploitation of the selected solution candidates. The goal is to get better solutions through wider exploration and deeper exploitation. Additional tweaking can be introduced with mutation. Parallel clusters of GA can be executed independently in islands, and a few members can be exchanged between the islands every so often [81]. In addition, we can also utilize local search, such as the greedy algorithm, Nearest Neighbor or the K-opt algorithm, to further improve the quality of the solution.

Lin et al. [82] demonstrated a successful incorporation of GA that resulted in better classification accuracy and performance of a Polynomial Neural Network. Standard GA operations including selection, crossover and mutation were used on parameters that included partial descriptions (PDs) of inputs in the first layer, bias and all input features [82].

GA was further enhanced with the incorporation of the concept of mitochondrial DNA (mtDNA). In evolution, it is quite evident from casual observation and simple reasoning that crossover of population members with too much similarity does not yield much variance in the offspring. Likewise, we can infer that in GA, selection and crossover between solutions that are very similar would not result in a high degree of exploration of the multi-dimensional solution space. In fact, it might run the risk of getting pigeonholed into a restricted pattern.

Diversity is the key to overcoming the risk of getting stuck in local minima. This risk can be mitigated by exploiting the idea of mtDNA. mtDNA represents one percent of the human chromosomes [83]. The concept of incorporating mitochondrial DNA into GA was introduced by Shrestha and Mahmood [53]. They describe a way to restrict crossover between population members, or solution candidates, based on the proximity of their mtDNA values [53]. Unlike the rest of the 99% of DNA, mtDNA is only inherited from the female; thus it is a more continuous marker of lineage or genetic proximity. The premise behind this is that offspring of population members with similar genetic makeup do not help with overcoming
the local minima. Figure 24 describes the parallel and distributed nature of their full implementation [53], along with the GA operators (selection, mutation and mtDNA-incorporated crossover). The training process is enhanced [53] with the implementation of a continental model, where distributed servers run multiple threads, each running an instance of GA with mtDNA. Population members are then exchanged between the servers after a fixed number of iterations, as shown in Figure 24.

FIGURE 24. Continental model with mtDNA [53].

O. ADVERSARIAL TRAINING
Machine learning training and deployment used to be done on isolated computers, but now they are increasingly being done in a highly interconnected commercial production environment. Take a face recognition system where a network could be trained on a fleet of servers with a training dataset imported from an external data source, and the trained model could be deployed on another server which accepts API calls with real-time inputs (e.g., images of people entering a building) and responds with matches. The interconnected architecture exposes the machine learning to a wide attack surface. The real-time input or the training dataset can be manipulated by an adversary.
FIGURE 25. GNMT architecture [84] with encoder neural network on the left and decoder neural network on the right.
[10] K. Nagpal et al., "Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer," CoRR, Nov. 2018.
[11] S. Nevo, "ML for flood forecasting at scale," CoRR, Jan. 2019.
[12] A. Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, p. 115, 2017.
[13] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[14] M. Gheisari, G. Wang, and M. Z. A. Bhuiyan, "A survey on deep learning in big data," in Proc. IEEE Int. Conf. Comput. Sci. Eng. (CSE), Jul. 2017, pp. 173–180.
[15] S. Pouyanfar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Comput. Surv., vol. 51, no. 5, p. 92, 2018.
[16] R. Vargas, A. Mosavi, and R. Ruiz, "Deep learning: A review," in Proc. Adv. Intell. Syst. Comput., 2017, pp. 1–11.
[17] M. D. Buhmann, Radial Basis Functions. Cambridge, U.K.: Cambridge Univ. Press, 2003, p. 270.
[18] A. A. Akinduko, E. M. Mirkes, and A. N. Gorban, "SOM: Stochastic initialization versus principal components," Inf. Sci., vols. 364–365, pp. 213–221, Oct. 2016.
[19] K. Chen, "Deep and modular neural networks," in Springer Handbook of Computational Intelligence, J. Kacprzyk and W. Pedrycz, Eds. Berlin, Germany: Springer, 2015, pp. 473–494.
[20] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in Proc. 14th Int. Conf. Neural Inf. Process. Syst. Cambridge, MA, USA: MIT Press, 2001, pp. 841–848.
[21] C. M. Bishop and J. Lasserre, "Generative or discriminative? Getting the best of both worlds," Bayesian Statist., vol. 8, pp. 3–24, Jan. 2007.
[22] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," CoRR, Apr. 2017.
[23] X.-W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514–525, 2014.
[24] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. IEEE Int. Symp. Circuits Syst., May/Jun. 2010, pp. 253–256.
[25] G. Gousios, B. Vasilescu, A. Serebrenik, and A. Zaidman, "Lean GHTorrent: GitHub data on demand," in Proc. 11th Work. Conf. Mining Softw. Repositories, Hyderabad, India, 2014, pp. 384–387.
[26] AI-Index. (2019). Top Deep Learning Github Repositories. [Online]. Available: https://github.com/mbadry1/Top-Deep-Learning
[27] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[29] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, A. A. Michael, Ed. Cambridge, MA, USA: MIT Press, 1998, pp. 255–258.
[30] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Computer Vision. Berlin, Germany: Springer, 2010.
[31] A. Ng. (Jul. 21, 2018). Convolutional Neural Network. UFLDL. [Online]. Available: http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/
[32] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Schölkopf, "A machine learning approach for non-blind image deconvolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1067–1074.
[33] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, Nov. 2015.
[34] I. T. Jolliffe, "Principal component analysis," in Mathematics and Statistics, 2nd ed. New York, NY, USA: Springer, 2002, p. 487.
[35] K. Noda, "Multimodal integration learning of object manipulation behaviors using deep neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 1728–1733.
[36] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[37] M. Wang, H.-X. Li, X. Chen, and Y. Chen, "Deep learning-based model reduction for distributed parameter systems," IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 12, pp. 1664–1674, Dec. 2016.
[38] A. Ng. (Jul. 21, 2018). Autoencoders. UFLDL. [Online]. Available: http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders
[39] Y. W. Teh and G. E. Hinton, "Rate-coded restricted Boltzmann machines for face recognition," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 908–914.
[40] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade, 2nd ed., G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Berlin, Germany: Springer, 2012, pp. 599–619.
[41] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[42] C. Metz, "Apple is bringing the AI revolution to your phone," in Wired, 2016.
[43] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, 2000.
[44] J. Chung. (2014). "Empirical evaluation of gated recurrent neural networks on sequence modeling." [Online]. Available: https://arxiv.org/abs/1412.3555
[45] K. Cho. (2014). "Learning phrase representations using RNN encoder-decoder for statistical machine translation." [Online]. Available: https://arxiv.org/abs/1406.1078
[46] B. Naul, J. S. Bloom, F. Pérez, and S. van der Walt, "A recurrent neural network for classification of unevenly sampled variable stars," Nature Astron., vol. 2, no. 2, pp. 151–155, 2018.
[47] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, "Deep learning applications and challenges in big data analytics," J. Big Data, vol. 2, no. 1, p. 1, Feb. 2015.
[48] I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," in Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press, 2016, p. 775.
[49] H. P. Gavin, "The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems," Tech. Rep., 2016.
[50] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[51] J. Martens, "Deep learning via Hessian-free optimization," in Proc. 27th Int. Conf. Mach. Learn. Haifa, Israel: Omnipress, 2010, pp. 735–742.
[52] H. J. Escalante, M. Montes, and L. E. Sucar, "Particle swarm model selection," J. Mach. Learn. Res., vol. 10, pp. 405–440, Feb. 2009.
[53] A. Shrestha and A. Mahmood, "Improving genetic algorithm with fine-tuned crossover and scaled architecture," J. Math., vol. 2016, p. 10, Mar. 2016.
[54] K. Sastry, D. Goldberg, and G. Kendall, Genetic Algorithms. 2005.
[55] D. E. Goldberg, The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Boston, MA, USA: Springer, 2013.
[56] R. Miikkulainen, "Evolving deep neural networks," CoRR, Mar. 2017.
[57] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[58] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, Dec. 2014.
[59] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, Feb. 2015.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[61] AW Services. (Jul. 21, 2018). Amazon EC2 P2 & P3 Instances. Amazon EC2 Instance Types. [Online]. Available: https://aws.amazon.com/ec2/instance-types/p2/ and https://aws.amazon.com/ec2/instance-types/p3/
[62] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[63] A. J. R. Simpson, "Uniform learning in a deep neural network via 'oddball' stochastic gradient descent," CoRR, Oct. 2015.
[64] L. Best-Rowden, H. Han, C. Otto, B. F. Klare, and A. K. Jain, "Unconstrained face recognition: Identifying a person of interest from a media collection," IEEE Trans. Inf. Forensics Security, vol. 9, no. 12, pp. 2144–2157, Dec. 2014.
[65] T. A. Letsche and M. W. Berry, "Large-scale information retrieval with latent semantic indexing," Inf. Sci., vol. 100, nos. 1–4, pp. 105–137, 1997.
[66] G. E. Hinton, "Learning multiple layers of representation," Trends Cognit. Sci., vol. 11, no. 10, pp. 428–434, Oct. 2007.
[67] R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in Proc. 12th Int. Conf. Artif. Intell. Statist., D. van Dyk and M. Welling, Eds., 2009, pp. 448–455.
[68] W. Kuo, B. Hariharan, and J. Malik, "DeepBox: Learning objectness with convolutional networks," CoRR, May 2015.
[69] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[70] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 809–821, Apr. 2015.
[71] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, "A multiobjective sparse feature learning model for deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 3263–3277, Dec. 2015.
[72] S. Mehrkanoon, C. Alzate, R. Mall, R. Langone, and J. A. K. Suykens, "Multiclass semisupervised learning based upon kernel spectral clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 4, pp. 720–733, Apr. 2015.
[73] R. Langone, R. Mall, C. Alzate, and J. A. K. Suykens, "Kernel spectral clustering and applications," CoRR, May 2015.
[74] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, "Very deep convolutional networks for text classification," CoRR, Jun. 2016.
[75] N. Krpan and D. Jakobovic, "Parallel neural network training with OpenCL," in Proc. 35th Int. Conv. MIPRO, May 2012, pp. 1053–1057.
[76] W. Dong and M. Zhou, "A supervised learning and control method to improve particle swarm optimization algorithms," IEEE Trans. Syst., Man, Cybern., Syst., vol. 47, no. 7, pp. 1135–1148, Jul. 2017.
[77] V. Vapnik and R. Izmailov, "Learning using privileged information: Similarity control and knowledge transfer," J. Mach. Learn. Res., vol. 16, no. 1, pp. 2023–2049, Jan. 2015.
[78] J. R. Sampson, Adaptation in Natural and Artificial Systems, vol. 18, no. 3, J. H. Holland, Ed. Philadelphia, PA, USA: SIAM, 1976, pp. 529–530.
[79] N. M. Razali and J. Geraghty, "Genetic algorithm performance with different selection strategies in solving TSP," in Proc. World Congr. Eng., 2010, pp. 1–6.
[80] P. Larrañaga, C. M. H. Kuijpers, R. H. Murga, I. Inza, and S. Dizdarevic, "Genetic algorithms for the travelling salesman problem: A review of representations and operators," Artif. Intell. Rev., vol. 13, no. 2, pp. 129–170, Apr. 1999.
[81] D. Whitley, "A genetic algorithm tutorial," Statist. Comput., vol. 4, no. 2, pp. 65–85, Jun. 1994.
[82] C.-T. Lin, M. Prasad, and A. Saxena, "An improved polynomial neural network classifier using real-coded genetic algorithm," IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 11, pp. 1389–1401, Nov. 2015.
[83] Y. Guo et al., "The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation," Mutation Res./Genetic Toxicol. Environ. Mutagenesis, vol. 744, no. 2, pp. 154–160, 2012.
[84] Y. Wu, "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, Sep. 2016.
[85] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, "Multi-instance multi-label learning," Artif. Intell., vol. 176, no. 1, pp. 2291–2320, 2012.
[86] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein, and J. D. Tygar, "Adversarial machine learning," in Proc. 4th ACM Workshop Secur. Artif. Intell., Chicago, IL, USA, 2011, pp. 43–58.
[87] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach. London, U.K.: Springer, 2015.
[88] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 1735–1742.
[89] A. Shrestha and A. Mahmood, "Enhancing siamese networks training with importance sampling," in Proc. 11th Int. Conf. Agents Artif. Intell. Prague, Czech Republic: SciTePress, 2019, pp. 610–615.
[90] D. P. Kingma and M. Welling. (2013). "Auto-encoding variational Bayes." [Online]. Available: https://arxiv.org/abs/1312.6114
[91] D. Silver et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[92] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau, "An introduction to deep reinforcement learning," CoRR, Dec. 2018.
[93] I. J. Goodfellow et al. (2014). "Generative adversarial networks." [Online]. Available: https://arxiv.org/abs/1406.2661
[94] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. (2016). "Generative adversarial text to image synthesis." [Online]. Available: https://arxiv.org/abs/1605.05396
[95] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Discovery, vol. 6, no. 2, pp. 153–172, 2002.
[96] S. Albelwi and A. Mahmood, "A framework for designing the architectures of deep convolutional neural networks," Entropy, vol. 19, no. 6, p. 242, 2017.

AJAY SHRESTHA received the B.S. degree in computer engineering and the M.S. degree in computer science from the University of Bridgeport, CT, USA, in 2002 and 2006, respectively, where he is currently pursuing the Ph.D. degree in computer science and engineering. He has guest lectured at Pennsylvania State University. He is also an Adjunct Faculty with the School of Engineering, University of Bridgeport, and with Thermo Fisher Scientific, Branford, CT, USA, as a Manager of Technical Operations. His research interests include machine learning and metaheuristics. He has served as a Technical Committee Member of the International Conference on Systems, Computing Sciences and Software Engineering (SCSS). He received the Academic Excellence Award and the Graduate Research Assistantship for his undergraduate and graduate studies, respectively. He has been serving as the Chapter Vice President and in other officer roles of Upsilon Pi Epsilon (UPE) since 2014, and received the UPE Executive Council Award presented by the UPE Executive Council, in 2016.

AUSIF MAHMOOD (SM'82) received the M.S. and Ph.D. degrees in electrical and computer engineering from Washington State University, USA. He is currently the Chair Person of the Computer Science and Engineering Department and a Professor with the Computer Science and Engineering Department and the Electrical Engineering Department, University of Bridgeport, Bridgeport, CT, USA. His research interests include parallel and distributed computing, computer vision, deep learning, and computer architecture.