Crash Course DL
by Jakub Boksansky
1 Introduction
As I recently went through a journey of learning how to make use of deep learning (in the context
of computer graphics), I thought it would be good to write down some notes to help others get up to
speed quickly. The goal of this article is to make the reader familiar with terms and concepts used in deep
learning and to implement a basic deep learning algorithm. This should make it easier to study and
understand other deep learning resources. This article comes with source code for a sample application written in HLSL/DirectX 12, available at https://github.com/boksajak/Dx12NN.
2 Neural Network
Deep learning algorithms are powered by artificial neural networks. These are collections of
interconnected neurons (also called units), usually organized into layers. When a neural network has a large number of layers, we say that the network is deep, which gives deep learning its name. Each neuron has its
inputs connected to other neurons via weighted connections, performs a simple computation over them,
and passes its output to other neurons.
The type of neural network that we’re going to use in this article is called multilayer perceptron
(MLP). It is relatively simple, but also powerful, and is widely used in practice (e.g., for the neural radiance cache (NRC) and neural radiance fields (NeRFs)).
o Neurons are organized into layers, where the first layer is called the input layer, the last layer is the output layer, and between them are the hidden layers.
o Each neuron has its inputs connected to outputs of all neurons in the previous layer.
Therefore, we say that MLPs are fully connected networks; see Figure 1 for an example.
o This network architecture forms a directed acyclic graph, meaning that information only flows in one direction, starting in the input layer and propagating through the hidden layers to the output layer. We say that such a network is a feedforward network.
▪ Note: Other network types, where information can also flow backwards (via feedback connections), are called recurrent networks. Some networks also have memory cells to store values from previous evaluations of the network.
The number of neurons and layers in the network determines how powerful the network is. Larger networks have the ability to learn more complex functions (we say that they have a larger capacity) but are typically more difficult to train and slower to evaluate. When designing a deep learning algorithm, we need to find a network size which is just right for the task. E.g., the NeRF paper uses an MLP with 9 hidden layers consisting of 256 neurons each, and NRC uses 5 hidden layers with 64 neurons each.
Note that these networks are small enough that they can be implemented in compute shaders
and executed per pixel. In practice, some clever optimizations are necessary but it’s not impossible to have
a deep learning algorithm running in real-time on current GPUs. It is also common to run neural networks
using types with reduced precision, like FP16 or even smaller.
Figure 1 Architecture of a multi-layer perceptron neural network with 2 neurons in the input layer, 3 neurons per hidden layer
and 1 neuron in the output layer.
Each neuron's activation is computed in three steps:
1. First, we sum up the activations of all neurons from the previous layer connected to this neuron, weighted by the strength of each connection (this strength is also called the weight).
2. We add the bias of the evaluated neuron to the sum. The bias is a per-neuron value which shifts the neuron's output and helps to represent relationships between network inputs and outputs. Without the bias, the output of the network for a zero input could only be zero.
3. We apply the activation function to this sum to produce the final neuron activation. This must be a non-linear function, which introduces a non-linearity between inputs and outputs. Usually, we use a simple function like max(0, x). Without the activation function, the outputs of the network could only be a linear combination of the inputs.
$$O_i = \sigma\left(\left(\sum_{j \in M} O_j \cdot w_{ij}\right) + b_i\right)$$

where $O_i$ is the activation of the evaluated neuron, $\sigma$ is the activation function, $M$ is the set of input neurons connected to the evaluated neuron, $O_j$ is the activation of the input neuron $j$ from the previous layer, $w_{ij}$ is the weight of the connection between neurons $i$ and $j$, and $b_i$ is the bias value of the evaluated neuron.
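To make the formula concrete, here is a small worked example with made-up numbers: a neuron with two inputs $O_1 = 0.5$ and $O_2 = 1.0$, weights $w_{i1} = 0.2$ and $w_{i2} = -0.4$, bias $b_i = 0.1$, and activation function $\sigma(x) = \max(0, x)$ produces

$$O_i = \max\left(0,\ 0.5 \cdot 0.2 + 1.0 \cdot (-0.4) + 0.1\right) = \max(0, -0.2) = 0$$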
This means that the output of the network for a specific input is given by the weights and biases, and we need to set them accordingly to produce the desired results. The process of adjusting the weights and biases to make the network do what we want is called training. Networks are commonly trained for one of two kinds of tasks:
o Classification: Categorizes the input into one (or more) predefined categories. These neural networks have one output neuron for each category and output the probability of the input belonging to each category. E.g., we can have an input image of a hand-written digit and train the MLP to assign probabilities of the digit belonging to the categories representing the digits from 0 to 9. This exercise is a common “hello world” example of deep learning, and there is a freely available set of images of hand-written digits called MNIST that can be used for it.
o Regression: For a given input, the network calculates a continuous numerical value on the output (also called a prediction). The goal is to train the network to perform a desired mapping between inputs and output values. E.g., the NRC algorithm trains the network to map inputs like surface position and normal to radiance values.
In our sample application, we will train the network to do regression; specifically, it will learn to represent a 2D texture. We will provide the UV coordinates of the texture as inputs, and we'll expect the RGB value of the corresponding texel on the output. We want it to learn the mapping:
(𝑢, 𝑣) → (𝑅, 𝐺, 𝐵)
For this example, our network will have 2 neurons in the input layer, corresponding to the 𝑢 and
𝑣 coordinates in the texture. On the output, we will have 3 neurons corresponding to the RGB values of
the texel. In practice, it is common to encode the input into a different representation. A naïve encoding which simply assigns inputs to input neurons as they are is called identity, and while it works, some clever
encoding schemes usually perform much better. We’ll talk more about input encodings in section 4. Note
that the input layer doesn’t perform any computation – instead of calculating activation of the neurons in
the input layer, we simply assign input values as their activations.
Before we can use the MLP, it must be trained to perform our desired mapping well. Training is
relatively complex and will be described in section 3, so let’s now assume that we have trained the
network, obtained the correct weights and biases, and we want to calculate the network output (prediction) for a given input. This process is also called inference, and because the information flows forward through the network, it's sometimes also called the forward pass. Our implementation uses the following conventions:
o We will index layers starting from 0: input layer has index 0, and output layer has index
LAYER_COUNT – 1. Because the input layer doesn’t do any computation, it might be possible
to skip it and start indexing from the first hidden layer, but we want to keep things clear and
simple.
o Connections between neurons will “belong” to the layer they are leading to. This means that
layer 0 (input layer) doesn’t have any connections, and layer index LAYER_COUNT – 1 has
connections from last hidden layer to the output layer.
o Neurons in each layer are indexed from 0 to NUM_NEURONS_PER_LAYER – 1
With this in mind, let’s define some helper functions to access data in global arrays:
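As a minimal sketch of what such helpers can look like (the base-offset array names here are assumptions; the actual implementation is in the accompanying source):

uint getNeuronDataIndex(uint layer, uint neuron)
{
    // Base offset = total number of neurons in all layers preceding this one
    return neuronDataBaseOffset[layer] + neuron;
}

uint getBiasDataIndex(uint layer, uint neuron)
{
    // Biases are stored per neuron; this helper name is an assumption
    return biasDataBaseOffset[layer] + neuron;
}

uint getConnectionDataIndex(uint layer, uint previousNeuron, uint neuron)
{
    // Weights of connections leading into 'layer', laid out per destination neuron
    return connectionDataBaseOffset[layer] + neuron * neuronsPerLayer[layer - 1] + previousNeuron;
}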
Base offsets used in these functions are pre-calculated for each layer based on number of layers
and number of neurons per layer in the network. Details can be found in the source code accompanying
this article in the function createComputePasses.
// Calculate activations for every layer, going forward through the MLP network
for (uint layer = 1; layer < LAYER_COUNT; layer++)
{
    const uint neuronCountCurrentLayer = neuronsPerLayer[layer];
    const uint neuronCountPreviousLayer = neuronsPerLayer[layer - 1];
    for (uint neuron = 0; neuron < neuronCountCurrentLayer; neuron++)
    {
        // Start with the bias of this neuron (the bias accessor name is an assumption)
        float neuronValue = nnBiases[getBiasDataIndex(layer, neuron)];
        // Add weighted activations of all neurons from the previous layer
        for (uint previousNeuron = 0; previousNeuron < neuronCountPreviousLayer; previousNeuron++)
        {
            const uint previousNeuronDataIndex = getNeuronDataIndex(layer - 1, previousNeuron);
            const uint weightDataIndex = getConnectionDataIndex(layer, previousNeuron, neuron);
            neuronValue += nnWeights[weightDataIndex] * activations[previousNeuronDataIndex];
        }
        // Apply the activation function to produce the final neuron activation
        activations[getNeuronDataIndex(layer, neuron)] = ACTIVATION_FUNCTION(neuronValue);
    }
}
Note that our code has allocated an array called activations where we store activations of all
neurons during the forward pass. But for inference we only need to store activations of 2 layers at any
time: the one that we are evaluating and the previous layer. As an optimization, we can allocate two
smaller arrays with the size NUM_MAX_NEURONS_PER_LAYER and ping-pong between them. However,
during the training we will need to store activations for all neurons from the forward pass to perform a
backpropagation pass.
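A minimal sketch of this ping-pong variant for inference, reusing the helpers above, could look like this:

// Inference-only forward pass storing activations of just 2 layers
// (activations of the input layer are assumed to be written to layerActivations[0] first)
float layerActivations[2][NUM_MAX_NEURONS_PER_LAYER];
uint curr = 1, prev = 0;

for (uint layer = 1; layer < LAYER_COUNT; layer++)
{
    for (uint neuron = 0; neuron < neuronsPerLayer[layer]; neuron++)
    {
        float neuronValue = nnBiases[getBiasDataIndex(layer, neuron)];
        for (uint previousNeuron = 0; previousNeuron < neuronsPerLayer[layer - 1]; previousNeuron++)
            neuronValue += nnWeights[getConnectionDataIndex(layer, previousNeuron, neuron)] *
                layerActivations[prev][previousNeuron];
        layerActivations[curr][neuron] = ACTIVATION_FUNCTION(neuronValue);
    }

    // Swap the roles of the two arrays for the next layer
    prev = curr;
    curr = 1 - curr;
}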
Some commonly used activation functions are ReLU (rectified linear unit), leaky ReLU, and sigmoid:
$$\sigma_{ReLU}(x) = \max(0, x) \qquad \sigma_{LeakyReLU}(x) = \begin{cases} 0.01x, & x < 0 \\ x, & x \geq 0 \end{cases} \qquad \sigma_{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
Even though ReLU and leaky ReLU are very simple, they work well and are widely used, e.g., by both the NeRF and NRC papers. The 0.01 value (slope) in the leaky ReLU implementation is a variable, and you can experiment with different values. Usually, the whole network uses the same activation function, but it is not uncommon for the output layer to have a different function than the hidden layers.
Note: the derivative of ReLU and leaky ReLU is undefined at zero, because these functions are not differentiable there (the derivative jumps at that point, even though the functions themselves are continuous), but in practice we simply use a value of 0 at that point.
For our sample application, we use leaky ReLU by default. The sigmoid function is also available, but training with it takes much longer compared to ReLU and leaky ReLU.
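For reference, a minimal sketch of these functions and their derivatives (which we'll need for backpropagation in section 3) – the helper names are mine, and the sample code may organize this differently:

float relu(float x)      { return max(0.0f, x); }
float leakyRelu(float x) { return x < 0.0f ? 0.01f * x : x; }
float sigmoid(float x)   { return 1.0f / (1.0f + exp(-x)); }

// Derivatives; the value returned at x == 0 is a practical choice, per the note above
float reluDerivative(float x)      { return x > 0.0f ? 1.0f : 0.0f; }
float leakyReluDerivative(float x) { return x > 0.0f ? 1.0f : 0.01f; }
float sigmoidDerivative(float x)   { float s = sigmoid(x); return s * (1.0f - s); }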
3 Training
3.1 How the Neural Network Learns
In this article, we implement neural network training based on a common supervised learning method. Supervised learning means that we evaluate the neural network on inputs for which we know the correct outputs that we'd like to get, and we adjust the network to give us results closer to those desired outputs. Each input is called a sample or an example, and all inputs available for training are called the training set.
For each sample from the training set, we start by comparing the network's prediction to the correct output using a cost function. This function tells us how good the prediction was (lower is better). Then,
we will update weights and biases in a way that minimizes this cost function. This is done using an
algorithm called gradient descent. The gist of it is that we first calculate how the cost function changes if
we change individual weights and biases, and then we adjust them in a direction that lowers the cost. In
the following text, we will often refer to “weights and biases”, but I’m only going to write “weights” to
keep the text more readable. Keep in mind that what applies to weights also applies to biases here.
To know how to adjust the weights to lower the cost, we need to calculate a value for each weight that tells us how the cost function changes when that specific weight changes. These values are called the partial derivatives of the cost function wrt. the weights. The vector of all partial derivatives of our weights is called the gradient.
Now we can imagine the cost function as a multi-dimensional surface. The gradient points in the direction where the cost function rises most steeply. We can think about gradient descent as a ball rolling down the hill on this surface (the position of the ball is given by the current network state and the cost function value). We want to get the ball as low as possible to minimize the cost. By calculating gradients (which point up the hill on this surface) and moving in the opposite direction, we basically roll the ball downhill to some local minimum. The gradient is calculated using an algorithm called backpropagation, which we'll discuss in section 3.5. The process of adjusting the weights is also called the optimization of the network.
Once we have calculated the gradient, we can simply nudge the weights in the opposite direction by some small amount. This method of minimizing the cost function will eventually get us to a local minimum of the cost function. Note that this method doesn't find the global minimum (unless it gets very lucky), but in practice this is not a huge issue for high-dimensional problems, which typically have many local minima with values similar to the global minimum.
We repeat this process many times to iteratively lower the cost. In each step, the weights are not adjusted by the whole magnitude of the gradient; we first multiply it by a small number (e.g., 0.0001) to take smaller steps. This multiplicative constant is called the learning rate, and it determines how fast gradient descent progresses along the path to the local minimum. Without the learning rate, gradient descent would take very large steps, being unstable and oscillating around the local minimum. Picking the right learning rate is critical to ensure that training is stable, but also sufficiently fast. In practice, we often use an adaptive learning rate: we start with the highest learning rate and lower it after each training step according to some schedule (e.g., linearly or exponentially).
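For example, an exponential schedule multiplies the initial learning rate $\eta_0$ by a decay factor $\gamma$ slightly below 1 raised to the number of training steps $t$ taken so far (a common pattern, not specific to our sample):

$$\eta_t = \eta_0 \cdot \gamma^t$$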
In practice, more advanced optimizers are used to adjust the weights, like the Adam optimizer that we'll describe in section 5.
The learning rate is what we call a hyperparameter – a parameter of a neural network implementation which is not learned by training, but rather set by the user. More advanced training algorithms have many hyperparameters. An important thing to realize is that we don't always have to set hyperparameters manually; we can have another optimization algorithm (even one based on machine learning) find the optimal values of the hyperparameters for us.
Let's now look at an overview of how the whole training algorithm with gradient descent works. A single training step does the following:
1. Run the forward pass to get the network's prediction for each sample from the training set.
2. Compare each prediction to the known correct output using the cost function.
3. Run the backpropagation pass to calculate the gradient of the cost function wrt. the weights and biases.
4. Adjust the weights and biases by the gradient multiplied by the learning rate.
Going through all the inputs in the training set once is called an epoch, and we'll need many epochs before the network converges to a local minimum of the cost function. Going through all the inputs in every step can be cumbersome, as the algorithm processes the whole data set every time. As an optimization, we can use a modified method called stochastic gradient descent (SGD), which performs each training step on a small random subset of the training set (called a batch).
Remember that our sample application wants to train the network to represent a 2D image, mapping UV coordinates to RGB texel values. We will use SGD for that, taking a batch of 2048 samples per training step, and we'll take one training step per frame. If we had to go through all the texels in every training step, the memory requirements and runtime performance would be unusable. For our example, we don't even need to store the training set explicitly; we can simply generate the desired number of random samples at run-time by randomly sampling the texture.
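A minimal sketch of generating one such sample (the rng() function returning a uniform random number in [0, 1) and its state are assumptions; the sample code has its own RNG):

// Pick a random position in the texture and load the corresponding texel as the target
const float2 uvs = float2(rng(rngState), rng(rngState));
const float3 target = targetTexture[uvs * float2(gData.outputWidth - 1, gData.outputHeight - 1)].rgb;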
Training for too long brings the risk of overfitting, where the network memorizes the training data instead of learning to generalize from it. To detect this, we should also track a validation error measured on a data set which is separate from the training set. This error will be high at the beginning (just like the training error) and will get lower with training, but after some time it will start to rise again. This rise happens at the point when the neural network stops generalizing well to data not seen during training, and it is the point when we should stop training. It is common to create a validation set from the available training data by splitting it into an 80% training set and a 20% validation set. It is, however, necessary to make sure that the training and validation sets don't mix. This method of stopping the training when the validation error starts rising is called early stopping.
Note that some algorithms like NRC never stop training – they run in real time to make sure that the neural network adapts to changes in the scene. Our sample application also runs the training continuously because, for our use case, generalization is not a problem; we simply want it to learn one input image as well as it can.
There are also other cost functions used in practice, like the L1 loss or L2 loss. We can use any function which gives a lower score to better outcomes, but the function must have a derivative, which we'll need for calculating the gradient.
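For reference, one common way to write these two losses for a single sample, summing over the output neurons (exact definitions vary, e.g., by averaging instead of summing):

$$cost_{L1} = \sum_i |target_i - O_i| \qquad cost_{L2} = \sum_i (target_i - O_i)^2$$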
The cost function is often also called a loss function, and its value is simply called the loss. Cost functions can also have additional regularization terms which impose an additional cost, e.g., for the case when the weights and biases get very large (this is called weight decay). This forces the weights to be as small as possible, which can help to prevent overfitting.
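As a sketch, an L2 weight decay term simply adds the sum of squared weights, scaled by a small constant $\lambda$, to the cost:

$$cost_{total} = cost + \lambda \sum_{i,j} weight_{ij}^2$$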
3.5 Backpropagation
Let’s now discuss how we calculate the gradient using the backpropagation algorithm. As the
name suggests, the algorithm starts at the output layer and continues backwards through the network. It
is therefore also called the backward pass through the network.
The goal of this algorithm is to calculate the partial derivatives of the cost function wrt. each weight and bias, which we designate as $\frac{\partial cost}{\partial weight_{ij}}$ and $\frac{\partial cost}{\partial bias_i}$. Before we start, let's formally define some things we'll need to calculate those partial derivatives:
$$Z_i = \sum_j w_{ij} O_j + b_i \qquad O_i = \sigma(Z_i)$$

where $Z_i$ is the output of neuron $i$ without the activation function applied, $w_{ij}$ is the weight of the connection to neuron $j$, $O_j$ is the activation of the connected neuron $j$, $b_i$ is the bias value, $O_i$ is the output with the activation function applied, and $\sigma$ is the activation function.
So how can we calculate these partial derivatives knowing the cost function value? The trick is to split the calculations of $\frac{\partial cost}{\partial weight_{ij}}$ and $\frac{\partial cost}{\partial bias_i}$ into several derivatives which are easier to calculate, and to combine them using the chain rule. Let's start with the simplest, which is $\frac{\partial cost}{\partial bias_i}$. This is the same as $\frac{\partial cost}{\partial Z_i}$, and it is also called the error of the neuron:

$$\frac{\partial cost}{\partial bias_i} = \frac{\partial cost}{\partial Z_i} = \frac{\partial cost}{\partial O_i} \cdot \frac{\partial O_i}{\partial Z_i}$$
We have now calculated the partial derivative for the bias of a neuron in the output layer, and we can use it to adjust this neuron's bias. Next, let's look at the derivatives of the weights connecting to this neuron, splitting them again like before:

$$\frac{\partial cost}{\partial weight_{ij}} = \frac{\partial cost}{\partial Z_i} \cdot \frac{\partial Z_i}{\partial weight_{ij}}$$
The term $\frac{\partial cost}{\partial Z_i}$ is the same error that we calculated before, and the new term $\frac{\partial Z_i}{\partial weight_{ij}}$ tells us how the neuron value without the activation function changes when we change that weight. This is the partial derivative of $(w_{ij} O_j + b_i)$ wrt. $w_{ij}$, which just boils down to the activation of the connected neuron:

$$\frac{\partial Z_i}{\partial weight_{ij}} = O_j$$
While these derivatives look intimidating at first, they boil down to a very simple calculation:

$$\frac{\partial cost}{\partial bias_i} = (target_i - O_i) \cdot \sigma'(O_i) \qquad \frac{\partial cost}{\partial weight_{ij}} = \frac{\partial cost}{\partial bias_i} \cdot O_j$$
With this knowledge, we can now implement gradient calculation for the output layer:
// Gradient of bias - this is also the error of the neuron, dCost_Z
// (the gradientBiases buffer and the derivative helper name are assumptions; see the source)
const uint neuronDataIndex = getNeuronDataIndex(LAYER_COUNT - 1, neuron);
const float dCost_Z = (target[neuron] - activations[neuronDataIndex]) *
    ACTIVATION_FUNCTION_DERIVATIVE(activations[neuronDataIndex]);
InterlockedAdd(gradientBiases[neuronDataIndex], packFloat(dCost_Z));

// Gradient of weights
for (uint previousNeuron = 0; previousNeuron < neuronCountPreviousLayer; previousNeuron++)
{
    const uint previousNeuronDataIndex = getNeuronDataIndex(LAYER_COUNT - 2, previousNeuron);
    const float dCost_weight = dCost_Z * activations[previousNeuronDataIndex];
    const uint weightIndex = getConnectionDataIndex(LAYER_COUNT - 1, previousNeuron, neuron);
    InterlockedAdd(gradientWeights[weightIndex], packFloat(dCost_weight));
}
What remains is to calculate the same partial derivatives for the hidden layers, which is just slightly more complicated. Because the neurons in a hidden layer are connected to other neurons, their $\frac{\partial cost}{\partial O_i}$ values do not depend on the cost function directly, but on the connected neurons. Specifically:

$$\frac{\partial cost}{\partial O_i} = \sum_j \frac{\partial cost}{\partial Z_j} \cdot \frac{\partial Z_j}{\partial O_i}$$
where $j$ is the j-th neuron connected to the output of the evaluated neuron $i$, $\frac{\partial cost}{\partial Z_j}$ is the error of the j-th neuron evaluated previously, and $\frac{\partial Z_j}{\partial O_i}$ tells us how the output of neuron $j$ without the activation function changes when the output of the evaluated neuron changes. This is the derivative of $(w_{ij} O_i + b_j)$ wrt. $O_i$, so it is simply the strength of the connection between the two neurons:

$$\frac{\partial Z_j}{\partial O_i} = weight_{ij}$$
𝜕𝑐𝑜𝑠𝑡
The code for hidden layer gradient is almost the same, except for the calculation, which is:
𝜕𝑂𝑖
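A minimal sketch of that accumulation, assuming an errors array that stores the previously calculated per-neuron errors $\frac{\partial cost}{\partial Z_j}$ (the exact names in the accompanying source may differ):

// Accumulate dCost_O for a hidden layer neuron from the errors of the following layer
float dCost_O = 0.0f;
for (uint nextNeuron = 0; nextNeuron < neuronsPerLayer[layer + 1]; nextNeuron++)
{
    const uint weightIndex = getConnectionDataIndex(layer + 1, neuron, nextNeuron);
    dCost_O += errors[getNeuronDataIndex(layer + 1, nextNeuron)] * nnWeights[weightIndex];
}
// The error of this neuron then follows as before: dCost_Z = dCost_O * sigma'(O_i)

For context, before any of the backward pass runs, the training kernel first loads the correct output for the selected sample and evaluates the network: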
// Load target value to learn for this input from reference image
const float3 target = targetTexture[uvs * float2(gData.outputWidth - 1,
gData.outputHeight - 1)].rgb;
// First run forward pass to evaluate network activations for given input
float activations[LAYER_COUNT * MAX_NEURONS_PER_LAYER];
forwardPass(uvs, activations);
The next step – optimization – runs once per batch after the whole gradient has been accumulated; it reads the gradient and adjusts the weights and biases accordingly.
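A minimal sketch of this pass for a single weight (unpackFloat, batchSize, and gData.learningRate are assumptions; averaging the accumulated gradient over the batch is one common choice):

// Apply the accumulated gradient to one weight (biases are updated the same way)
const float gradient = unpackFloat(gradientWeights[weightIndex]) / float(batchSize);
nnWeights[weightIndex] -= gData.learningRate * gradient;
gradientWeights[weightIndex] = 0; // Clear the accumulator for the next batch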
When training starts, the weights and biases must first be initialized. If all neurons started with the same values, they would all compute the same output and receive the same updates, and would never learn to detect different features. Because of this, we want to start with a different initial weight and bias setting for each neuron to “break the symmetry”. Therefore, we initialize the network with small random numbers centered around zero. We can take these numbers, e.g., from a uniform or normal distribution. To do that, we must pick a range in which to generate the random numbers, and in the case of the normal distribution also the standard deviation. These can either be hyperparameters tuned manually, or better, we can use one of the common strategies for setting them automatically. Such strategies are, e.g., LeCun uniform, Xavier uniform, and their variations using the normal distribution. The Xavier initialization was introduced in the paper Understanding the Difficulty of Training Deep Feedforward Neural Networks.
In our code sample we use the Xavier uniform initialization. It generates weights using a uniform random distribution within the range $[-x, x]$, where $x$ depends on the number of neurons in the two layers that the weight connects:

$$x = \sqrt{\frac{6}{n_{previousLayer} + n_{currentLayer}}}$$
Random initial values for the weights are enough to break the symmetry and ensure proper learning, and therefore it is common to initialize the biases to zero, or to some small constant.
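A minimal sketch of this initialization for a single weight connecting layers layer - 1 and layer (again assuming an rng() function returning a uniform random number in [0, 1)):

const float x = sqrt(6.0f / float(neuronsPerLayer[layer - 1] + neuronsPerLayer[layer]));
nnWeights[weightIndex] = (rng(rngState) * 2.0f - 1.0f) * x; // Uniform in [-x, x]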
4 Input Encodings
So far, we have assigned the network inputs directly to the input neurons. As mentioned in section 2, it usually pays off to first encode the inputs into a different, typically higher-dimensional, representation. Our sample uses a frequency encoding, which maps each input value to a series of sine and cosine functions of increasing frequency (the same family of encodings is used by NeRF). Having the inputs in a higher-dimensional space makes it easier for the neural network to discover more complex relationships between inputs and outputs. The implementation of this encoding can be found in the sample code in the function called frequencyEncoding.
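As a sketch of the idea (the exact frequency set and output layout are assumptions; see frequencyEncoding in the sample code for the actual implementation), a single input value can be expanded into 2 * FREQUENCY_COUNT input activations:

// Expand one input into sine/cosine pairs of increasing frequency (2^f * pi)
for (uint f = 0; f < FREQUENCY_COUNT; f++)
{
    const float scaled = input * exp2(f) * 3.14159265f;
    encoded[2 * f + 0] = sin(scaled);
    encoded[2 * f + 1] = cos(scaled);
}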
For more advanced encodings, I recommend looking at the one-blob encoding, introduced in the Neural Importance Sampling paper, which extends one-hot encoding to have multiple entries activated instead of just one. This essentially activates or shuts down parts of the network, depending on the input value.
Another useful encoding is the hash-grid encoding introduced in the Instant Neural Graphics Primitives with a Multiresolution Hash Encoding paper.
5 Improving the Network – Adam Optimizer
So far, we have applied gradients during training in a very simple way – we just multiplied them by the learning rate and subtracted the result from the current weights and biases. While this works, it is not optimal for several reasons: a fixed learning rate is usually suboptimal at the beginning, when we want to take larger steps to proceed faster, but also at the end, when the optimizer can oscillate around the optimal solution because the fixed step is too large to descend into the valley of the local minimum. In practice, we want to use an adaptive learning rate instead.
In this section we implement an improved optimizer called Adam (adaptive moment estimation),
described in the paper Adam: A Method for Stochastic Optimization. Adam adds two ingredients to the
optimization process: adaptive learning rate and momentum.
We have described the adaptive learning rate before, so let's now look at momentum. Remember our “ball rolling” example, where we stated that minimizing the cost function is like rolling a ball on its surface to find the lowest point. Just like in the real world, where the ball has a certain momentum, we can add momentum to our algorithm by taking into consideration the gradients from previous steps and applying a weighted average of them, instead of just the latest gradient. This way, we progress towards the optimum faster, and we have a higher chance of escaping shallow valleys with local minima to find an even lower local minimum.
Adam keeps running estimates of the mean and variance of each gradient component; these updated estimates are stored next to the weights and biases. Note that we still have to set a base learning rate when using this algorithm; the paper suggests a value of 0.001. In practice, the default values suggested in the paper, $\beta_1 = 0.9$ and $\beta_2 = 0.999$, are used as well. The default value for $\varepsilon$, which prevents division by zero, is $10^{-8}$. The values $\beta_1^T$ and $\beta_2^T$ are derived from $\beta_1$ and $\beta_2$ and adjusted after each training step as follows:
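$$\beta_1^T \leftarrow \beta_1^T \cdot \beta_1 \qquad \beta_2^T \leftarrow \beta_2^T \cdot \beta_2$$

so that after $t$ training steps, $\beta_1^T = \beta_1^t$ and $\beta_2^T = \beta_2^t$. For completeness, the per-weight update as given in the Adam paper – with gradient $g$, base learning rate $\eta$, and the stored mean $m$ and variance $v$ – is:

$$m \leftarrow \beta_1 m + (1 - \beta_1) g \qquad v \leftarrow \beta_2 v + (1 - \beta_2) g^2$$

$$\hat{m} = \frac{m}{1 - \beta_1^T} \qquad \hat{v} = \frac{v}{1 - \beta_2^T} \qquad w \leftarrow w - \eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \varepsilon}$$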
7 Next Topics
There is a lot more to deep learning than what I described so far in this article. Before we wrap
up, I want to mention at least two topics worth studying next, that you will encounter often when reading
deep learning papers focused on computer graphics: auto-encoders and convolutional networks.
An auto-encoder is a type of neural network consisting of an encoder and a decoder. The encoder translates input data into a more compact representation (in latent space), and the decoder is used to decompress it back to the original representation. It can be used for applications like data compression, denoising, and image reconstruction.
Convolutional neural networks (CNNs) are specialized networks used mostly in image processing. They contain two types of hidden layers: convolutional layers intended to detect certain features in images, and pooling layers which downsample intermediate results into a more compact representation.
While I was first learning about deep learning, I found great insight in the paper titled “Multilayer
Feedforward Networks are Universal Approximators” from 1989. It says that MLPs with at least one hidden
layer are universal approximators, meaning they can approximate any function to a degree allowed by
the network capacity. It also says that failure to approximate a given function can be attributed to
inadequate learning, insufficient network capacity, or non-deterministic relation between network inputs
and outputs. In practice, we can therefore use MLPs to replace parts of our algorithms which contain
difficult functions mapping data from one space to another.
8 Conclusion
In this article, I have described a few concepts and algorithms used in deep learning that I found
most interesting for my use case (representing an image using an MLP), but there is much more to explore.
With the knowledge provided here, I hope that you'll have more fun reading deep learning papers and resources and implementing your own deep learning ideas.
I recommend looking at the book Deep Learning: A Visual Approach by Glassner, the book Deep Learning by Goodfellow et al. which is freely available online, the deep learning videos by 3Blue1Brown, the neural network blog posts by demofox, the paper titled “Deep Learning” by LeCun et al. from 2015, and graphics papers such as NeRF and NRC. When implementing neural networks, I recommend reading the article How to accelerate AI applications on RDNA 3 using WMMA for insights into how to use the AI-accelerating instructions of current GPUs. Finally, I recommend experimenting with libraries like PyTorch, which enable much faster prototyping than implementing everything from scratch.