Deep Learning Curriculum
1. Biological Inspiration
● Input Layer
○ Accepting input data
● Hidden Layers
○ Feature extraction
● Output Layer
○ Producing final predictions
5. Activation Functions
● Sigmoid Function
○ Characteristics and limitations
● Hyperbolic Tangent (tanh)
○ Comparison with sigmoid
● ReLU (Rectified Linear Unit)
○ Advantages in mitigating vanishing gradients
● Leaky ReLU and Parametric ReLU
○ Addressing the dying ReLU problem
● Softmax Function
○ Multi-class classification outputs
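A minimal NumPy sketch of the activation functions listed above (the toy input vector is illustrative):
```python
import numpy as np

def sigmoid(x):
    # Squashes inputs to (0, 1); saturates for large |x|, which contributes to vanishing gradients.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered alternative to the sigmoid, output in (-1, 1).
    return np.tanh(x)

def relu(x):
    # Passes positives unchanged and zeroes out negatives; helps mitigate vanishing gradients.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope for negative inputs avoids the "dying ReLU" problem.
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    # Turns a vector of logits into a probability distribution over classes.
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), softmax(x))
```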
6. Forward Propagation
7. Loss Functions
8. Backpropagation
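A minimal PyTorch sketch tying these three steps together; the layer sizes, data, and choice of cross-entropy loss are illustrative:
```python
import torch
import torch.nn as nn

# Toy two-layer network: forward pass, loss computation, and backpropagation.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
criterion = nn.CrossEntropyLoss()          # a common loss for multi-class classification

x = torch.randn(16, 4)                     # batch of 16 inputs
y = torch.randint(0, 3, (16,))             # integer class labels

logits = model(x)                          # forward propagation
loss = criterion(logits, y)                # loss function
loss.backward()                            # backpropagation: gradients for every parameter

print(loss.item(), model[0].weight.grad.shape)
```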
9. Optimization Algorithms
● Momentum
○ Accelerating SGD
● Nesterov Accelerated Gradient
○ Looking ahead to the future position
● AdaGrad
○ Adaptive learning rates
● RMSProp
○ Fixing AdaGrad's diminishing learning rates
● Adam
○ Combining momentum and RMSProp
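A sketch of how these optimizers might be instantiated with PyTorch's torch.optim; the learning rates and other hyperparameters are illustrative defaults, not recommendations:
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # any model would do; a single layer keeps the example short

# SGD with momentum accumulates a velocity term that accelerates consistent gradient directions.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Nesterov accelerated gradient evaluates the gradient at the "looked-ahead" position.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# AdaGrad adapts each parameter's learning rate using accumulated squared gradients.
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

# RMSProp replaces AdaGrad's ever-growing accumulator with a moving average,
# so the effective learning rate does not shrink toward zero.
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

# Adam combines momentum (first moment) with RMSProp-style scaling (second moment).
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# All of them are used the same way inside the training loop:
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```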
10. Regularization Techniques
● L1 and L2 Regularization
○ Adding penalty terms to the loss function
● Dropout
○ Preventing overfitting by randomly dropping neurons
● Early Stopping
○ Halting training when validation loss increases
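A sketch combining the three techniques in PyTorch; the validate() helper is a hypothetical stand-in for a real validation pass, and all hyperparameters are illustrative:
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Dropout(p=0.5),          # randomly zeroes 50% of activations during training
                      nn.Linear(64, 2))

# weight_decay adds an L2 penalty on the weights (an L1 penalty would need an explicit term in the loss).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def validate(model):
    # Hypothetical stand-in: a real version would return the loss on a held-out validation set.
    return torch.rand(1).item()

# Early stopping: halt once validation loss has not improved for `patience` consecutive epochs.
best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    # ... one epoch of training would go here ...
    val_loss = validate(model)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```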
11. Hyperparameter Tuning
● Learning Rate
○ Impact on convergence
● Batch Size
○ Trade-offs between speed and stability
● Number of Epochs
○ Avoiding overfitting
● Network Architecture
○ Deciding depth and width
● Techniques
○ Grid search
○ Random search
○ Bayesian optimization
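A sketch of grid search versus random search over a small illustrative search space; train_and_evaluate() is a hypothetical placeholder for training a model and returning a validation score:
```python
import itertools
import random

def train_and_evaluate(lr, batch_size, num_layers):
    # Placeholder: train a model with these hyperparameters and return validation accuracy.
    return random.random()

grid = {"lr": [1e-3, 1e-2], "batch_size": [32, 128], "num_layers": [2, 4]}

# Grid search: exhaustively try every combination.
best_grid = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=lambda cfg: train_and_evaluate(**cfg),
)

# Random search: sample a fixed budget of random configurations.
samples = [{k: random.choice(v) for k, v in grid.items()} for _ in range(10)]
best_random = max(samples, key=lambda cfg: train_and_evaluate(**cfg))

print(best_grid, best_random)
```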
12. Weight Initialization
● Xavier/Glorot Initialization
● He Initialization
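A sketch of both schemes using PyTorch's torch.nn.init (layer sizes are illustrative):
```python
import torch.nn as nn

layer_tanh = nn.Linear(256, 256)
layer_relu = nn.Linear(256, 256)

# Xavier/Glorot keeps activation variance stable for symmetric activations such as tanh.
nn.init.xavier_uniform_(layer_tanh.weight)

# He initialization accounts for ReLU zeroing out half of its inputs.
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")

nn.init.zeros_(layer_tanh.bias)
nn.init.zeros_(layer_relu.bias)
```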
Convolutional Neural Networks (CNNs)
1. Limitations of Fully Connected Networks
● High dimensionality
● Lack of spatial invariance
2. Advantages of CNNs
● Parameter sharing
● Local connectivity
3. Convolution Operation
● Understanding Kernels/Filters
○ Edge detection filters
○ Feature extraction
● Mathematical Representation
○ Convolution in 2D and 3D
● Hyperparameters
○ Kernel size, depth
● Stride and Padding
○ Controlling output dimensions
○ Types of padding: same vs. valid
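For a square input of size n, kernel size k, padding p, and stride s, the output size is floor((n + 2p - k) / s) + 1. A sketch with PyTorch's nn.Conv2d showing valid versus same padding and the effect of stride (shapes are illustrative):
```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # (batch, channels, height, width)

# "Valid" convolution: no padding, so the spatial size shrinks: (32 - 3)/1 + 1 = 30.
valid = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=0)

# "Same" convolution (for stride 1): pad by kernel_size // 2 so the output stays 32x32.
same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)

# Stride 2 halves the spatial resolution: floor((32 + 2*1 - 3)/2) + 1 = 16.
strided = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

print(valid(x).shape)    # torch.Size([1, 16, 30, 30])
print(same(x).shape)     # torch.Size([1, 16, 32, 32])
print(strided(x).shape)  # torch.Size([1, 16, 16, 16])
```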
4. Activation Functions
5. Pooling Layers
● Purpose
○ Dimensionality reduction
○ Translation invariance
● Types of Pooling
○ Max pooling
○ Average pooling
● Pooling Size and Stride
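A sketch of both pooling types in PyTorch (the 2x2 window and stride are common choices, used here for illustration):
```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 16, 32, 32)

# Max pooling keeps the strongest activation in each window (detects whether a feature is present).
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling smooths each window instead of taking its maximum.
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(feature_map).shape)  # torch.Size([1, 16, 16, 16]), resolution halved
print(avg_pool(feature_map).shape)  # torch.Size([1, 16, 16, 16])
```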
8. CNN Architecture
● Layer Stacking
● Feature Maps
● Visualization
10. Data Augmentation
● Techniques
○ Rotation, flipping, cropping
○ Color jitter, noise addition
● Purpose
○ Reducing overfitting
○ Increasing dataset diversity
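A sketch of such an augmentation pipeline, assuming torchvision; the specific transforms and their parameters are illustrative:
```python
import torch
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                                 # flipping
    transforms.RandomRotation(degrees=15),                                  # rotation
    transforms.RandomCrop(32, padding=4),                                   # cropping after padding
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # color jitter
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.01 * torch.randn_like(t)),            # mild Gaussian noise
])
# Pass transform=train_transforms to a dataset,
# e.g. torchvision.datasets.CIFAR10(root="data", train=True, transform=train_transforms).
```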
11. LeNet-5
● Architecture Details
○ Layers, activations
● Contributions
○ Handwritten digit recognition
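A LeNet-5-style sketch in PyTorch using the commonly cited layer sizes (two convolution/pooling stages followed by 120-84-10 fully connected layers); details such as the exact activation and subsampling differ slightly from the original paper:
```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    # LeNet-5-style CNN for 32x32 grayscale digit images.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32 -> 28x28
            nn.AvgPool2d(2),                               # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # 14x14 -> 10x10
            nn.AvgPool2d(2),                               # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```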
12. AlexNet
● Breakthroughs
○ Deeper network
○ Use of ReLU
● Impact on ImageNet Challenge
● Inception Modules
○ Parallel convolutional layers
● Motivation
○ Efficient computation
● Residual Blocks
○ Identity mappings
○ Shortcut connections
● Solving Vanishing Gradient Problem
● Variants
○ ResNet-50, ResNet-101
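ResNet-50 and ResNet-101 use a deeper bottleneck variant, but the basic two-convolution block below illustrates the identity shortcut; a sketch in PyTorch:
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic ResNet-style block: two 3x3 convolutions plus an identity shortcut.
    # The shortcut lets gradients flow directly, mitigating vanishing gradients in deep stacks.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # identity mapping added before the final ReLU

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```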
16. MobileNets
Semantic Segmentation
21. U-Net
● Encoder-Decoder Architecture
● Skip Connections
22. Autoencoders
● Convolutional Autoencoders
○ Image reconstruction
● Variational Autoencoders (VAE)
23. Generative Adversarial Networks (GANs)
● DCGAN
○ Using CNNs in GANs
● Applications
○ Image generation
○ Super-resolution
Recurrent Neural Networks (RNNs)
1. Architecture of RNNs
● Vanishing Gradients
○ Gradients diminish over long sequences
● Exploding Gradients
○ Gradients grow exponentially
● Solutions
○ Gradient clipping
○ Advanced architectures (e.g., LSTMs, GRUs)
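A sketch of gradient clipping inside a PyTorch training step (the model, data, and placeholder loss are illustrative):
```python
import torch
import torch.nn as nn

# Gradient clipping caps the global gradient norm to combat exploding gradients in RNNs.
rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.01)

x = torch.randn(4, 50, 10)                      # (batch, sequence length, features)
output, h_n = rnn(x)
loss = output.pow(2).mean()                     # placeholder loss for illustration
loss.backward()

torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)  # rescale gradients if their norm exceeds 1.0
optimizer.step()
```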
5. LSTM
6. GRU
7. Deep RNNs
8. Bidirectional RNNs
9. Applications of RNNs
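A sketch of the gated, stacked, and bidirectional variants using PyTorch's built-in recurrent layers (all sizes are illustrative):
```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 16)   # (batch, time steps, features)

# LSTM and GRU use gating to preserve information over long sequences.
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)   # deep (stacked) RNN
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

# A bidirectional RNN reads the sequence forwards and backwards; the outputs are concatenated.
bi_lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

out, (h, c) = lstm(x)      # out: (8, 20, 32)
out_bi, _ = bi_lstm(x)     # out_bi: (8, 20, 64), i.e. 2 directions x hidden_size
print(out.shape, out_bi.shape)
```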
Seq2Seq Networks
1. Encoder-Decoder Networks
● Encoder
○ Processes the input sequence and encodes it into a fixed-length context vector.
○ Architecture: Typically uses Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), or Gated Recurrent Units (GRUs).
● Decoder
○ Generates the output sequence from the context vector.
○ Architecture: Similar to the encoder, but generates the output sequence one token at a time, conditioned on the context vector and previously generated tokens.
C. Mathematical Formulation
D. Implementation Details
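In the usual formulation, the encoder updates h_t = f(x_t, h_(t-1)) and the context vector is its final state c = h_T; the decoder updates s_t = g(y_(t-1), s_(t-1), c) and predicts p(y_t | y_<t, x) = softmax(W s_t). A minimal GRU-based sketch in PyTorch (vocabulary sizes, dimensions, and the teacher-forced inputs are illustrative):
```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # Minimal GRU encoder-decoder: the encoder compresses the source sequence into a
    # fixed-length context vector (its final hidden state), which initialises the decoder.
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, context = self.encoder(self.src_embed(src_tokens))    # context: (1, batch, hidden)
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), context)
        return self.out(dec_out)                                  # logits per target position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))    # two source sequences of length 7
tgt = torch.randint(0, 1200, (2, 5))    # teacher-forced target inputs
print(model(src, tgt).shape)            # torch.Size([2, 5, 1200])
```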
2. Attention Mechanisms
A. Additive Attention
● Concept
○ Calculates alignment scores using a feedforward network
● Characteristics
○ Considered more computationally intensive due to additional parameters.
B. Dot-Product Attention
● Concept
○ Calculates alignment scores using dot products.
○ Scaled Dot Product: divides the scores by the square root of the key dimension so their magnitude stays stable.
● Characteristics
○ More efficient than additive attention.
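A sketch of scaled dot-product attention in PyTorch; the tensor shapes are illustrative, and the optional mask argument anticipates the masked attention used in decoders:
```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Alignment scores are dot products between queries and keys, scaled by sqrt(d_k)
    # so their magnitude does not grow with the key dimensionality.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., query_len, key_len)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                # attention distribution
    return weights @ V, weights                            # weighted sum of the values

Q = torch.randn(2, 5, 64)   # (batch, query positions, d_k)
K = torch.randn(2, 7, 64)
V = torch.randn(2, 7, 64)
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)   # torch.Size([2, 5, 64]) torch.Size([2, 5, 7])
```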
3. Transformer Architectures
A. Limitations of RNNs
● Sequential Processing
○ RNNs process inputs sequentially, hindering parallelization.
● Long-Term Dependencies
○ Difficulty in capturing relationships between distant tokens.
B. Introduction to Transformers
● Key Innovations
○ Self-Attention Mechanism: Allows the model to relate different positions of a single sequence to compute representations.
○ Positional Encoding: Injects information about the position of the tokens in the sequence.
● Advantages
○ Improved parallelization.
○ Better at capturing global dependencies.
C. Components of Transformer Architecture
1. Multi-Head Self-Attention
● Concept
○ Multiple attention mechanisms (heads) operating in parallel.
● Process
○ Query (Q), Key (K), and Value (V) matrices are computed from input embeddings.
○ The attention mechanism calculates a weighted sum of the values, with weights derived from the queries and keys.
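A sketch using PyTorch's built-in nn.MultiheadAttention, where self-attention corresponds to passing the same tensor as query, key, and value (dimensions are illustrative):
```python
import torch
import torch.nn as nn

# Multi-head self-attention: 8 heads, each attending over the same sequence in parallel.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

x = torch.randn(4, 10, 256)                   # (batch, sequence length, embedding dim)
attn_output, attn_weights = mha(x, x, x)      # self-attention: Q = K = V = x
print(attn_output.shape, attn_weights.shape)  # (4, 10, 256) and (4, 10, 10)
```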
2. Positional Encoding
● Purpose
○ Since transformers do not have recurrence or convolution, positional encoding provides the model with information about the position of each token.
● Techniques
○ Sinusoidal Functions: fixed sine and cosine patterns of varying frequency
○ Learned Embeddings
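The sinusoidal scheme from the original Transformer paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A sketch (sequence length and model dimension are illustrative):
```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len).unsqueeze(1).float()                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                 # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=128)
# Added to the token embeddings so each position receives a distinct, fixed signature.
embeddings = torch.randn(50, 128) + pe
print(pe.shape)   # torch.Size([50, 128])
```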
3. Feedforward Networks
● Architecture
○ Position-wise fully connected layers applied independently to each position.
● Activation Functions
○ Typically ReLU or GELU.
4. Layer Normalization
● Purpose
○ Normalizes inputs across the features to stabilize and accelerate training.
5. Residual Connections
● Purpose
○ Helps in training deeper networks by mitigating the vanishing gradient problem.
● Implementation
○ Adding the input of a layer to its output before applying the activation function.
D. Encoder and Decoder Stacks
● Encoder Stack
○ Composed of multiple identical layers, each containing:
■ Multi-head self-attention layer.
■ Feedforward network.
● Decoder Stack
○ Similar to the encoder but includes:
■ Masked multi-head self-attention layer to prevent positions from attending to subsequent positions.
■ Encoder-decoder attention layer.
E. Implementing Transformers
● Key Steps
○ Embedding Layer: Converts input tokens into dense vectors.
○ Adding Positional Encoding: Combines positional information with embeddings.
○ Building Encoder and Decoder Layers: Stack multiple layers as per the architecture.
○ Output Layer: Generates final predictions, often followed by a softmax function.
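A sketch wiring these steps together with PyTorch's built-in encoder layers; it uses learned positional embeddings (one of the two techniques listed earlier), and the vocabulary size and dimensions are illustrative:
```python
import torch
import torch.nn as nn

class TinyTransformerEncoderModel(nn.Module):
    # Wires together the key steps above: token embedding, positional information,
    # a stack of encoder layers, and an output projection (softmax is applied at loss time).
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                       # embedding layer
        self.pos_embed = nn.Embedding(max_len, d_model)                      # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)   # stacked encoder layers
        self.output = nn.Linear(d_model, vocab_size)                         # output layer

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos_embed(positions)                   # add positional information
        return self.output(self.encoder(x))                                  # logits per position

model = TinyTransformerEncoderModel()
tokens = torch.randint(0, 1000, (2, 16))
print(model(tokens).shape)   # torch.Size([2, 16, 1000])
```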
4. Types of Transformers
A. GPT (Decoder-Only Models)
● Purpose
○ Focused on language generation tasks.
● Architecture
○ Uses only the decoder part of the transformer with masked self-attention to prevent information flow from future tokens.
● Training Objective
○ Causal Language Modeling (CLM): Predicting the next word in a sequence.
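A sketch of the causal (look-ahead) mask and the next-token targets used for causal language modeling (token values are illustrative):
```python
import torch

seq_len = 6
# Causal mask: True marks positions a token is NOT allowed to attend to,
# i.e. everything to its right, as used in decoder-only models.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)

# For causal language modeling, the target at each position is simply the next token:
tokens = torch.tensor([[5, 8, 2, 9, 1, 3]])
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens up to t
```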
B. Other Notable Variants
● RoBERTa
○ Improves on BERT by training with larger batches and more data.
● ALBERT
○ Reduces model size by sharing parameters and factorizing embeddings.
● T5 (Text-to-Text Transfer Transformer)
○ Treats every NLP task as a text-to-text problem.
5. Fine-Tuning Transformers
A. Concept of Fine-Tuning
● Transfer Learning
○ Adapting a pre-trained model to a downstream task with task-specific data.
B. Steps in Fine-Tuning
C. Best Practices
D. Common Fine-Tuning Tasks
● Text Classification
● Named Entity Recognition
● Question Answering
● Text Summarization
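A sketch of fine-tuning for text classification with the Hugging Face transformers library; the checkpoint name and hyperparameters are illustrative, and train_ds / val_ds stand in for real tokenized datasets:
```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder checkpoint and datasets; replace with the task's own data.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

args = TrainingArguments(
    output_dir="finetuned-model",
    learning_rate=2e-5,                  # small learning rate so pre-trained weights change gently
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```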
6. Pre-Training Transformers
A. Pre-Training Objectives
B. Data Preparation
● Corpus Selection
○ Large and diverse datasets (e.g., Wikipedia, Common Crawl).
● Tokenization Strategies
○ WordPiece: Used by BERT.
○ Byte-Pair Encoding (BPE): Used by GPT.
C. Training Strategies
● Distributed Training
○ Using multiple GPUs or TPUs.
● Mixed Precision Training
○ Reduces memory usage and increases speed.
● Optimization Algorithms
○ Adam optimizer with weight decay (AdamW).
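A sketch of one training step combining AdamW with PyTorch's automatic mixed precision (assumes a CUDA device; sizes and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()        # assumes a CUDA-capable GPU is available
# AdamW: Adam with decoupled weight decay, the usual choice for transformer pre-training.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid fp16 gradient underflow

x = torch.randn(32, 512, device="cuda")
with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
    loss = model(x).pow(2).mean()         # placeholder loss for illustration

scaler.scale(loss).backward()             # backward pass on the scaled loss
scaler.step(optimizer)                    # unscales the gradients, then steps the optimizer
scaler.update()
optimizer.zero_grad()
```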
D. Challenges in Pre-Training
● Compute Resources
○ Requires significant computational power.
● Data Quality
○ Noisy data can affect model performance.
E. Evaluation of Pre-Trained Models
● Benchmarking
○ Using datasets like GLUE, SQuAD to assess performance.
● Ablation Studies
○ Understanding the impact of different components.
7. Optimizing Transformers
A. Computational Challenges
B. Optimization Techniques
1. Efficient Attention Mechanisms
● Sparse Attention
○ Reduces the number of computations by focusing on local patterns.
● Linearized Attention (Linformer)
○ Approximates attention to reduce complexity.
● Reformer
○ Uses locality-sensitive hashing to reduce complexity.
2. Model Compression
● Quantization
○ Reducing the precision of weights (e.g., from 32-bit to 8-bit).
● Pruning
○ Removing less important weights or neurons.
● Knowledge Distillation
○ Training a smaller model (student) to replicate the behavior of a larger model (teacher).
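A sketch of post-training dynamic quantization with PyTorch (the toy model is illustrative); pruning and distillation follow different APIs and training setups:
```python
import torch
import torch.nn as nn

# Dynamic quantization: weights of the listed layer types are stored in int8 and
# dequantized on the fly, shrinking the model and often speeding up CPU inference.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized)   # the Linear layers are replaced by dynamically quantized versions
```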
C. Hardware Considerations
D. Software Tools
● Optimized Libraries
○ Hugging Face Transformers: Provides optimized implementations.
○ DeepSpeed: Optimizes memory and computation.
○ NVIDIA Apex: Enables mixed precision training.
8. Applications of Transformers
A. Text Classification
● Sentiment Analysis
○ Classifying text as positive, negative, or neutral.
● Topic Classification
○ Categorizing text into predefined topics.
B. Question Answering
● Implementing QA Systems
○ Using models like BERT to find answers within a context.
● Datasets
○ SQuAD, TriviaQA.
C. Machine Translation
● Transformer Models
○ Implementing translation systems without RNNs.
● Datasets
○ WMT datasets.
D. Text Summarization
● Abstractive Summarization
○ Generating concise summaries using models like T5.
● Datasets
○ CNN/Daily Mail, Gigaword.
E. Language Generation
● Chatbots
○ Creating conversational agents using GPT models.
● Story Generation
○ Generating coherent narratives.
F. Named Entity Recognition
● Sequence Labeling
○ Identifying entities like names, locations, dates.
● Fine-Tuning
○ Adapting pre-trained models for NER tasks.
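A sketch of several of these applications through Hugging Face pipelines, which bundle a default pre-trained model and tokenizer per task (the default checkpoints are downloaded on first use, and the example inputs are illustrative):
```python
from transformers import pipeline

# Text classification / sentiment analysis.
sentiment = pipeline("sentiment-analysis")
print(sentiment("The curriculum is well structured."))

# Extractive question answering over a given context.
qa = pipeline("question-answering")
print(qa(question="What does the encoder produce?",
         context="The encoder compresses the input sequence into a context vector."))

# Summarization and named entity recognition follow the same pattern.
summarizer = pipeline("summarization")
ner = pipeline("ner")
```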