ML Notes
UNIT-I
Introduction: Machine learning, terminologies in machine learning, Perspectives and issues in machine
learning, application of Machine learning, Types of machine learning: supervised, unsupervised, semi-
supervised learning. Review of probability, Basic Linear Algebra in Machine Learning Techniques, Dataset
and its types, Data preprocessing, Bias and Variance in Machine learning, Function approximation,
Overfitting
UNIT-II
Regression Analysis in Machine Learning: Introduction to regression and its terminologies, Types of
regression, Logistic Regression
Simple Linear regression: Introduction to Simple Linear Regression and its assumption, Simple Linear
Regression Model Building, Ordinary Least square estimation, Properties of the least-squares estimators and
the fitted regression model, Interval estimation in simple linear regression, Residuals
Multiple Linear Regression: Multiple linear regression model and its assumption.
Interpret Multiple Linear Regression Output (R-Square, Standard error, F, Significance F, Coefficient P
values)
Assess the fit of the multiple linear regression model (R squared, Standard error)
Feature Selection and Dimensionality Reduction: PCA, LDA, ICA
UNIT-III
Introduction to Classification and Classification Algorithms: What is Classification, General Approach to
Classification, k-Nearest Neighbour Algorithm, Random Forests, Fuzzy Set Approaches
Support Vector Machine: Introduction, Types of support vector kernel – (Linear kernel, polynomial kernel,
and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues in SVM.
Decision Trees: Decision tree learning algorithm, ID3 algorithm, Inductive bias, Entropy and information
theory, Information gain, Issues in Decision tree learning.
Bayesian Learning - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier,
Bayesian belief networks, EM algorithm.
Ensemble Methods: Bagging, Boosting, AdaBoost, and XGBoost
Classification Model Evaluation and Selection: Sensitivity, Specificity, Positive Predictive Value, Negative
Predictive Value, Lift Curves and Gain Curves, ROC Curves, Misclassification Cost Adjustment to Reflect
Real-World Concerns, Decision Cost/Benefit Analysis
UNIT – IV
Introduction to Cluster Analysis and Clustering Methods: The Clustering Task and the Requirements for
Cluster Analysis.
Overview of Some Basic Clustering Methods: - k-Means Clustering, k-Medoids Clustering,
Density-Based Clustering: DBSCAN - Density-Based Clustering Based on Connected Regions with High
Density, Gaussian Mixture Model algorithm, Balanced Iterative Reducing and Clustering using Hierarchies
(BIRCH), Affinity Propagation clustering algorithm, Mean-Shift clustering algorithm, Ordering Points to
Identify the Clustering Structure (OPTICS) algorithm, Agglomerative Hierarchical clustering algorithm,
Divisive Hierarchical clustering algorithm, Measuring Clustering Goodness
UNIT 1
Machine Learning (ML)
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to
automatically learn and improve from experience without being explicitly programmed.
In simpler terms, it allows machines to make decisions or predictions based on data. The core concept
revolves around the idea that systems can learn from data, identify patterns, and make decisions with
minimal human intervention.
Key Terminologies:
1. Model:
A mathematical representation of a process that the machine learning algorithm tries to learn from data.
Example: A linear regression model that predicts house prices based on features like size and
location.
2. Algorithm:
The method or procedure used to train the model from data. It defines the logic and rules by which the
model makes predictions.
Example: Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbours.
3. Training:
The process of feeding data into a machine learning algorithm to build a model.
Example: Training a neural network on labelled images to classify them.
4. Training Data:
The dataset used to teach the model. The model learns patterns, relationships, and trends from this data.
Example: A dataset containing labelled data of houses with their features and corresponding
prices.
5. Test Data:
The dataset used to evaluate the performance of a trained model. This data has not been used during the
training phase and is meant to test the model’s generalization ability.
Example: A separate set of house prices that the model has not seen during training.
6. Feature:
An individual measurable property or characteristic of the data. Features are the input variables that help
the model make predictions.
Example: In a house price prediction model, features could include the number of bedrooms,
location, and size of the house.
7. Label:
The output or result that the model is trying to predict. In supervised learning, labels are known and
used to train the model.
Example: The actual price of a house in the house price prediction model.
8. Overfitting:
When a model learns the training data too well, including noise and irrelevant details, causing it to
perform poorly on new, unseen data.
Example: A decision tree that perfectly predicts the training data but performs badly on test
data.
9. Underfitting:
When a model is too simple and fails to capture the underlying trends in the data, leading to poor
performance on both training and test data.
Example: A linear model trying to fit complex, non-linear data and failing to capture the data's
nuances.
10. Recall (Sensitivity):
The ratio of correctly predicted positive observations to all actual positives. It shows how well the
model identifies positive cases.
Formula: Recall = TP / (TP + FN), where TP = True Positives and FN = False Negatives.
11. F1 Score:
The harmonic mean of precision and recall. It provides a balance between precision and recall, especially
when dealing with imbalanced datasets.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Types of Machine Learning
1. Supervised Learning
Definition:
In supervised learning, the algorithm is trained on labelled data, where each input is paired with a known
output, and the goal is to learn a mapping from inputs to outputs.
How it works:
1. The model is provided with training data containing input-output pairs.
2. The model makes predictions on the input data.
3. The prediction is compared to the actual output using a loss function.
4. The model’s parameters are adjusted to reduce the loss, and the process repeats until performance is
acceptable.
Examples of Supervised Learning:
Classification: The task of predicting a discrete label from the input data.
Example: Email spam detection, where emails are classified as "spam" or "not spam."
Regression: The task of predicting a continuous value based on input data.
Algorithms Used in Supervised Learning:
Linear Regression
Logistic Regression
Decision Trees
Random Forests
Support Vector Machines (SVM)
Neural Networks
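To make the workflow above concrete, here is a minimal sketch using scikit-learn; the synthetic dataset, the choice of logistic regression, and the split ratio are illustrative assumptions, not part of the notes.

```python
# Hedged sketch of the supervised-learning workflow (illustrative only):
# train on labelled data, predict on unseen data, compare predictions to true labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic labelled data: X holds the features, y holds the known labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out test data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()          # the algorithm
model.fit(X_train, y_train)           # training on input-output pairs
predictions = model.predict(X_test)   # predictions on unseen inputs

# Compare predictions with the actual outputs (here via accuracy).
print("Test accuracy:", accuracy_score(y_test, predictions))
```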
2. Unsupervised Learning
Definition:
In unsupervised learning, the algorithm is trained on data that does not have any labelled output. The
goal is to discover hidden patterns, structures, or relationships in the data.
Key Characteristics:
Unlabeled Data: The model is provided with input data without corresponding output labels.
Goal: Find patterns, groupings, or structure in the data.
Applications: Primarily used for clustering, association, and dimensionality reduction.
How it works:
1. The algorithm explores the input data and tries to learn the underlying patterns.
2. The model groups similar data points together or identifies hidden relationships between data
features.
Examples of Unsupervised Learning:
Clustering: Grouping data into clusters where points in the same group are more similar to each
other than to those in other groups.
Association: Discovering relationships or associations between variables in large datasets.
Dimensionality Reduction: Reducing the number of features in the data while preserving the
key information.
Algorithms Used in Unsupervised Learning:
K-Means Clustering
Hierarchical Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Apriori Algorithm (for association rule learning)
Principal Component Analysis (PCA)
t-SNE (t-Distributed Stochastic Neighbor Embedding)
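As a small, hedged illustration of dimensionality reduction, the sketch below applies PCA to random data; the data and the choice of two components are assumptions made only for demonstration.

```python
# Minimal PCA sketch: compress 10 features into 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # unlabeled data: 100 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```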
3. Semi-Supervised Learning
Definition:
Semi-supervised learning is a hybrid approach that combines both labeled and unlabeled data. It lies
between supervised and unsupervised learning. In many real-world applications, obtaining labeled data is
expensive or time-consuming, while unlabeled data is abundant.
Semi-supervised learning leverages a small amount of labeled data with a large amount of unlabeled data to
improve learning accuracy.
Key Characteristics:
Combination of Labeled and Unlabeled Data: A small portion of the data is labeled, and a large
portion is unlabeled.
Goal: Use labeled data to guide the learning process, but also leverage the unlabeled data to uncover
additional patterns or relationships.
Applications: Often used in situations where labeled data is scarce or expensive to obtain.
How it works:
1. The algorithm starts by learning from the small set of labeled data.
2. Then, it uses the patterns learned from the labeled data to label the unlabeled data or learn
hidden structures.
3. The model improves its performance by incorporating both labeled and unlabeled data in its
training process.
Examples of Semi-Supervised Learning:
Image Classification: Labeling thousands of images manually can be labor-intensive, so a small
set of labeled images is used along with a large set of unlabeled images.
Speech Recognition: Manually labeling vast amounts of speech data is costly. Semi-supervised
learning can be used to improve speech recognition systems with minimal labeled data.
Algorithms Used in Semi-Supervised Learning:
Self-training
Co-training
Generative Models (such as Variational Autoencoders or Gaussian Mixture Models)
Graph-Based Methods (such as Label Propagation)
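Below is a hedged sketch of graph-based semi-supervised learning using scikit-learn's LabelPropagation; the tiny two-cluster dataset is an assumption, and marking unlabeled points with -1 follows the library's convention.

```python
# Semi-supervised sketch: a few labelled points guide labels for the rest.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # cluster A
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])  # cluster B
# Only two points are labelled; -1 marks the unlabeled samples.
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation()
model.fit(X, y)

print(model.transduction_)   # labels inferred for all six points
```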
Review of Probability
Experiment: Any process that leads to a well-defined outcome, for example rolling a die or flipping a coin.
Outcome: A possible result of an experiment.
Sample Space (S): The set of all possible outcomes of an experiment.
Event (E): A subset of the sample space. It represents one or more outcomes that are of interest.
Probability (P): A numerical value between 0 and 1 that represents the likelihood of an event occurring (0 means the event is impossible, 1 means it is certain).
Basic Probability Concepts in Machine Learning
1. Random Variable:
o A random variable is a variable whose possible values are outcomes of a random phenomenon.
o Types:
Discrete Random Variable: Takes on distinct values (e.g., number of heads in coin
tosses).
Continuous Random Variable: Takes on any value within a range (e.g., temperature).
2. Probability Distribution:
o Describes how the probabilities are distributed over the values of a random variable.
o For Discrete Random Variables: Probability Mass Function (PMF) gives the probability of
each specific value.
o For Continuous Random Variables: Probability Density Function (PDF) gives the
probability of values in a range.
3. Joint Probability:
o The probability of two or more events occurring together.
o Example: The probability that a student is both a high scorer and attends all classes.
4. Marginal Probability:
o The probability of a single event occurring, irrespective of other events.
o Example: The probability that a student is a high scorer, ignoring their class attendance.
5. Conditional Probability:
o The probability of an event occurring given that another event has already occurred.
o Formula: P(A | B) = P(A∩B) / P(B)
o Example: The probability that a student is a high scorer given that they attend all classes.
6. Independence:
o Two events are independent if the occurrence of one event does not affect the probability of the
other.
o Formula: P(A∩B)=P(A)×P(B)
o Example: Tossing two coins; the outcome of one toss doesn't affect the other.
7. Bayes’ Theorem:
o A method to calculate the conditional probability of an event based on prior knowledge of
related events.
o Formula: P(A | B) = [P(B | A) × P(A)] / P(B)
o Example: Given the probability of having a disease and the probability of testing positive,
Bayes’ theorem helps find the probability of having the disease given a positive test result.
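A small numerical sketch of the disease-testing example follows; the prevalence, sensitivity, and false-positive rate are assumed values chosen only to illustrate Bayes' theorem.

```python
# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01              # assumed prior: 1% of people have the disease
p_pos_given_disease = 0.95    # assumed test sensitivity
p_pos_given_healthy = 0.05    # assumed false-positive rate

# Total probability of testing positive (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161 -> a positive test is far from certain
```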
8. Expectation (Expected Value):
o The expected value of a random variable is the long-term average value of repetitions of the
experiment.
o Formula: E[X] = Σ x · P(x) for a discrete random variable.
o Example: The expected number of heads in 10 coin tosses (each with a 50% chance of heads)
is 10 × 0.5 = 5.
9. Variance and Standard Deviation:
o Variance measures how much the values of a random variable differ from the expected value.
o Formula: Var(X) = E[(X − E[X])²]
o Standard Deviation is the square root of the variance, giving the spread of the data.
o Example: In coin tosses, variance tells us how far the actual number of heads will typically be
from the expected value.
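A quick hedged check of the coin-toss figures above, using the binomial formulas E[X] = np and Var(X) = np(1 − p):

```python
# Expected value and variance for the number of heads in 10 fair coin tosses.
n, p = 10, 0.5
expected_heads = n * p                 # E[X] = n * p = 5
variance = n * p * (1 - p)             # Var(X) = n * p * (1 - p) = 2.5
std_dev = variance ** 0.5              # typical deviation from the expected 5 heads
print(expected_heads, variance, round(std_dev, 2))   # 5.0 2.5 1.58
```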
10. Probability in ML Models:
o Classification: Models like Naive Bayes or logistic regression use probabilities to classify
data.
o Generative vs Discriminative Models:
Generative Models: Learn the joint probability distribution P(X,Y) and then predict
P(Y∣ X). Example: Naive Bayes.
Discriminative Models: Learn the conditional probability distribution P(Y∣ X).
Example: Logistic regression.
Basic Linear Algebra in Machine Learning
Matrices:
Definition: A matrix is a 2D array of numbers. It is used to represent multiple data points.
o Example:
o Each row can represent a data point and each column a feature.
Operations:
o Matrix Multiplication: Used to transform data or compute weighted
sums.
o Transpose: Flipping rows and columns of a matrix.
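A hedged NumPy sketch of these matrix ideas; the 3×2 houses-by-features matrix and the weight vector are made-up values for illustration.

```python
# Rows = data points (houses), columns = features (size in sq. m, bedrooms).
import numpy as np

X = np.array([[120, 3],
              [80,  2],
              [200, 4]])

weights = np.array([1000, 5000])   # assumed per-feature weights

# Matrix multiplication: a weighted sum of features for every data point at once.
scores = X @ weights
print(scores)        # [135000  90000 220000]

# Transpose: flip rows and columns (features become rows).
print(X.T.shape)     # (2, 3)
```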
Geometrical Interpretation:
o A vector can be viewed as a point or a direction in space, and multiplying by a matrix applies a linear
transformation (such as rotation, scaling, or projection) to that space.
3. Visual Representation
The relationship between bias, variance, and the error can often be visualized: as model complexity
increases, bias falls while variance rises, and their combination gives
Total Error = Bias² + Variance + Irreducible Error
(The accompanying graph is omitted.)
Function Approximation
In machine learning, function approximation refers to the process of learning a mapping (function) from
input data to output data.
Types of Function Approximation:
1. Linear Function Approximation:
o Definition: Approximates the target function using a linear model, where the output is a linear
combination of the input features.
2. Non-Linear Function Approximation:
o Definition: In many cases, the relationship between inputs and outputs is not linear, and non-
linear models like neural networks, decision trees, or polynomial regression are used to
capture complex patterns.
Steps in Function Approximation:
1. Model Selection:
2. Training:
3. Optimization:
4. Testing:
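As a hedged sketch of linear function approximation, the code below fits a straight line to noisy samples of an assumed target function using NumPy's least-squares solver.

```python
# Approximate an unknown mapping y = f(x) with a linear model y ≈ w*x + b.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)   # assumed noisy target

# Design matrix with a column of ones for the intercept term.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(w, 2), round(b, 2))   # close to the true slope 3 and intercept 2
```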
Causes of Overfitting
1. Complex Models:
o Definition: Models that are too complex, such as deep decision trees or large neural networks,
can capture every detail in the training data, including noise.
o Example: A decision tree that grows very deep and tries to fit every possible variation in the
training data may become too specific to that dataset.
2. Insufficient Training Data:
o Definition: When the training data is small or lacks diversity, the model is more likely to learn
the specific characteristics of the training set rather than general trends.
o Example: If a model is trained on a limited dataset with a narrow range of features, it may
memorize those few examples instead of learning patterns that generalize to unseen data.
Types of Regression
Linear Regression: The simplest form of regression, assuming a linear relationship between the
variables.
Polynomial Regression: Models a nonlinear relationship using a polynomial equation.
Logistic Regression: Used for classification problems, predicting binary outcomes (e.g., 0 or 1).
Elastic Net Regression
o Description: Combines the penalties of both Ridge and Lasso regression, allowing for both
feature selection and regularization.
Logistic Regression
Purpose: Used to predict the probability of an event occurring.
Equation: The logistic (sigmoid) function, σ(z) = 1 / (1 + e^(−z)), is used to map the linear combination of
the independent variables to a probability between 0 and 1.
Applications:
o Email spam detection
o Medical diagnosis
Example:
Scenario 1: Predicting Loan Approval (Yes/No Problem)
Problem:
o You want to predict whether a person’s loan will be approved or not based on their income and
credit score.
o The output is either 0 (No) or 1 (Yes) — it is a binary value.
Solution:
Use Logistic Regression, a type of generalized regression.
o Logistic regression applies a "log link function" that converts the output into probabilities
(values between 0 and 1).
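A hedged sketch of Scenario 1 with scikit-learn's LogisticRegression; the income and credit-score values, and the 0/1 labels, are invented for illustration.

```python
# Logistic regression for a yes/no loan-approval problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [annual income in thousands, credit score]; labels: 1 = approved, 0 = rejected.
X = np.array([[25, 520], [40, 600], [55, 640], [70, 700], [85, 750], [95, 780]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

applicant = np.array([[60, 660]])
print(model.predict_proba(applicant))   # [P(rejected), P(approved)], each between 0 and 1
print(model.predict(applicant))         # final 0/1 decision
```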
Applications:
Predicting sales based on advertising spend
Estimating house prices based on features like area or location
Objective of OLS:
The goal of Ordinary Least Squares (OLS) estimation is to find the values of the coefficients (β₀ and β₁) that
minimize the sum of the squared differences between the observed values and the values predicted by the
model. These differences are referred to as residuals.
The general simple linear regression model is: Y = β₀ + β₁X + ϵ
Residuals(error)
Residuals: The differences between the actual values and the predicted values.
Analysis of residuals: Used to check the assumptions of the model and identify any potential
problems.
Residual plots: Can help visualize the residuals and check for patterns or outliers
Residuals Calculation:
For each data point, the residual is: eᵢ = yᵢ − ŷᵢ (the observed value minus the value predicted by the model).
Properties of Residuals:
1. Sum of Residuals: When the model includes an intercept, the residuals always sum to zero: Σ eᵢ = 0.
3. No Autocorrelation: The residuals should not exhibit patterns over time or with respect to any
independent variable. Autocorrelation in residuals is often detected with the Durbin-Watson test.
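A hedged NumPy sketch of OLS for simple linear regression, using the closed-form estimates β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄; the data points are assumed, and the last line checks the sum-of-residuals property noted above.

```python
# Ordinary least squares for y = b0 + b1*x, plus a check that residuals sum to ~0.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # assumed observations

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope estimate
b0 = y_bar - b1 * x_bar                                             # intercept estimate

y_hat = b0 + b1 * x          # fitted values
residuals = y - y_hat        # observed minus predicted

print(round(b1, 3), round(b0, 3))
print(round(residuals.sum(), 10))   # essentially zero, as expected with an intercept
```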
For a house-price example, the multiple regression model can be written as Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ϵ, where:
Y is the house price,
β₀ is the base price (intercept),
β₁ , β₂ , β₃ are the coefficients (the impact of each factor on the price),
ϵ is the error term (things we cannot measure perfectly).
Coefficient P-Values:
What it is: Each independent variable (like size of the house or number of bedrooms) has a p-value,
which shows if that variable is helping to predict the outcome.
Easy interpretation:
o If a p-value for a variable (e.g., size) is less than 0.05, it’s important for the prediction.
o If a p-value is greater than 0.05, that variable might not be significant and can be ignored or
removed from the model.
Standard Error (SE)
What it is: The standard error of the regression measures how far, on average, the observed values fall from
the model’s predicted values.
Easy Example:
If you predict that tomorrow the temperature will be 30°C, but it's actually 33°C, the difference (3°C)
is part of the error.
If the standard error is 5°C, this means your temperature predictions are typically off by about 5°C.
The smaller the SE, the more accurate your model is.
Key Point: Smaller SE is better because it means your predictions are closer to the actual values.
Adjusted R-Squared
What it is: Adjusted R² is a modified version of R² that considers how many predictors (variables) are in the
model. It helps you compare models with different numbers of predictors.
Easy Example:
Suppose you have two models predicting ice cream sales:
o Model A uses temperature and sunny days as predictors.
o Model B uses temperature, sunny days, and humidity.
Adjusted R² will tell you if adding humidity (a new variable) to the model improves predictions or if it
just complicates things.
If Adjusted R² increases after adding humidity, it means the new variable is useful. If it decreases, it
means the new variable is not helping much and might even hurt the model.
Key Point: Adjusted R² helps prevent overfitting by penalizing models with too many unnecessary variables.
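A hedged sketch of the standard formulas R² = 1 − SS_res/SS_tot and Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1); the R² value and sample size below are assumed, and the comparison shows how an extra predictor is penalized.

```python
# Compare adjusted R-squared for models with different numbers of predictors.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.85          # assumed R-squared of the fitted model
n = 30             # assumed number of observations

print(round(adjusted_r2(r2, n, k=2), 3))   # temperature + sunny days
print(round(adjusted_r2(r2, n, k=3), 3))   # ... plus humidity: lower if R² does not improve
```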
Linear Discriminant Analysis (LDA)
LDA is a supervised technique that reduces dimensionality by projecting the data onto the directions that
best separate the classes.
Steps
1. Find the Centers: See where each group of points is concentrated.
2. Measure the Distance and Spread: Check how far apart the groups are and how scattered the points
are within each group.
3. Draw the Best Line: Find a line or direction that best divides the groups for clear classification.
Example Use Case: Face recognition or disease diagnosis based on patient data.
Advantages of LDA:
Focuses on maximizing class separability, which improves classification performance.
Helps in visualizing multi-class data in fewer dimensions while maintaining class distinctions.
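A hedged scikit-learn sketch of LDA used both for projection and classification; the Iris dataset and the two-component choice are assumptions for illustration.

```python
# Linear Discriminant Analysis: project data onto directions that separate the classes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)        # supervised: uses the class labels

print(X_lda.shape)                     # (150, 2) -> 4 features reduced to 2
print(round(lda.score(X, y), 3))       # classification accuracy on the training data
```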
UNIT-3
Introduction to Classification and Classification Algorithms
1. What is Classification?
Definition: Classification is a supervised machine learning technique used to categorize data
into predefined labels or classes based on its attributes.
Analogy: Think of sorting mail into categories like "Personal," "Work," and "Spam." The goal
is to place each email in the correct category based on its content.
Example: Classify whether a fruit is an apple or orange based on features like color and size.
4. Random Forest
Definition: An ensemble method that combines multiple decision trees to improve
classification performance.
Key Points:
o Builds many decision trees during training.
o Combines the output of all trees (majority voting) for the final classification.
Advantages:
o Handles large datasets efficiently.
o Reduces overfitting compared to a single decision tree.
Disadvantages:
o Requires more computational resources.
o Less interpretable than a single decision tree.
Example:
o Predict whether a loan applicant is "Creditworthy" or "Not Creditworthy" based on
features like income, credit score, and employment history.
Numerical Aspect:
o Decision Tree Splitting:
Fuzzy Set Approaches
Definition: Instead of assigning each data point to exactly one class, fuzzy approaches assign a degree of
membership (between 0 and 1) to every class.
Advantages:
o Effective for complex problems with overlapping classes.
o Provides a degree of confidence for each class.
Disadvantages:
o Requires careful design of membership functions.
o Computationally intensive.
Example:
o Classify the "risk level" of patients (Low, Medium, High) based on fuzzy inputs like
blood pressure and heart rate.
Recommended Resources
o "k-Nearest Neighbour Algorithm" by Simplilearn
o "Random Forest Algorithm Explained" by StatQuest
o "Fuzzy Logic with Examples" by Neso Academy
Advantages of SVM
Effective for high-dimensional datasets.
Works well for both linear and non-linear classification.
Robust to overfitting, especially in high-dimensional spaces.
Disadvantages of SVM
Computationally expensive for large datasets.
Requires careful selection of kernel functions and parameters.
Can be sensitive to outliers.
Recommended Resources
1. YouTube:
o "Support Vector Machine Explained" by StatQuest
o "SVM Kernels - Linear, Polynomial, RBF" by Great Learning
Example: For a dataset with two classes (e.g., cats and dogs), the hyperplane is the decision
boundary that separates the feature representations of cats from those of dogs.
Properties of SVM
1. Margin Maximization:
o SVM seeks to maximize the margin between the hyperplane and the nearest data points
(support vectors).
o Larger margins reduce overfitting and improve model generalization.
2. Support Vectors:
o Only the data points closest to the hyperplane (support vectors) are used to define the
decision boundary.
o These points are critical for training the SVM.
3. Kernel Trick:
o SVM can handle non-linearly separable data by using kernel functions to transform it
into a higher-dimensional space where it becomes linearly separable.
4. Dual Representation:
o The optimization problem in SVM can be expressed in terms of Lagrange multipliers,
allowing efficient computation.
5. Robustness to High Dimensions:
o SVM performs well in datasets with many features (e.g., text classification with
thousands of words).
Issues in SVM
1. High Computational Cost:
o Training an SVM can be computationally expensive for large datasets, especially with
non-linear kernels.
2. Choice of Kernel:
o Selecting the appropriate kernel function (e.g., linear, polynomial, or Gaussian) and
tuning its parameters can be challenging and critical for model performance.
3. Sensitivity to Outliers:
o SVM is sensitive to noise and outliers, as they can affect the position of the hyperplane.
4. Imbalanced Data:
o SVM struggles with imbalanced datasets, as it assumes equal importance for all classes.
This may result in a biased hyperplane.
5. Interpretability:
o Compared to simpler models like decision trees, SVM is less interpretable, especially
when using complex kernels.
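A hedged sketch comparing the linear, polynomial, and Gaussian (RBF) kernels with scikit-learn's SVC; the moon-shaped dataset and hyper-parameter values are assumptions chosen for illustration.

```python
# Train SVMs with linear, polynomial, and Gaussian (RBF) kernels on the same data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:          # rbf = Gaussian kernel
    clf = SVC(kernel=kernel, C=1.0)
    clf.fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))
# The non-linear kernels usually separate the two moon shapes better than the linear one.
```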
Recommended Resources
o "SVM Explained Visually" by StatQuest
o "Understanding the SVM Hyperplane and Support Vectors" by Edureka
2. Bayes’ Theorem
Formula: P(H | D) = [P(D | H) × P(H)] / P(D),
where P(H) is the prior probability of hypothesis H, P(D | H) is the likelihood of the data, P(D) is the
evidence, and P(H | D) is the posterior probability.
Key Points:
Prior probability is updated using new evidence to compute the posterior probability.
The posterior becomes the new prior as more evidence accumulates.
Concept Learning
Definition: Concept learning involves finding a hypothesis H that best explains the observed data D.
Bayesian Perspective:
o All possible hypotheses are considered.
o The best hypothesis is the one with the highest posterior probability P(H | D), i.e., the
maximum a posteriori (MAP) hypothesis.
Key Equation: h_MAP = argmax_H P(H | D) = argmax_H P(D | H) × P(H)
Steps:
1. Compute the prior probability P(C) for each class.
2. Compute the likelihood P(X | C) for each feature, assuming the features are independent.
3. Use Bayes’ theorem to compute the posterior probability for each class.
4. Choose the class with the highest posterior probability.
Example: Email Spam Classification:
Features: Words in the email (e.g., "money," "free").
Class: Spam or not spam.
Assumes the presence of "money" and "free" are independent indicators.
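A hedged sketch of the spam example with a bag-of-words representation and multinomial Naive Bayes; the four-email corpus and labels are made up for illustration.

```python
# Naive Bayes spam classifier: word counts as features, spam/not-spam as classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win free money now", "free money offer", "meeting schedule for monday",
          "project report attached"]
labels = [1, 1, 0, 0]          # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # each word becomes a (conditionally independent) feature

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["free money for the project"])
print(model.predict(test), model.predict_proba(test))
```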
Example (EM algorithm):
Clustering customer data based on purchase behavior where some features are missing.
Boosting
Concept: Trains models sequentially, where each subsequent model focuses on correcting the
errors of the previous ones.
Key Points:
o Reduces bias and improves accuracy.
o Can be sensitive to noise and outliers.
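A hedged sketch of boosting with scikit-learn's AdaBoostClassifier; the synthetic dataset and the number of estimators are assumptions.

```python
# AdaBoost: weak learners are trained sequentially, each focusing on earlier mistakes.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

print(round(model.score(X_test, y_test), 3))
```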
UNIT-4
Cluster Analysis:
Definition: Cluster analysis is an unsupervised learning technique used to group similar data
points into clusters, where the points within a cluster are more similar to each other than to
points in other clusters.
Objective: The goal of clustering is to explore the inherent structure of the data and to
categorize data into meaningful groups without pre-defined labels.
Real-World Analogy:
Cluster analysis is like organizing a collection of books in a library. Instead of grouping them by title
or author, you group them by similarity, such as genre, themes, or writing style. The books within each
cluster are more similar to each other than to the books in other clusters.
a. Similarity and Distance Measures:
Definition: Clustering relies on a measure of how similar, or how far apart, two data points are.
o Manhattan Distance: The sum of the absolute differences between the coordinates of two points.
o Cosine Similarity: Measures the cosine of the angle between two vectors, often used
in text mining.
b. Homogeneity within Clusters:
Definition: A good clustering algorithm should produce groups where the items within each
cluster are as similar as possible.
Requirement: Ideally, data points within a cluster should exhibit high similarity and data
points across clusters should exhibit high dissimilarity.
c. Heterogeneity between Clusters:
Definition: The dissimilarity between clusters should be maximized, meaning that clusters
should be as distinct as possible.
Example: In customer segmentation, different customer types (e.g., young vs. old, low-
income vs. high-income) should form separate clusters.
d. Scalability:
Definition: The ability of a clustering algorithm to handle large datasets effectively.
Challenge: Many clustering algorithms become inefficient as the size of the data increases.
Example: Algorithms like K-Means are scalable, while others, such as hierarchical clustering,
are less scalable with large datasets.
e. Interpretability:
Definition: The results of the clustering should be easy to interpret and explain.
Challenge: Some clustering algorithms, like DBSCAN, can produce clusters that are difficult
to interpret in practical terms.
f. Assumptions about Data Distribution:
Different clustering algorithms may assume different data distributions.
o K-Means assumes that clusters are spherical and equally sized.
o DBSCAN assumes that clusters are dense regions of data separated by sparse regions.
o Gaussian Mixture Models (GMM) assume data is generated from a mixture of
several Gaussian distributions.
Types of Clustering Methods
Clustering methods can be broadly categorized into several approaches, each with different
assumptions and applications.
a. Partitioning Methods
Description: These methods divide the data into a specified number of clusters.
Example: K-Means Clustering
o How it works: The algorithm selects k initial centroids and iteratively refines them to
minimize the sum of squared distances within clusters.
b. Hierarchical Methods
Description: These methods build a hierarchy of clusters, creating a tree-like structure
(dendrogram).
Example: Agglomerative Hierarchical Clustering
o How it works: Starts with each data point as its own cluster and merges the closest
clusters iteratively.
c. Density-Based Methods
Description: These methods define clusters as areas of high density separated by areas of low
density.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
o How it works: Identifies clusters based on dense regions of data points and considers
points in sparse regions as noise.
d. Model-Based Methods
Description: These methods assume the data are generated by an underlying statistical model; each
cluster follows a specific distribution, and the algorithm fits that model to organize the data into
meaningful clusters.
Example: Gaussian Mixture Models (GMM)
o How it works: Assumes data is a mixture of several Gaussian distributions, and tries to
estimate the parameters of these distributions.
e. Grid-Based Methods
Description: These methods partition the data space into a finite number of cells (grid) and
perform clustering based on the grid structure.
Example: STING (Statistical Information Grid-Based Clustering)
o How it works: Divides the dataset into grid cells, and uses statistical measures to
determine clusters.
Overview of Some Basic Clustering Methods
Clustering is an unsupervised learning technique that groups similar data points together. Here’s an
overview of some widely used clustering algorithms:
1. k-Means Clustering
Definition:
k-Means is a partitioning-based clustering algorithm that divides the data into k distinct clusters,
where each data point belongs to the cluster whose center (centroid) is closest.
https://youtu.be/5FpsGnkbEpM?si=DiZn6-DbbI5SSh0p
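A hedged scikit-learn sketch of k-Means on synthetic data; the number of clusters and the blob parameters are assumptions.

```python
# k-Means: assign each point to the nearest of k centroids, then update the centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the three learned centroids
print(labels[:10])               # cluster assignment of the first ten points
```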
Real-World Example:
2. k-Medoids Clustering
Definition: k-Medoids is similar to k-Means, but instead of using the mean of the points to represent
the centroid of a cluster, it uses the most centrally located point (medoid).
It minimizes the sum of dissimilarities between points and the representative medoid.
Advantages:
Less sensitive to outliers than k-Means since medoids are less affected by extreme values.
Can work with arbitrary distance metrics (e.g., Manhattan distance, cosine similarity).
Disadvantages:
Computationally more expensive than k-Means.
Requires the number of clusters k to be pre-defined.
Not suitable for very large datasets.
Real-World Example:
Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
Definition: DBSCAN is a density-based clustering algorithm that groups together closely packed
points, while marking points in low-density regions as outliers. It does not require the number of
clusters to be predefined.
Advantages:
Can discover clusters of arbitrary shape.
Does not require the number of clusters to be specified in advance.
Can handle noise and outliers effectively.
Works well with datasets containing clusters of varying shapes and densities.
Disadvantages:
Sensitive to the choice of the ε (neighbourhood radius) and MinPts (minimum points) parameters.
Struggles with datasets of varying density, where some clusters may be harder to identify.
Computationally expensive for large datasets.
Real-World Example:
DBSCAN is widely used in spatial data clustering, such as identifying areas of high customer
activity in retail sales, or in geographic data analysis, where it helps to find densely populated regions
in a map.
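A hedged sketch of DBSCAN on a non-spherical dataset; the eps and min_samples values are assumptions and would normally need tuning for real data.

```python
# DBSCAN: grow clusters from dense neighbourhoods; points in sparse regions get label -1 (noise).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(np.unique(labels))                  # cluster ids; -1 marks outliers/noise
print(int(np.sum(labels == -1)), "noise points")
```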
Advantages:
Can model clusters of elliptical shapes, unlike k-Means (which assumes spherical clusters).
Provides probabilities for cluster membership, which can be useful for decision-making.
Can model complex data distributions.
Disadvantages:
Computationally intensive and requires careful initialization.
Assumes data is generated from Gaussian distributions, which may not always be the case.
The number of clusters k must be specified in advance.
Real-World Example:
GMM can be used in image segmentation, where the algorithm assigns pixels in an image to different
regions based on color distributions, modeling the color distribution as a mixture of Gaussians.
Real-World Example:
BIRCH is often used in large-scale data analysis like customer segmentation in large retail stores,
where millions of customer records need to be processed quickly.
Real-World Example:
Mean-Shift clustering is popular in image segmentation, where it helps segment an image into
regions based on color and texture, without needing to predefine the number of regions.
Recommended Resources
1. YouTube:
o "Understanding Affinity Propagation Clustering" by Data School.
o "Mean-Shift Clustering Algorithm - Machine Learning" by Simplilearn.
Advantages:
Does not require the number of clusters to be specified in advance.
Produces a hierarchical tree (dendrogram) that provides insight into the data structure.
Can handle clusters of arbitrary shapes.
Disadvantages:
Computationally expensive for large datasets (especially when the number of data points is
large).
Sensitive to noise and outliers.
Real-World Example:
Agglomerative hierarchical clustering is used in gene expression analysis, where the goal is to group
similar genes based on their expression patterns across multiple conditions.
Real-World Example:
Divisive hierarchical clustering can be used in document classification, where initially, all
documents are in one cluster, and the task is to split them based on the topic until each document is in
its own topic-based cluster.
Silhouette Score:
Measures how similar a point is to its own cluster compared with the nearest other cluster; values
close to 1 indicate well-separated clusters.
Formula: s(i) = (b(i) − a(i)) / max(a(i), b(i))
Where:
o a(i) is the average distance between point i and all other points in the same cluster.
o b(i) is the average distance between point i and all points in the nearest other cluster.
Davies-Bouldin Index (DBI):
This measures the average similarity ratio of each cluster with the one most similar to it. A
lower Davies-Bouldin index indicates better clustering.
Formula: DBI = (1/k) Σᵢ max(j≠i) [ (sᵢ + sⱼ) / d(cᵢ, cⱼ) ]
Where:
o k is the number of clusters, sᵢ is the average distance of the points in cluster i to its centroid cᵢ,
and d(cᵢ, cⱼ) is the distance between the centroids of clusters i and j.
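A hedged sketch computing both measures with scikit-learn's silhouette_score and davies_bouldin_score; the synthetic data and the candidate values of k are assumptions.

```python
# Higher silhouette and lower Davies-Bouldin index both indicate better-separated clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3),
          round(davies_bouldin_score(X, labels), 3))
```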
Recommended Resources
1. YouTube:
o "Agglomerative Clustering - Machine Learning" by StatQuest.
o "Divisive Hierarchical Clustering" by Data Science Society.
o "Measuring Clustering Performance - Machine Learning" by Simplilearn.