Unit 7 ML
Data collection and preprocessing are critical steps in the data analysis
and machine learning pipeline.
Data Collection
1. Define Objectives: Clearly articulate what you want to learn or predict. Your objectives determine
what data you need and how much of it.
2. Identify Data Sources: Determine where your data will come from. It might be databases, APIs,
web scraping, sensors, surveys, or existing datasets.
3. Data Gathering: Collect the data from the identified sources. This can involve writing scripts or
programs to automate data retrieval. Ensure that you have the necessary permissions and rights to
use the data.
4. Data Quality Check: Examine the data for quality issues such as missing values, duplicates,
outliers, and inconsistencies. Clean the data as needed to address these issues.
5. Data Integration: If your data comes from multiple sources, you may need to integrate it into a
single dataset. This may involve data merging, joining, or concatenation.
6. Data Storage: Decide on an appropriate storage format and location for your data. Common
options include relational databases, NoSQL databases, data lakes, or simple file formats like CSV
or JSON.
7. Data Documentation: Maintain documentation that describes the data sources, collection methods,
and any transformations or cleaning steps performed. This documentation is crucial for
reproducibility.
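As a concrete sketch of the gathering and quality-check steps above (the CSV content and column names are illustrative placeholders, and pandas is assumed to be available):

```python
# Sketch of a minimal collection + quality-check pass using pandas.
import io
import pandas as pd

# Stand-in for data gathered from a file, API, or database export.
raw_csv = io.StringIO(
    "id,age,income\n"
    "1,34,52000\n"
    "2,,61000\n"
    "1,34,52000\n"
    "3,29,\n"
)
df = pd.read_csv(raw_csv)

# Quality check (step 4): missing values and exact duplicate rows.
missing_per_column = df.isna().sum()
duplicate_rows = df.duplicated().sum()

# Basic cleaning: drop exact duplicates.
df = df.drop_duplicates().reset_index(drop=True)
```

The same `isna`/`duplicated` pass is worth rerunning after any integration step (step 5), since merges often reintroduce duplicates.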
Data Preprocessing
1. Handling Missing Data: Decide how to deal with missing values. You can either remove rows with missing data,
impute missing values with statistical methods, or use advanced imputation techniques.
2. Outlier Detection and Treatment: Identify and handle outliers that can skew your analysis. You can remove outliers,
transform them, or use robust statistical methods.
3. Feature Selection: Choose relevant features (columns) that are likely to contribute to your analysis or machine
learning model. Feature selection can reduce dimensionality and prevent overfitting.
4. Feature Engineering: Create new features that can provide more information or improve model performance. This
might involve mathematical transformations, aggregation, or creating categorical variables from continuous data.
5. Scaling and Normalization: Scale or normalize your data to ensure that different features have similar scales.
Common techniques include min-max scaling and z-score normalization.
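The imputation and scaling steps above can be sketched with numpy alone (the sample values are arbitrary illustrations):

```python
# Sketch: mean imputation plus the two scalings named above.
import numpy as np

x = np.array([2.0, 4.0, np.nan, 8.0, 6.0])

# 1. Handling missing data: impute NaNs with the mean of the observed values.
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# 5. Min-max scaling maps values into [0, 1] ...
x_minmax = (x_imputed - x_imputed.min()) / (x_imputed.max() - x_imputed.min())

# ... while z-score normalization gives mean 0 and unit variance.
x_zscore = (x_imputed - x_imputed.mean()) / x_imputed.std()
```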
6. Encoding Categorical Data: Convert categorical variables into numerical format using techniques like one-hot
encoding or label encoding, depending on the nature of the data and the machine learning algorithm you plan to use.
7. Data Splitting: Divide your dataset into training, validation, and test sets for machine learning tasks. The training set
is used to train models, the validation set is used for hyperparameter tuning, and the test set is reserved for evaluating
model performance.
8. Data Transformation: Some machine learning algorithms require specific data transformations, such as principal
component analysis (PCA) or time series decomposition.
9. Data Imbalance Handling: If you're dealing with imbalanced datasets in classification tasks, consider techniques
like oversampling, undersampling, or using different evaluation metrics.
10. Data Visualization: Visualize your data to gain insights and identify patterns or anomalies. Data visualization tools
can help you explore the data's characteristics.
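Steps 6 and 7 can be sketched as follows (pandas and scikit-learn are assumed to be available; the 60/20/20 split and column names are illustrative choices):

```python
# Sketch: one-hot encoding and a train/validation/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue"] * 4,
    "value": range(20),
    "label": [0, 1] * 10,
})

# 6. One-hot encode the categorical column.
df_encoded = pd.get_dummies(df, columns=["color"])

# 7. Split: first hold out 20% as the test set, then carve a
# validation set out of the remainder (0.25 * 0.8 = 0.2 overall).
train_val, test = train_test_split(df_encoded, test_size=0.2, random_state=0)
train, val = train_test_split(train_val, test_size=0.25, random_state=0)
```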
Outlier analysis using Z-Score
Outlier analysis using Z-Score, also known as the standard score, is a statistical technique
used to identify and deal with outliers in a dataset. Outliers are data points that deviate
significantly from the rest of the data and can distort statistical analysis and machine
learning models. Z-Score helps you quantify how far each data point is from the mean and
provides a threshold for identifying outliers.
Perform outlier analysis using Z-Score
Calculate the mean (average) and standard deviation of your dataset. These statistics
describe the central tendency and the spread of the data, respectively.
Z-Score (Z) = (x - μ) / σ
where x is a data point, μ is the mean, and σ is the standard deviation of the dataset.
The Z-Score represents how many standard deviations a data point is away from the mean. A Z-Score of 0 means
the data point is exactly at the mean, positive values indicate data points above the mean, and negative values
indicate data points below the mean.
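The formula translates directly into numpy (the sample data, with 40 as a planted outlier, is illustrative):

```python
# Computing Z-Scores from the formula Z = (x - mu) / sigma.
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 40.0])  # 40 is a planted outlier
mu = data.mean()
sigma = data.std()
z = (data - mu) / sigma

# Flag points more than 2 standard deviations from the mean.
is_outlier = np.abs(z) > 2
```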
Set a Threshold for Identifying Outliers: Decide on a threshold Z-Score value beyond which data points are considered
outliers. A common threshold is |Z| > 2 (that is, Z > 2 or Z < -2), which corresponds to data points that are more
than two standard deviations away from the mean.
Identify Outliers: Data points with Z-Scores exceeding the chosen threshold are considered outliers. You can create a new
binary variable to label them as outliers (1) or not (0).
Handle Outliers:
1. Remove Outliers: Exclude outlier data points from your analysis, especially if you believe they are erroneous or
irrelevant.
2. Transform Data: Apply transformations to mitigate the impact of outliers, such as log transformations or winsorization
(capping extreme values).
3. Keep Outliers: In some cases, outliers may be of interest, and you may want to analyze them separately or understand
why they exist.
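The removal and winsorization options can be sketched like this (the data and z-score threshold repeat the earlier illustration; the 5th/95th-percentile caps are an arbitrary choice):

```python
# Sketch: two of the handling options applied to flagged points.
import numpy as np

data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 40.0])
z = (data - data.mean()) / data.std()
is_outlier = np.abs(z) > 2

# Option 1: remove outliers entirely.
cleaned = data[~is_outlier]

# Option 2: winsorize by capping at the 5th/95th percentiles.
low, high = np.percentile(data, [5, 95])
winsorized = np.clip(data, low, high)
```

Winsorization keeps the sample size intact, which matters when rows carry other useful columns.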
Reanalyze Data: After handling outliers, you can recompute summary statistics, visualize the data, or build machine
learning models with the cleaned dataset.
Z-Score-based outlier analysis is a simple yet effective method for identifying and managing outliers in your data.
However, the choice of the Z-Score threshold is somewhat subjective and should be guided by domain knowledge and the
specific goals of your analysis. Additionally, be aware that Z-Score-based methods may not work well for datasets with
non-normal distributions, and alternative techniques may be more appropriate in such cases.
Model selection & evaluation
Model selection and evaluation are crucial steps in the process of building and deploying
machine learning models. Selecting the right model and assessing its performance correctly
are essential for achieving the best results in your machine learning project.
Model Selection
1. Define Your Goals: Clearly articulate the objectives of your machine learning project. Understand what
you want to predict or accomplish with your model.
2. Choose Candidate Models: Based on your problem type (classification, regression, clustering, etc.) and
the nature of your data, select a set of candidate machine learning algorithms. Common choices include
linear regression, decision trees, random forests, support vector machines, neural networks, etc.
3. Feature Selection/Engineering: Before building and comparing models, carefully select and preprocess
your features. Feature engineering may involve creating new features, transforming data, and handling
missing or categorical data.
4. Split the Data: Divide your dataset into training, validation, and test sets. The training set is used for
model training, the validation set for hyperparameter tuning and model selection, and the test set for final
model evaluation.
5. Train Models: Train each candidate model using the training data. Tune hyperparameters using the
validation set to find the best-performing configuration for each model.
6. Cross-Validation: Perform cross-validation on the training data to assess each model's performance more
robustly. Common techniques include k-fold cross-validation.
7. Evaluate Model Complexity: Consider the trade-off between model complexity and performance.
Simpler models are less likely to overfit but may have lower predictive power, while complex models may
capture more nuances but are more prone to overfitting.
8. Select the Best Model: Based on cross-validation results and your evaluation criteria (e.g., accuracy,
precision, recall, F1-score for classification; RMSE, MAE for regression), choose the best-performing
model.
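Steps 5, 6, and 8 can be sketched with scikit-learn's cross-validation utilities on a synthetic dataset (the candidate models and dataset parameters are illustrative, and scikit-learn is assumed to be available):

```python
# Sketch: comparing two candidate models with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean cross-validated accuracy per candidate (steps 5-6).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Step 8: pick the best-performing model by the chosen metric.
best_name = max(scores, key=scores.get)
```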
Model Evaluation
1. Test Data Evaluation: Once you've selected your best model, evaluate its performance on the test dataset,
which it has never seen before. This step provides a realistic estimate of how well your model will
perform on unseen data.
2. Performance Metrics: Choose appropriate evaluation metrics based on your problem type. For
classification, you might use accuracy, precision, recall, F1-score, ROC AUC, etc. For regression,
common metrics include RMSE, MAE, R-squared, etc.
3. Confusion Matrix (Classification): Analyze the confusion matrix to understand the model's performance
regarding true positives, true negatives, false positives, and false negatives. This can help you make
informed decisions about trade-offs between precision and recall.
4. Visualizations: Create visualizations, such as ROC curves, precision-recall curves, or residual plots, to
gain insights into your model's behavior.
5. Business Impact: Consider the business or real-world implications of your model's performance.
Evaluate whether the model meets the desired objectives and whether it aligns with the project's goals.
6. Bias and Fairness: Assess the model for biases, fairness, and ethical concerns. Ensure that it doesn't
discriminate against certain groups or exhibit unintended behavior.
7. Interpretability: If model interpretability is important, use techniques such as feature importance analysis
or model-agnostic interpretability tools to understand how the model makes predictions.
8. Iterate and Refine: Depending on the evaluation results, you may need to iterate on the model selection,
feature engineering, or data preprocessing steps to improve model performance.
9. Documentation: Maintain thorough documentation of the selected model, its hyperparameters, and the
evaluation results. This documentation is crucial for reproducibility and future reference.
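A minimal sketch of the test-set evaluation described in steps 1-3, assuming scikit-learn is available (the synthetic dataset and logistic regression model are illustrative stand-ins):

```python
# Sketch: final evaluation on a held-out test set with a confusion
# matrix and common classification metrics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted
```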
Optimization of tuning parameters
Hyperparameter tuning is a crucial step in building machine learning models. Hyperparameters are
parameters of the model that are not learned from the data but are set before training. Tuning these
hyperparameters can significantly impact a model's performance.
Define Your Search Space:
Start by identifying the hyperparameters you want to tune. These may include learning rates, regularization
strengths, tree depths, kernel types, etc., depending on the algorithm you're using.
Choose a Search Strategy:
•Grid Search: In this method, you specify a set of possible values for each hyperparameter. The algorithm
then evaluates all possible combinations, creating a grid of hyperparameter configurations. Grid search is
straightforward but can be computationally expensive.
•Random Search: Random search selects random combinations of hyperparameters from predefined
ranges. It's often more efficient than grid search and can find good hyperparameters faster.
Cross-Validation:
Split your training data into multiple subsets for cross-validation. A common choice is k-fold
cross-validation, where the data is divided into k subsets and each model is trained and evaluated on different
combinations of these subsets. Cross-validation helps assess how well a set of hyperparameters performs across
different data partitions, reducing the risk of overfitting.
Hyperparameter Tuning:
Apply the chosen search method (grid search or random search) to find the best hyperparameters. For
each combination of hyperparameters:
•Train the model on the training data.
•Evaluate the model's performance using cross-validation and the chosen evaluation metric(s).
•Record the evaluation metric(s) for that combination.
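The train/evaluate/record loop above is what scikit-learn's GridSearchCV automates; a minimal sketch, assuming scikit-learn is available (the parameter grid and dataset are illustrative):

```python
# Sketch: grid search with 5-fold cross-validation over a tiny grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation per combination
    scoring="accuracy",   # the evaluation metric to record
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Swapping `GridSearchCV` for `RandomizedSearchCV` gives the random-search variant over the same grid.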
Refinement: Depending on the results, you may need to iterate on the optimization process, fine-tuning the
hyperparameters further or even revisiting your initial choices.
Documentation: Document the optimized hyperparameters and the performance metrics associated with them.
This documentation is essential for reproducibility and model deployment.
Deployment and Monitoring: Deploy your model with the optimized hyperparameters in a production
environment. Monitor its performance over time and be prepared to re-tune the hyperparameters periodically as
the data distribution evolves or the model's requirements change.
Visualization of results
Visualizing the results of your data analysis or machine learning model can
provide valuable insights, help you communicate your findings effectively, and
aid in decision-making. The choice of visualization techniques depends on the
type of data and the specific goals of your analysis.
Common ways to visualize
Line Charts
Line charts are ideal for showing trends over time or
across ordered categories. They are commonly used
for time series data or to visualize the relationship
between two continuous variables.
Scatter Plots
Scatter plots are effective for visualizing the
relationship between two continuous variables. Each
point on the plot represents a data point, making it
easy to identify patterns or outliers.
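A minimal matplotlib sketch of these two chart types (the data is synthetic, and the non-interactive Agg backend is an assumption so the script runs headlessly):

```python
# Sketch: a line chart and a scatter plot side by side.
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
rng = np.random.default_rng(0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))  # trend over an ordered axis
ax1.set_title("Line chart")
ax2.scatter(x, np.sin(x) + rng.normal(0, 0.2, x.size))  # noisy relationship
ax2.set_title("Scatter plot")
fig.savefig("charts.png")
```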
Area Charts
Area charts are similar to line charts but are filled in with
color, making it easier to visualize the cumulative effect of
values over time.
Sankey Diagrams
Sankey diagrams are used to visualize the flow of resources or
quantities between different entities. They are often used in
process analysis or to depict hierarchical structures.
Choropleth Maps
Choropleth maps use color-coding to represent data by
geographic regions. They are useful for showing regional
patterns or variations.
3D Plots
3D plots can be used when you need to visualize data in three
dimensions. They are suitable for situations where two
continuous variables are dependent on a third variable.
Interactive Dashboards
Create interactive dashboards using tools like Tableau, Power
BI, or Plotly. These allow users to explore data and results
dynamically.
Network Graphs
Network graphs are useful for visualizing
relationships between entities, such as social
networks, co-authorship networks, or
hierarchical structures.
Error Bars
When presenting statistical results, use error
bars to indicate variability or uncertainty in
your measurements.
Thank You