Data Cleaning and Preprocessing
1. Introduction
Data cleaning and preprocessing are essential steps in data science and
machine learning. Raw data is often messy: it can contain missing values,
inconsistencies, and irrelevant features. A well-prepared dataset generally
leads to more accurate and more efficient machine learning models.
Data cleaning and preprocessing are crucial because they ensure the
dataset is reliable, consistent, and suitable for analysis.
# Remove rows that exactly duplicate an earlier row, modifying df in place
df.drop_duplicates(inplace=True)
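The drop_duplicates call above can be tried on a small table. The following is a minimal sketch, assuming a hypothetical toy DataFrame (the column names and values are invented for illustration). It counts the duplicate rows first, then removes them, and returns a new DataFrame instead of using inplace=True so the original stays available for comparison.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "age": [31, 45, 45, 28],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default;
# reset_index(drop=True) renumbers the remaining rows from 0.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(df_clean)} rows remain")
```

Checking the count before dropping is a cheap sanity check: an unexpectedly large number of duplicates often points to a problem in how the data was collected or merged.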
3. Data Preprocessing
3.1 Feature Scaling
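Scaling puts numerical features on a comparable range, which often
improves the performance of ML models that are sensitive to feature
magnitude, such as distance-based and gradient-based methods.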
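Two common ways to scale features are standardization (zero mean, unit variance) and min-max normalization (values mapped to [0, 1]). The sketch below uses scikit-learn's StandardScaler and MinMaxScaler for this. The toy DataFrame and its column names are assumptions made for illustration, not part of the original material.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [20000.0, 50000.0, 80000.0],
})

# Standardization: each column ends up with mean 0 and unit variance.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns
)

# Min-max normalization: each column is mapped linearly onto [0, 1].
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(df), columns=df.columns
)

print(standardized)
print(normalized)
```

In a real workflow the scaler is usually fitted on the training split only and then applied to the test split with transform, so that no information from the test data leaks into the model.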
Applying PCA
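from sklearn.decomposition import PCA
# Assumes df contains only numeric, scaled columns with no missing values
pca = PCA(n_components=2)  # keep the first two principal components
df_pca = pca.fit_transform(df)  # NumPy array with one row per sample, 2 columns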
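PCA is sensitive to feature scale, so it is commonly applied after standardization. The sketch below combines the two steps and inspects how much variance the retained components explain. The synthetic data (columns x1, x2, x3 and the random seed) is an assumption made for illustration: x2 is built as a noisy multiple of x1, so two components should capture most of the variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: x2 is strongly correlated with x1,
# x3 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "x1": base,
    "x2": 2.0 * base + rng.normal(scale=0.1, size=100),
    "x3": rng.normal(size=100),
})

# Standardize first so that no feature dominates because of its units.
scaled = StandardScaler().fit_transform(df)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Fraction of the total variance captured by each retained component.
explained = pca.explained_variance_ratio_
print(components.shape)
print(explained, explained.sum())
```

Looking at explained_variance_ratio_ is one way to judge whether n_components=2 is enough: if the sum is low, the projection discards a large share of the information in the data.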