Data Binning
Snowflake Schema vs. Star Schema

Structure - Snowflake: a centralized fact table connected to dimension tables in a hierarchical manner. Star: a centralized fact table connected to dimension tables in a star-like structure.
Query Performance - Snowflake: excellent for complex queries and aggregations. Star: better for simple queries and aggregations.
Storage Efficiency - Snowflake: highly efficient for storing data. Star: less efficient due to denormalization.
Scalability - Snowflake: highly scalable due to the separation of data. Star: limited scalability due to denormalization.
Data Integrity - Snowflake: ensures high data integrity. Star: lower data integrity due to denormalization.
Complexity - Snowflake: more complex to design and maintain. Star: simpler to design and maintain.
Flexibility - Snowflake: more flexible for changes in the data model. Star: less flexible for changes in the data model.
Usage - Snowflake: suitable for large, complex data warehouses. Star: suitable for small to medium-sized data warehouses.
Storage Overhead - Snowflake: requires less storage space. Star: requires more storage space.
Data preprocessing transforms the data into a format that is more easily
and effectively processed in data mining, machine learning and other data
science tasks. The techniques are generally used at the earliest stages of
the machine learning and AI development pipeline to ensure accurate
results.
2. Data cleansing. The aim here is to find the easiest way to rectify quality
issues, such as eliminating bad data, filling in missing data or otherwise
ensuring the raw data is suitable for feature engineering.
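As a minimal sketch of this cleansing step, a small pandas example; the columns and values here are invented purely for illustration:

```python
import pandas as pd

# Hypothetical raw data with a missing age, a missing income,
# and an obviously bad age value (-1).
df = pd.DataFrame({
    "age": [25, None, 42, -1],
    "income": [50000, 62000, None, 58000],
})

# Turn the impossible age into a missing value, then fill gaps.
df["age"] = df["age"].mask(df["age"] < 0)          # bad value -> NaN
df["age"] = df["age"].fillna(df["age"].median())   # fill with median
df["income"] = df["income"].fillna(df["income"].mean())
```

After these steps the frame contains no missing or impossible values, so it is ready for feature engineering.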
3. Data reduction. Raw data sets often include redundant data that arises
from characterizing phenomena in different ways, or data that is not relevant
to a particular ML, AI or analytics task. Data reduction uses techniques like
principal component analysis to transform the raw data into a simpler form
suitable for particular use cases.
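The principal component analysis mentioned above can be sketched with scikit-learn; the data below is randomly generated purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 features

# Make one feature nearly redundant with another, the kind of
# duplication data reduction is meant to remove.
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Project the 10-dimensional data onto its top 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

The reduced matrix has shape (100, 3), and `pca.explained_variance_ratio_` reports how much of the original variance each retained component preserves.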
5. Data enrichment. In this step, data scientists apply feature
engineering libraries and techniques to the data to effect the desired transformations. The
result should be a data set organized to achieve the optimal balance
between the training time for a new model and the required compute.
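A minimal sketch of such an enrichment step, assuming a hypothetical orders table in pandas (all column names and values are invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "price": [10.0, 20.0, 5.0],
    "quantity": [2, 1, 4],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-02-01"]),
})

# Derive new features from the raw columns.
orders["revenue"] = orders["price"] * orders["quantity"]
orders["order_month"] = orders["order_date"].dt.month
```

Derived features like these give a model signal that the raw columns only express indirectly.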
6. Data validation. At this stage, the data is split into two sets. The first set
is used to train a machine learning or deep learning model. The second set
is the testing data that is used to gauge the accuracy and robustness of the
resulting model. This second step helps identify any problems in
the hypothesis used in the cleaning and feature engineering of the data. If
the data scientists are satisfied with the results, they can push the
preprocessing task to a data engineer who figures out how to scale it for
production. If not, the data scientists can go back and make changes to the
way they implemented the data cleansing and feature engineering steps.
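The split described in the validation step can be sketched with scikit-learn's `train_test_split`; the arrays and the 80/20 ratio here are only illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy feature matrix: 50 samples
y = np.arange(50)                   # toy labels

# Hold out 20% of the rows as the testing set; the fixed
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

The first pair of arrays trains the model; the held-out pair is used only to gauge accuracy and robustness.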
Data cleaning is the process of preparing raw data for analysis by
removing bad data, organizing the raw data, and filling in null
values. Ultimately, cleaning data prepares the data for data mining,
the process in which the most valuable information can be extracted
from the data set.
Data Cleaning Characteristics
Iterative process - Data cleaning in data mining is an iterative process involving repeated
cycles of identifying, assessing, and addressing data quality issues. It is often an ongoing
activity throughout the data mining process, as new insights and patterns may prompt the need
for further data cleaning.
Time-consuming - Data cleaning in data mining can be a time-consuming task, especially when
dealing with large and complex datasets. It requires careful examination of the data, identifying
errors or inconsistencies, and implementing appropriate corrections or treatments. The time
required for data cleaning can vary based on the complexity of the dataset and the extent of the
data quality issues.
Domain expertise - Data cleaning in data mining often requires domain expertise, as
understanding the context and characteristics of the data is crucial for effective cleaning. Domain
experts possess the necessary knowledge about the data and can make informed decisions about
handling missing values, outliers, or inconsistencies based on their understanding of the subject
matter.
Impact on analysis - Data cleaning in data mining directly impacts the quality and reliability of
the analysis and results obtained from data mining. Neglecting data cleaning can lead to biased
or inaccurate outcomes, misleading patterns, and unreliable insights. By performing thorough
data cleaning, analysts can ensure that the data used for analysis is accurate, consistent, and
representative of the real-world scenario.
MAPE Metric
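MAPE, the mean absolute percentage error, averages the relative error |y_i - y_hat_i| / |y_i| over all predictions and reports it as a percentage. A minimal NumPy sketch (the sample values are invented, and it assumes no true value is zero):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no element of y_true is zero, since each error is
    divided by the corresponding true value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Relative errors are 10%, 10% and 0%, so MAPE is their mean.
print(mape([100, 200, 400], [110, 180, 400]))
```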