MANUFACTURING ANALYSIS

PROJECT: OPTIMIZING
PRODUCTION EFFICIENCY

TEAM NAMES
Ahmed Mamdouh Khaled Saleh => 22012089
Mohammed Alaa Mohammed Abd-El Wahab => 2202123
Omar El-Said Mohamed El-Said => 2202122
Osama El-Said Abd-El Salam => 20221466241
Abdulrahman Ahmed Ali => 20191322564

PROJECT INTRODUCTION
The "House Price Analysis and Prediction" project aims to explore
the dynamics of the real estate market by analyzing housing data
to identify key factors influencing property prices. The dataset
contains detailed information about various property attributes,
including size, location, amenities, and detected defects.
The primary objectives of this project are to:
1. Understand the relationships between property features and
prices.
2. Develop predictive models to estimate house prices based
on specific features.
3. Provide insights into optimizing property valuations for
buyers, sellers, and real estate investors.
Through this analysis, the project seeks to uncover trends and
patterns in the housing market, offering data-driven
recommendations to stakeholders.

DATA SET (DATA COLLECTION)


Source: https://kaggle.com/datasets/yasserh/housing-prices-dataset
The dataset is sourced from Kaggle, titled Housing Prices
Dataset, and contains information about various housing
properties. Here’s a brief summary based on the dataset's
description:
Overview:
This dataset provides details about houses and their associated
attributes to analyze and predict housing prices. The goal is to
explore the impact of different features on property prices and
develop a pricing model.
Key Features:
• price: The price of the house (target variable for
predictions).
• area: The area of the house in square feet.
• bedrooms, bathrooms, stories: Number of bedrooms,
bathrooms, and floors in the house.
• mainroad: Whether the house is located on the main road
(yes/no).
• guestroom: Availability of a guest room (yes/no).
• basement: Indicates whether the house has a basement
(yes/no).
• hotwaterheating: Availability of hot water heating (yes/no).
• airconditioning: Presence of air conditioning (yes/no).
• parking: Number of parking spaces available.
• prefarea: Whether the house is in a preferred area (yes/no).
• furnishingstatus: The furnishing condition of the house
(furnished, semi-furnished, or unfurnished).
Purpose:
This dataset is ideal for:
• Conducting exploratory data analysis to identify trends and
patterns in housing prices.
• Understanding how different features influence house prices.
• Developing regression models to predict housing prices.

THE CODE
## IMPORT REQUIRED LIBRARIES
numpy: A fundamental library for numerical computation in Python. It
provides support for arrays, matrices, and mathematical operations.

pandas: A powerful library for data manipulation and analysis. It allows
working with data in tabular form (DataFrames).

matplotlib.pyplot: A library for creating static, interactive, and animated
visualizations in Python. The alias plt is commonly used for easy reference.

seaborn: A statistical data visualization library built on top of Matplotlib,
providing high-level interfaces for drawing attractive and informative graphics.

IMPORTING MACHINE LEARNING TOOLS


train_test_split: A function from scikit-learn to split the dataset
into training and testing subsets, ensuring proper evaluation of
models.
MinMaxScaler: A preprocessing tool that scales features to a
range (usually between 0 and 1), ensuring uniformity and
enhancing model performance.
mean_squared_error (MSE): A metric to measure the average
squared difference between predicted and actual values (lower is
better).
mean_absolute_error (MAE): Measures the average absolute
difference between predicted and actual values (lower is better).
LinearRegression: A simple and widely used regression model
for predicting continuous variables based on a linear relationship.

STYLING
sns.set_style: Sets the theme for Seaborn visualizations. The
'ticks' style adds tick marks to the axes, making the plots more
readable and polished.

PURPOSE
This setup prepares the environment for:
1. Data Loading and Preprocessing: Using NumPy and
Pandas.
2. Visualization: With Matplotlib and Seaborn.
3. Model Training and Evaluation: Using scikit-learn for
splitting data, scaling, and applying regression models.
4. Error Metrics: To evaluate and compare model
performance using MAE and MSE.
## LOADING DATA
pd.read_csv: A function from the Pandas library that reads a
CSV (Comma-Separated Values) file and loads it into a
DataFrame.
'D:Manufacturing Analytics\Final Project\Housing.csv': The
file path to the CSV file

PREPROCESSING

## CHECKING MISSING VALUES

The code checks for missing (null) values in the dataset and
provides a summary of how many are present in each column.

## CHECKING DUPLICATE VALUES

The code data.duplicated().sum() checks for duplicate rows in the
dataset and counts how many of them exist.

## GET STATISTICAL INFORMATION ABOUT OUR DATASET
The code uses the describe() function to generate descriptive
statistics for the dataset, tailored to numeric and categorical data
separately.
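
The three checks above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical DataFrame in place of Housing.csv; the variable names and sample values are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for Housing.csv (values are illustrative only).
data = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 9100000],
    "area": [7420, 8960, 8960, np.nan],
    "mainroad": ["yes", "yes", "yes", "no"],
})

# Missing values per column.
missing_per_column = data.isnull().sum()

# Number of fully duplicated rows (the first occurrence is not counted).
duplicate_count = data.duplicated().sum()

# Descriptive statistics, numeric and categorical columns separately.
numeric_summary = data.describe(include=[np.number])
categorical_summary = data.describe(exclude=[np.number])

print(missing_per_column)
print("Duplicate rows:", duplicate_count)
print(numeric_summary)
print(categorical_summary)
```

On this sample, one value is missing in area and one row is a duplicate, so both checks report 1.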

VISUALIZATION AND ANALYSIS

This code identifies and separates the numerical and categorical
columns in the dataset based on their data types.
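
One common way to do this separation is pandas' select_dtypes. The sketch below assumes a small hypothetical sample; the list names numerical_columns and categorical_columns are illustrative and may differ from the original code.

```python
import pandas as pd

# Hypothetical sample with the same kinds of columns as the housing dataset.
data = pd.DataFrame({
    "price": [4500000, 3200000, 6100000],
    "area": [3000, 2400, 5200],
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
})

# Numeric dtypes go to one list; everything else is treated as categorical.
numerical_columns = data.select_dtypes(include="number").columns.tolist()
categorical_columns = data.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)
print(categorical_columns)
```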
## DISTRIBUTIONS OF NUMERICAL COLUMNS
COUNT PLOT OF CATEGORY
CORRELATION BETWEEN NUMERICAL FEATURES
SELECTING RELEVANT COLUMNS TO THE TARGET
The code selects the relevant columns that are believed to have a
high correlation with the target variable (price).

PERFORMING SCALING

The code normalizes the selected features (from the cdf DataFrame)
using the MinMaxScaler from scikit-learn, which scales each feature
to a fixed range, [0, 1] by default. Normalizing puts all features on
a comparable scale, which matters most for algorithms that rely on
distance calculations or gradient-based optimization (like KNN or
neural networks).
scaler.fit_transform(cdf) first calculates the minimum and maximum
values of each feature in the dataset and then transforms the data to
fit within the [0, 1] range. pd.DataFrame(..., columns=cdf.columns)
converts the scaled values back into a DataFrame, preserving the
original column names (e.g., 'area', 'bathrooms', etc.).
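
A minimal sketch of this scaling step, assuming a three-row hypothetical cdf (the real cdf holds the selected housing columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the cdf DataFrame of selected columns.
cdf = pd.DataFrame({
    "area": [2000, 4000, 6000],
    "bathrooms": [1, 2, 3],
    "price": [3000000, 5000000, 9000000],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
# fit_transform learns each column's min and max, then applies
# (value - min) / (max - min) column by column.
scaled = pd.DataFrame(scaler.fit_transform(cdf), columns=cdf.columns)

print(scaled)
```

Each column now runs from 0 (its original minimum) to 1 (its original maximum), and the column names are unchanged.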

DETERMINE THE FEATURES AND TARGET COLUMN


The code separates the features from the target variable in the cdf
DataFrame. Dropping the price column yields x, which contains all the
features ('area', 'bathrooms', 'bedrooms', 'stories', and 'parking').
The price column itself is assigned to y.

SPLIT DATA INTO TRAIN AND TEST
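
The feature/target separation and the train/test split might look like the sketch below. The data is a hypothetical stand-in for the scaled cdf, and test_size=0.3 and random_state=42 are assumptions; the report does not state the values used for this model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the scaled cdf DataFrame (10 rows).
cdf = pd.DataFrame({
    "area": [i / 9 for i in range(10)],
    "bedrooms": [0.0, 0.5, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 1.0, 0.5],
    "price": [i / 9 for i in range(10)],
})

x = cdf.drop(columns="price")  # features
y = cdf["price"]               # target

# test_size and random_state are assumptions, not values from the report.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(train_x.shape, test_x.shape)
```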

APPLYING LINEAR REGRESSION ALGORITHM

TRAIN OUR MODEL


The code trains a Linear Regression model using the features
(train_x) and target variable (train_y) from our training dataset. Linear
Regression models the relationship between independent
variables (features) and a dependent variable (target).

The fit() method trains the model by adjusting the parameters (coefficients)
to best fit the data, minimizing the squared error between predicted and
actual values.

TESTING OUR MODEL

The line y_pred = model.predict(test_x) makes predictions on the test
features using the trained Linear Regression model.

EVALUATE OUR MODEL


MSE measures the average squared difference between the actual values (test_y) and the predicted
values (y_pred). It is a common metric for evaluating regression models and quantifies how far off
the predictions are, on average. Because the differences are squared, large errors dominate the result.

MAE measures the average of the absolute differences between the actual values and the predicted
values. Unlike MSE, MAE does not square the differences, which makes it less sensitive to large
errors (outliers). It gives the typical size of a prediction error without giving extra weight to
larger errors.
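
The train, predict, and evaluate steps can be sketched end to end as follows. The data here is synthetic (an assumption made so the example is self-contained), so the printed numbers say nothing about the housing results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the scaled housing features (assumption).
rng = np.random.default_rng(0)
x = rng.random((100, 5))                      # five scaled features
true_weights = np.array([0.4, 0.2, 0.1, 0.2, 0.1])
y = x @ true_weights + rng.normal(0, 0.05, 100)

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.3, random_state=42
)

model = LinearRegression()
model.fit(train_x, train_y)       # estimate the coefficients
y_pred = model.predict(test_x)    # predict on unseen rows

mse = mean_squared_error(test_y, y_pred)   # mean of (actual - predicted)^2
mae = mean_absolute_error(test_y, y_pred)  # mean of |actual - predicted|
print(f"MSE: {mse:.4f}  MAE: {mae:.4f}")
```

MAE can never exceed the square root of MSE, so a large gap between the two is a hint that a few large errors dominate.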

K-NEAREST NEIGHBORS (KNN) MODEL FOR HOUSE PRICE PREDICTION
In this model, we apply the K-Nearest Neighbors (KNN) algorithm to
predict house prices based on various features (such as area, bathrooms,
bedrooms, etc.). The KNN algorithm is a non-parametric method used for
both classification and regression. For this regression task, KNN predicts
the target value (house price) based on the average of the target values of
the nearest neighbors in the feature space.
This cell initializes a K-Nearest Neighbors model with 5 neighbors and trains it using the provided training
data. After training, the model will be able to predict house prices based on the features in train_x.

This cell runs the KNN model on the test data (test_x) and stores the predicted house prices in the
variable knn_predictions. These predictions can be further evaluated (e.g., using Mean Squared Error or
Mean Absolute Error) to assess the performance of the model.

• This cell evaluates the KNN regression model by calculating two
common error metrics: Mean Squared Error (MSE) and Mean
Absolute Error (MAE). Both metrics help assess the performance of
the model in predicting house prices:

o MSE penalizes larger errors more heavily and provides an
overall measure of model accuracy.

o MAE gives an average of the absolute errors and is less
sensitive to extreme values compared to MSE.
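
The three KNN cells described above can be sketched on a tiny one-feature example. The data is synthetic and chosen so the neighbor averages are easy to check by hand; only n_neighbors=5 and the name knn_predictions come from the report.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Tiny synthetic example with one feature (assumption).
train_x = np.arange(10, dtype=float).reshape(-1, 1)   # 0, 1, ..., 9
train_y = train_x.ravel() * 2.0                        # price-like target
test_x = np.array([[4.0], [7.2]])
test_y = np.array([8.0, 14.4])

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_x, train_y)

# Each prediction is the mean target of the 5 closest training points.
knn_predictions = knn.predict(test_x)

knn_mse = mean_squared_error(test_y, knn_predictions)
knn_mae = mean_absolute_error(test_y, knn_predictions)
print(knn_predictions, knn_mse, knn_mae)
```

For the query 7.2, the five nearest training points are 5, 6, 7, 8, and 9, whose targets average to 14, so the prediction misses the actual value 14.4 by 0.4.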

This cell visualizes the performance of the KNN model by plotting a scatter
plot of actual vs. predicted house prices. The blue dots represent the
predicted vs. actual values, and the red dashed line represents the ideal fit
where predicted values match the actual values exactly.

If the points are scattered close to the red line, it indicates that the model is
making accurate predictions. The plot helps you assess the model's
prediction performance visually.
COMPARING THE MODELS' EFFICIENCY

• This cell compares the efficiency of two regression models: Linear
Regression and K-Nearest Neighbors (KNN).

• The R-squared (R²) score is used to assess how well the models
predict the target variable (house prices). It measures the proportion
of the variance in the target variable that is explained by the model.

• A bar plot is generated to visually compare the R² scores of both
models, helping to identify which model explains more of the variance
in the data.

The model with the higher R² score is the better performer in terms of
predicting the house prices.

Result: the Linear Regression model achieves the higher R² score, so it
is the better model here.
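
A minimal sketch of the R² comparison is shown below. The prediction arrays are hypothetical values made up for illustration, not the report's actual predictions, and the bar plot is omitted.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual prices and two sets of predictions (illustrative only).
test_y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
linear_predictions = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
knn_predictions = np.array([2.0, 2.0, 3.0, 4.0, 4.0])

# R^2 = 1 - SS_res / SS_tot: the share of variance the model explains.
scores = {
    "Linear Regression": r2_score(test_y, linear_predictions),
    "KNN": r2_score(test_y, knn_predictions),
}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.2f}")
print("Better model:", best_model)
```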

OBJECTIVES
DETECT BOTTLENECKS IN PRODUCTION PROCESSES
Definition of Bottlenecks: Bottlenecks in production processes refer to
steps or stages where the flow of operations slows down or gets delayed,
thus limiting the overall throughput of the process. In data processing,
bottlenecks can occur in various stages, such as data loading,
preprocessing, model training, and prediction, which can significantly
affect performance.

This function measures the execution time of any function (func) passed to
it. It records the time before and after the function is executed, calculates
the duration, and returns the result of the function along with the time taken.
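
A timing helper of this kind can be sketched with the standard library alone. The name measure_time and the choice of time.perf_counter are assumptions; the original function may differ.

```python
import time

def measure_time(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage: time any pipeline stage by passing the callable and its arguments.
total, seconds = measure_time(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

Each pipeline stage (loading, preprocessing, scaling, training, prediction) can be wrapped the same way, and the collected durations compared.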

This section measures the time taken to load the data from the specified
CSV file using pd.read_csv(). Data loading can sometimes be a bottleneck,
especially with large datasets.

This section measures the time spent on preprocessing the data, including
checking for missing values, duplicates, and generating descriptive
statistics. Data preprocessing can often take time, depending on the
dataset's size and complexity.

This section measures the time taken to scale the features of the dataset
using the MinMaxScaler. Scaling is important in machine learning,
especially for models sensitive to feature scaling (like KNN or gradient-based
algorithms), but it can be time-consuming for large datasets.

This section measures the time taken to train a linear regression model on
the training data. Model training is often one of the most time-consuming
steps, especially with complex models and large datasets.

This section measures the time taken to make predictions using the trained
model on the test data. Prediction times can also be an issue, especially in
real-time applications.

The bar plot compares the time taken for each stage of the machine
learning pipeline. The longer the time, the more likely it is a bottleneck.

After visualizing the execution times for each step in the machine learning
pipeline, it appears that preprocessing is the stage that takes the most
time.

PREDICTIVE MODEL: A MODEL PREDICTING FAILURES AND DEFECTS
A predictive model is designed to forecast future outcomes based on
historical data. In this context, the model is predicting failures and defects
in housing data by using various features of a house, such as area,
bedrooms, bathrooms, stories, parking, and price.

Specifically, the goal of this model is to predict the presence or absence of
a defect (a binary classification) in the houses based on a newly created
variable called defects. The defects variable indicates whether a house is
likely to have a defect based on its price per area, which is calculated by
dividing the house's price by its area.

Step 1: Create the target variable "defects"

1. Price per Area Calculation:

o The new variable price_per_area is calculated by dividing the
price of a house by its area. This metric helps in understanding
how expensive a house is relative to its size.

2. Defining the Defect Threshold:

o The defects variable is created by identifying houses that have a
price per area below a certain threshold. The threshold is calculated
as the mean minus two times the standard deviation of
price_per_area. Houses with a price per area below this threshold are
considered to have a defect (defects = 1), while others are defect-free
(defects = 0).
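
Step 1 can be sketched as below, assuming a hypothetical ten-row sample in which one house is priced far below the rest. The column names price_per_area and defects follow the text; the data values are illustrative only.

```python
import pandas as pd

# Hypothetical sample: nine typical houses and one priced far below the rest.
data = pd.DataFrame({
    "price": [100000] * 9 + [10000],
    "area": [1000] * 10,
})

data["price_per_area"] = data["price"] / data["area"]

# Threshold: mean minus two standard deviations of price_per_area.
threshold = data["price_per_area"].mean() - 2 * data["price_per_area"].std()

# 1 = flagged as a defect, 0 = defect-free.
data["defects"] = (data["price_per_area"] < threshold).astype(int)
print(round(threshold, 2), data["defects"].sum())
```

Here the mean price per area is 91 and the sample standard deviation is about 28.5, so the threshold is roughly 34 and only the last house (price per area 10) is flagged.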

Step 2: Data Preprocessing

1. Handling Missing Data:

o The code drops any rows that have missing values.

2. Selecting Features and Target:

o The relevant features for prediction are chosen (i.e., area, bedrooms,
bathrooms, stories, parking, price), and the target variable is set to
defects.

3. Scaling the Features:

o The features are scaled using MinMaxScaler to ensure that the
model does not favor any particular feature due to differences in
scale. The scaler transforms the features to a range between 0 and 1.

Step 3: Train-Test Split

1. Splitting the Data:

o The data is split into training and testing sets (70% for training,
30% for testing). This allows the model to be trained on one
subset of the data and evaluated on another to assess its
performance.

Step 4: Train the Linear Regression Model

1. Model Training:

o A Linear Regression model is instantiated and trained on the
training data (X_train and y_train).

Step 5: Make Predictions

1. Predicting Defects:

o The trained model is used to predict the defects on the test data
(X_test).

Step 6: Evaluate the Model

1. Evaluating Model Performance:

o The model's performance is evaluated using Mean Squared
Error (MSE) and Mean Absolute Error (MAE), both of which
provide insights into how well the model's predictions match the
actual values.
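
Steps 2 through 6 can be sketched as one short pipeline. The data below is synthetic (an assumption so the example runs on its own), random_state=42 is an assumption, and the final 0.5 cutoff is an illustrative way to turn Linear Regression's continuous output into 0/1 labels; it is not taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic housing-like data (assumption): 95 typical houses, 5 very cheap ones.
rng = np.random.default_rng(1)
n = 100
area = rng.integers(3000, 6000, n)
price = area * 1000 * rng.uniform(0.9, 1.1, n)
price[:5] = area[:5] * 300           # five houses priced far below the rest
data = pd.DataFrame({
    "area": area,
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "stories": rng.integers(1, 4, n),
    "parking": rng.integers(0, 4, n),
    "price": price,
})
ppa = data["price"] / data["area"]
data["defects"] = (ppa < ppa.mean() - 2 * ppa.std()).astype(int)

# Step 2: drop missing rows, select features and target, scale to [0, 1].
data = data.dropna()
features = ["area", "bedrooms", "bathrooms", "stories", "parking", "price"]
X = MinMaxScaler().fit_transform(data[features])
y = data["defects"]

# Step 3: 70/30 split (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Steps 4-6: train, predict, evaluate.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE: {mse:.4f}  MAE: {mae:.4f}")

# Linear Regression returns continuous scores; 0.5 is an assumed cutoff
# for turning them into 0/1 labels.
y_label = (y_pred >= 0.5).astype(int)
```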
The predictive model aims to forecast whether a house has a defect based
on its features, specifically focusing on its price per area. The target is
binary: 1 for defect and 0 for no defect. Because Linear Regression
outputs continuous scores rather than class labels, a cutoff (for example
0.5) is needed to turn its predictions into 1 or 0.

Model Performance Metrics:

• Mean Squared Error (MSE): Measures the average of the squares
of the errors, indicating how far the predicted values are from the
actual values.

• Mean Absolute Error (MAE): Measures the average of the absolute
errors, giving a clear indication of how much the predictions differ
from the true values on average.

Steps Taken:

1. Target Variable Creation: A target variable defects was created
based on a threshold defined by the price per area.

2. Preprocessing: The dataset was cleaned and scaled to ensure
accurate predictions.

3. Model Training: A Linear Regression model was trained on the data.

4. Evaluation: The model was evaluated using MSE and MAE,
providing insights into its predictive accuracy.

Conclusion: The predictive model is capable of forecasting whether a
house will have a defect based on its price per area and other features. By
evaluating MSE and MAE, we can understand the accuracy of the
predictions and further optimize the model if needed.

ASSESS PRODUCTION QUALITY THROUGH DEFECT DETECTION
The goal of this analysis is to identify potential defects in housing data by
detecting certain patterns or anomalies that could indicate subpar quality or
production issues. Three types of defects are assessed:

1. Defects Based on Unusually Low Prices:
The code calculates the price per area for each house and identifies
houses with a price per area significantly lower than the average.
These houses may be undervalued, potentially due to hidden defects
or errors in pricing.

2. Defects Based on Inefficient Layouts:
It checks for houses with an unusually high number of bedrooms
relative to their area, which may suggest inefficient use of space, or
layouts that don't meet typical consumer preferences.

3. Defects in High-Price Houses Lacking Premium Features:
The analysis identifies houses priced above the average but lacking
essential premium features such as air conditioning, a preferred area,
or sufficient parking spaces. High-price houses without these features
may fail to meet customer expectations.

This code performs a thorough analysis to detect defects in housing data.
The defects are categorized based on price anomalies, inefficient layouts,
and missing premium features in high-price houses. The detected defects
are then compiled and saved for further inspection.
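
The three rules might be expressed as in the sketch below. The sample is hypothetical, and several cutoffs are assumptions because the report does not state them: the "mean + 2 std" rule for bedrooms per area, treating zero parking spaces as insufficient, and flagging a house when any one premium feature is missing.

```python
import pandas as pd

# Hypothetical sample: ten typical houses plus three constructed outliers.
typical = {"price": 5_000_000, "area": 5000, "bedrooms": 3, "parking": 2,
           "airconditioning": "yes", "prefarea": "yes"}
rows = [dict(typical) for _ in range(10)] + [
    {**typical, "price": 500_000},                                  # very cheap
    {**typical, "price": 2_000_000, "area": 2000, "bedrooms": 6},   # crowded
    {**typical, "price": 9_000_000, "area": 9000, "bedrooms": 4,
     "parking": 0, "airconditioning": "no", "prefarea": "no"},      # few features
]
data = pd.DataFrame(rows)

# 1. Low price: price per area more than two standard deviations below the mean.
ppa = data["price"] / data["area"]
low_price = data[ppa < ppa.mean() - 2 * ppa.std()]

# 2. Inefficient layout: bedrooms per unit area unusually high
#    (the mean + 2 std cutoff is an assumption).
bpa = data["bedrooms"] / data["area"]
inefficient_layout = data[bpa > bpa.mean() + 2 * bpa.std()]

# 3. Above-average price but at least one premium feature missing
#    (treating "no parking space" as insufficient is an assumption).
lacks_feature = (
    (data["airconditioning"] == "no")
    | (data["prefarea"] == "no")
    | (data["parking"] == 0)
)
missing_features = data[(data["price"] > data["price"].mean()) & lacks_feature]

defect_counts = {
    "Low Price": len(low_price),
    "Inefficient Layout": len(inefficient_layout),
    "Missing Features": len(missing_features),
}
print(defect_counts)
```

The defect_counts dictionary has one entry per category, which is the shape a bar chart of defects per category needs.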

Key Insights:

• Low price-to-area ratios could indicate houses that are undervalued,
and further investigation may be required.

• Inefficient layouts, particularly houses with too many bedrooms for
the available area, should be considered for redesign.

• High-price houses lacking premium features could fail to meet the
expectations of buyers, and improvements or adjustments are
recommended.

By identifying and addressing these defects, production quality can be
improved, leading to better housing offerings and more accurate pricing
strategies.

VISUALIZING DEFECT CATEGORIES

The code visualizes the number of defects detected in each of the three
defect categories: Low Price, Inefficient Layout, and Missing Features.
This is done using a bar chart, which gives a clear and intuitive
representation of the defects across different categories.

PROPOSING ACTIONABLE INSIGHTS TO REDUCE WASTAGE AND MAXIMIZE THROUGHPUT

In the analysis of the housing data, we have identified potential areas of
inefficiency and opportunities for optimization. Below are the key insights
based on the identified factors of overpricing and feature utilization:

1. Identifying Potential Wastage: Overpriced Houses

• We analyzed the pricing data to detect houses that may be
overpriced by comparing their prices to the mean price plus two
standard deviations. This helps identify houses priced excessively
relative to the market.

• Result: A number of houses were found to be overpriced. These
houses could be facing lower demand, leading to inefficiencies and
wasted resources.

Actionable Insight:

o Consider reducing the pricing of these overvalued houses, as
they may be struggling to sell, leading to unnecessary wastage
of marketing and maintenance efforts.

o Adjusting the price points can make these properties more
attractive to potential buyers and help achieve a better market
balance.
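
The overpricing rule (price above the mean plus two standard deviations) can be sketched on a hypothetical price list; the values below are illustrative and not from the dataset.

```python
import pandas as pd

# Hypothetical prices: nine similar houses and one far more expensive one.
data = pd.DataFrame({"price": [4_000_000] * 9 + [13_000_000]})

# Upper cutoff: mean plus two standard deviations of price.
cutoff = data["price"].mean() + 2 * data["price"].std()
overpriced = data[data["price"] > cutoff]
print(f"cutoff = {cutoff:,.0f}, overpriced houses = {len(overpriced)}")
```

With a mean of 4.9 million and a sample standard deviation of about 2.85 million, the cutoff is roughly 10.6 million, so only the 13 million house is flagged.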

2. Maximizing Throughput: Optimizing Feature Allocation

• We examined how the presence of certain features (e.g., parking and
air conditioning) correlates with the average price of houses. This
analysis provides insights into which features are more valuable and
should be prioritized or optimized.

Parking Analysis:

o Houses with 2 or more parking spaces tend to have higher
average prices, indicating that more parking spaces are associated
with significantly higher property value.

Air Conditioning Analysis:

o Houses that include air conditioning have higher average
prices, suggesting that air conditioning is a feature that attracts
more value in the market.

Actionable Insights:

o Optimize parking allocation: Given that houses with 2 or
more parking spaces have higher prices, properties with fewer
parking spaces might benefit from adding more. This could
increase the value and desirability of the property.

o Promote air-conditioned properties: Air conditioning is a
premium feature that contributes positively to the property's
value. Highlighting and promoting homes with this feature can
help increase sales or rental prices.
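
The parking and air-conditioning comparisons above amount to grouped averages. A minimal sketch, assuming a hypothetical six-row sample and an illustrative helper column parking_group:

```python
import pandas as pd

# Hypothetical sample (values are illustrative only).
data = pd.DataFrame({
    "price": [3_000_000, 4_000_000, 6_000_000, 8_000_000, 5_000_000, 7_000_000],
    "parking": [0, 1, 2, 3, 1, 2],
    "airconditioning": ["no", "no", "yes", "yes", "no", "yes"],
})

# Average price for houses with fewer than two vs. two or more parking spaces.
data["parking_group"] = (data["parking"] >= 2).map(
    {True: "2 or more", False: "fewer than 2"}
)
avg_price_by_parking = data.groupby("parking_group")["price"].mean()

# Average price with and without air conditioning.
avg_price_by_ac = data.groupby("airconditioning")["price"].mean()

print(avg_price_by_parking)
print(avg_price_by_ac)
```

A difference in group averages shows association only; houses with more parking may also be larger or better located, so the averages alone do not prove that adding a feature raises the price.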
ANALYZING MACHINE UTILIZATION AND DOWNTIME

The simulation provides a detailed analysis of machine utilization and
downtime by simulating the machine's processes, such as model training
and prediction. Here's how the results from the simulated code could be
interpreted:

1. Machine Utilization: This metric indicates how effectively the machine
is being used for active tasks (training and prediction). For example,
with training lasting 200 seconds and prediction lasting 150 seconds,
the total active time is 350 seconds. If the total available time is 1000
seconds, the utilization is 350 / 1000 = 35%.

2. Machine Downtime: This metric reflects how much of the machine's time
is spent idle. In the given example, the downtime is simply the remaining
time after the active tasks are completed:

\[ \text{Idle Time} = \text{Total Available Time} - \text{Total Active Time} = 1000 - 350 = 650 \text{ seconds} \]

Final Insights:

• Machine Utilization: The machine was actively used for 35% of the
time for training and prediction, which is a key performance indicator.
A low utilization percentage suggests potential inefficiencies, where
the machine could be used more productively.

• Machine Downtime: The machine was idle for 65% of the time,
indicating there may be room for optimization. This downtime could
be due to factors like unoptimized processes or delays in training and
prediction phases.

Actionable Recommendations:

• Optimize active processes: Improve training and prediction
efficiency to reduce idle time. This could involve parallelizing tasks,
reducing processing time, or improving algorithms.

• Monitor machine health: Since the simulation accounts for
breakdowns (though not implemented here), monitoring for potential
failures and maintenance needs can further reduce downtime.

By improving these aspects, the overall machine efficiency can be
enhanced, ensuring higher utilization and lower downtime, which leads to
improved throughput and reduced operational costs.
DATA REQUIREMENT: SIMULATE MACHINE LOGS FOR
RUNTIME, IDLE TIME, AND BREAKDOWNS
The provided code simulates machine operation logs, specifically focusing
on training time, prediction time, idle time, and breakdowns. It tracks
the performance of a machine during these processes and provides
insights into its utilization and downtime.

Explanation of Simulation:

A simulation is the process of mimicking real-world processes or systems
in a controlled environment to understand their behavior or to test different
scenarios. In this case, the simulation mimics the operations of a machine
that performs tasks like training a model and making predictions. The goal
is to assess the machine's efficiency, downtime, and potential failures
(breakdowns) during these tasks.

Here’s a breakdown of how the code simulates the machine’s operations:

1. Machine Time Tracking:

o The machine's total available time is set to 1000 seconds.

o The code simulates two key processes: model training and
model prediction, during which the machine is active.

o During each phase, random breakdowns can occur, adding to
the idle time and affecting machine efficiency.

2. Simulation Steps:

o Training: The code simulates the time spent on training the
model by sleeping for a given duration (training_duration).

o Prediction: The code then simulates the time spent on
prediction by sleeping for a specified duration
(prediction_duration).

o Breakdowns: The code includes a random chance of
breakdown during both training and prediction phases. If a
breakdown occurs, it increases the downtime by 10 seconds
for each breakdown.

o Idle Time Calculation: After the training and prediction phases,
the code calculates how much time the machine spends idle
(not performing tasks) based on the total available time,
processing time, and any breakdowns.

o Machine Utilization: This metric represents how much of the
machine’s available time was used for active processes
(training and prediction).

o Downtime: Represents how much of the machine’s time was
spent in an idle state, either due to breakdowns or non-active
periods.

3. Reporting: The code outputs the following:

o Training Time: The duration it took to train the model.

o Prediction Time: The duration it took to make predictions.

o Breakdowns: The number of breakdowns that occurred during
both processes.

o Machine Utilization: The percentage of total available time
spent actively performing tasks.

This simulation helps analyze the efficiency of a machine during its
operations by providing insights into key performance metrics such as
utilization, downtime, and breakdowns. By identifying bottlenecks and
potential failure points (e.g., machine breakdowns), it allows for better
decision-making in production environments and helps in designing
strategies to optimize machine utilization and reduce downtime.
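
The simulation described above can be sketched with the standard library. This is one possible interpretation, not the original code: the function name simulate_machine, the breakdown_probability default, and the choice to skip the actual sleeping (so the example runs instantly) are all assumptions. Breakdown time is counted as part of downtime, so pure idle time is what remains after active time and breakdown time.

```python
import random

def simulate_machine(total_available=1000, training_duration=200,
                     prediction_duration=150, breakdown_probability=0.1,
                     breakdown_penalty=10, seed=None):
    """Simulate one run; durations are in seconds and are not actually slept."""
    rng = random.Random(seed)
    breakdowns = 0
    # One possible breakdown per phase (training, then prediction).
    for _phase in ("training", "prediction"):
        if rng.random() < breakdown_probability:
            breakdowns += 1

    active_time = training_duration + prediction_duration
    breakdown_time = breakdowns * breakdown_penalty
    idle_time = total_available - active_time - breakdown_time
    downtime = idle_time + breakdown_time  # everything that is not active

    return {
        "training_time": training_duration,
        "prediction_time": prediction_duration,
        "breakdowns": breakdowns,
        "idle_time": idle_time,
        "utilization_pct": 100 * active_time / total_available,
        "downtime_pct": 100 * downtime / total_available,
    }

report = simulate_machine(seed=0)
for key, value in report.items():
    print(f"{key}: {value}")
```

With the default durations the machine is active for 350 of 1000 seconds, which gives 35% utilization and 65% downtime regardless of how many breakdowns occur; breakdowns only shift time from idle to broken-down.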
DASHBOARD
The bar chart that displays the number of bedrooms on the x-axis and the average price
on the y-axis is important for several reasons:

1. Identify Pricing Trends: It helps in understanding how the price of properties
varies with the number of bedrooms. This allows for insights into which bedroom
configurations are more expensive and how pricing scales with the size of the
property.

2. Market Segmentation: By visualizing the relationship between bedrooms and
price, it helps in identifying market segments. For example, properties with more
bedrooms might attract higher prices, while smaller properties could have lower
prices. This can be useful for buyers, sellers, and real estate agents.

3. Investment Decisions: Real estate investors can use this information to make
more informed decisions. It allows them to assess which types of properties
(based on bedroom count) offer the best value or return on investment.

4. Price Forecasting: The chart can help in predicting future price trends based on
the number of bedrooms, providing useful data for appraisers and market
analysts.

5. Identifying Outliers: Any significant deviations in price for certain bedroom
counts could highlight unusual or unique properties, which may warrant further
investigation.

The pie chart uses "prefarea" as the legend and the average of "price" as the values.

Importance of This Visualization:

1. Identifying Price Disparities: A pie chart helps to quickly identify if there's a
significant difference in prices between properties in preferred areas vs.
non-preferred areas.

2. Market Insights: By displaying the average prices for each category, you gain
insights into how the location (prefarea) affects the price.

3. Decision Making: For both buyers and sellers, understanding the impact of
location on pricing can help in making informed decisions about purchasing or
listing properties.
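
The values behind both dashboard charts are grouped averages of price. The sketch below shows an equivalent aggregation in pandas on a hypothetical sample; it is not the dashboard tool's own implementation, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample (values are illustrative only).
data = pd.DataFrame({
    "price": [3_000_000, 5_000_000, 4_000_000, 6_000_000, 9_000_000, 7_000_000],
    "bedrooms": [2, 2, 3, 3, 4, 4],
    "prefarea": ["no", "yes", "no", "no", "yes", "yes"],
})

# Bar chart input: average price for each bedroom count.
avg_price_by_bedrooms = data.groupby("bedrooms")["price"].mean()

# Pie chart input: average price for preferred vs. non-preferred areas.
avg_price_by_prefarea = data.groupby("prefarea")["price"].mean()

print(avg_price_by_bedrooms)
print(avg_price_by_prefarea)
```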
