
ML assignment

The document outlines a series of tasks involving data analysis and machine learning techniques using Python. It includes tasks such as clustering with KMeans and hierarchical clustering, calculating utilities in a grid environment, performing factor analysis on housing data, and building a Random Forest classifier for fraud detection. Each task demonstrates data preprocessing, model training, and evaluation methods, highlighting the effectiveness of the chosen algorithms.


Task 1:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset
fl = pd.read_csv('ecommerce.csv')

# Display the first few rows and the structure of the dataframe
print(fl.head())
print(fl.info())

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

# Normalize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(fl)

# Elbow Method: within-cluster sum of squares (inertia) for k = 1..10
Ac = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init='auto', random_state=10)
    kmeans.fit(scaled_data)
    Ac.append(kmeans.inertia_)

# Plotting the results of the Elbow Method
plt.plot(range(1, 11), Ac)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (Ac)')
plt.show()

# Silhouette Score for k = 2..10
for n_clusters in range(2, 11):
    clusterer = KMeans(n_clusters=n_clusters, n_init='auto', random_state=10)
    cluster_labels = clusterer.fit_predict(scaled_data)
    silhouette_avg = silhouette_score(scaled_data, cluster_labels)
    print(f"For n_clusters = {n_clusters}, the average silhouette_score is : {silhouette_avg}")

optimal_clusters = 4

# Applying Hierarchical Clustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Using the 'ward' method for hierarchical clustering
z = linkage(scaled_data, method='ward')

# Plotting the dendrogram
plt.figure(figsize=(10, 5))
dendrogram(z, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Customer Index')
plt.ylabel('Distance')
plt.show()

For n_clusters = 2, the average silhouette_score is : 0.3489864966095777
For n_clusters = 3, the average silhouette_score is : 0.507952731017849
For n_clusters = 4, the average silhouette_score is : 0.6109029564611685
For n_clusters = 5, the average silhouette_score is : 0.5027280434062202
For n_clusters = 6, the average silhouette_score is : 0.3968672862014622
For n_clusters = 7, the average silhouette_score is : 0.30834094932263917
For n_clusters = 8, the average silhouette_score is : 0.2998299646859838
For n_clusters = 9, the average silhouette_score is : 0.29659504567055106
For n_clusters = 10, the average silhouette_score is : 0.29147277579139513

Explanation: pandas is imported to read the file, and the data is then normalized with StandardScaler. KMeans is imported from scikit-learn and fitted to the dataset. Both the elbow method and the silhouette score are computed, and both indicate an optimum of 4 clusters. The elbow method and silhouette score are KMeans diagnostics rather than hierarchical-clustering criteria, so the number of clusters is taken from the KMeans analysis. Finally, hierarchical clustering with Ward linkage is imported from SciPy and the dendrogram is plotted.
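As a small follow-up, here is a minimal sketch (not part of the original submission) of actually assigning the 4 cluster labels, once with KMeans and once by cutting the Ward linkage; it assumes the fl, scaled_data and z objects from the code above and that all columns of fl are numeric.

from scipy.cluster.hierarchy import fcluster

# KMeans with the chosen number of clusters
final_kmeans = KMeans(n_clusters=4, n_init='auto', random_state=10)
kmeans_labels = final_kmeans.fit_predict(scaled_data)

# Equivalent flat clustering obtained by cutting the Ward linkage into 4 clusters
hier_labels = fcluster(z, t=4, criterion='maxclust')

# Attach the labels to the original dataframe and inspect the cluster profiles
fl['kmeans_cluster'] = kmeans_labels
fl['hier_cluster'] = hier_labels
print(fl.groupby('kmeans_cluster').mean())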

Task 2:
import numpy as np

# Define the grid with utilities
# For simplicity, inaccessible states are set to None and terminal states keep their rewards
grid_utilities = np.array([
    [7.41, 7.52, 7.65, 10,   7.54],
    [7.31, None, -10,  5.82, -10],
    [7.15, None, -10,  4.31, None],
    [6.98, 6.77, 6.44, 5.87, 6.12],
    [6.90, 6.80, 6.59, 6.51, 6.34]
])

# Define the reward for non-terminal states
reward = -0.1

# Define the success probability
success_prob = 0.8

# Define the failure probability (0.2, split equally between the two perpendicular directions)
failure_prob = 0.2 / 2

# Function to calculate the expected utility of a given action from a given state
def calculate_utility(state, action, grid):
    nrows, ncols = grid.shape
    x, y = state

    # Directions
    directions = {
        'UP': (-1, 0),
        'DOWN': (1, 0),
        'LEFT': (0, -1),
        'RIGHT': (0, 1)
    }

    # Calculate the new position after the intended action
    dx, dy = directions[action]
    new_x, new_y = x + dx, y + dy
    if 0 <= new_x < nrows and 0 <= new_y < ncols and grid[new_x, new_y] is not None:
        primary_utility = grid[new_x, new_y]
    else:
        primary_utility = grid[x, y]

    # Calculate the utility of the perpendicular moves
    perp_utility = 0
    perp_actions = ['LEFT', 'RIGHT'] if action in ['UP', 'DOWN'] else ['UP', 'DOWN']
    for perp_action in perp_actions:
        dx, dy = directions[perp_action]
        perp_x, perp_y = x + dx, y + dy
        if 0 <= perp_x < nrows and 0 <= perp_y < ncols and grid[perp_x, perp_y] is not None:
            perp_utility += failure_prob * grid[perp_x, perp_y]
        else:
            perp_utility += failure_prob * grid[x, y]

    # Calculate the total expected utility
    total_utility = success_prob * primary_utility + perp_utility + reward
    return total_utility

green_states = [(1, 0), (3, 2), (4, 1)]  # Placeholder positions for the green states
optimal_actions = {}
for state in green_states:
    utilities = {action: calculate_utility(state, action, grid_utilities)
                 for action in ['UP', 'DOWN', 'LEFT', 'RIGHT']}
    optimal_action = max(utilities, key=utilities.get)
    optimal_actions[state] = optimal_action

print(optimal_actions)
Running the code gives the result {(1, 0): 'UP', (3, 2): 'DOWN', (4, 1): 'LEFT'}.
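A small usage sketch (assuming it is run after the code above) prints the expected utility of each action from one green state, which makes it easy to see why 'UP' wins at (1, 0). For each action the value computed is reward + 0.8 * U(intended cell) + 0.1 * U(each perpendicular cell), where a blocked or off-grid move keeps the agent in place.

state = (1, 0)
for action in ['UP', 'DOWN', 'LEFT', 'RIGHT']:
    u = calculate_utility(state, action, grid_utilities)
    print(f"{action:>5}: expected utility = {u:.3f}")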

Task 3:
import pandas as pd  # used to load the dataset in this program
from sklearn.decomposition import FactorAnalysis  # Factor analysis from the scikit-learn library

# Load the dataset
Y = 'kc_house_data_reduced (1).csv'
dta = pd.read_csv(Y)
dta.head()

# Drop the price column from the original dataset and assign the remaining features to X
X = dta.drop('price', axis=1)
print(X)

# Create the Factor Analysis model with 2 components, i.e. reduce the many factors down to two
fact_ana = FactorAnalysis(n_components=2, random_state=0)
fact_ana.fit(X)

# components_ holds the loadings of the two latent components
components = fact_ana.components_

# Format the components for better understanding by fixing the columns and the index
components_d = pd.DataFrame(components, columns=X.columns, index=['Size', 'Quality'])
components_d

           condition     grade   sqft_above  sqft_basement  sqft_living15
Size       -0.044862  0.965322   648.830918     196.716360     558.405490
Quality     0.157787 -0.277271  -395.732030     370.346887    -132.457091

# The first component has large positive loadings on sqft_above, sqft_basement and sqft_living15, which capture the size of the house, while the second component reflects the quality of the house through condition and grade.
# Factor analysis helps to find the latent variables in the data and serves as a dimensionality reduction technique.
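As a minimal follow-up sketch (not in the original code), the fitted model's transform() can project each house onto the two latent factors, so every row gets a 'Size' and a 'Quality' score; it assumes fact_ana and X from the code above.

scores = fact_ana.transform(X)
scores_df = pd.DataFrame(scores, columns=['Size', 'Quality'])
print(scores_df.head())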

Task 4
import pandas as pd

# Load the dataset


file_path = 'bs140513_032310.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
data.head()
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Data Preprocessing
# Encode categorical variables
le = LabelEncoder()
categorical_cols = ['customer', 'age', 'gender', 'zipcodeOri', 'merchant', 'zipMerchant', 'category']
for col in categorical_cols:
    data[col] = le.fit_transform(data[col])

# Splitting the dataset into features and target variable


X = data.drop('fraud', axis=1)
y = data['fraud']

# Splitting the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the Random Forest Classifier


model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predicting and Evaluating the Model


y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix = cm,display_labels=model.classes_).plot()
plt.show()

accuracy = accuracy_score(y_test, y_pred)
print('accuracy: {}'.format(accuracy))

report = classification_report(y_test, y_pred)
print('classification report:\n {}\n'.format(report))

accuracy, report

accuracy: 0.996207821473316
classification report:
               precision    recall  f1-score   support

           0        1.00      1.00      1.00    117512
           1        0.90      0.76      0.83      1417

    accuracy                            1.00    118929
   macro avg        0.95      0.88      0.91    118929
weighted avg        1.00      1.00      1.00    118929
The above results indicate that fraudulent transactions were detected with a high level of accuracy.
Explanation:
In this project pandas is imported to read the data file supplied by the user.
A Random Forest classifier is used for this task. The reason for choosing Random Forest over a K-Nearest Neighbours classifier is that KNN becomes difficult to use on a large dataset: it has to store the whole dataset, needs more time at prediction, and does not scale well. Random Forest, by contrast, handles imbalanced data reasonably well, combines many decision trees into a single model, and is capable of capturing complex interactions. Logistic regression does not handle non-linear relationships in the data as well as a Random Forest, and while Random Forest can still perform well on an unbalanced dataset, logistic regression tends to struggle more with this type of data.
In the code, the categorical columns are first converted to numerical values with scikit-learn's LabelEncoder so that the algorithm can use them in the subsequent classification. After that the fraud column is separated out as the target variable y, and the remaining columns are kept as the feature matrix X by dropping fraud from the dataset. (A StandardScaler could additionally be applied to the features so that the model is not skewed or biased towards certain features just because of their scales or units of measurement, although a Random Forest is largely insensitive to feature scaling.)
The data is then split into a training set and a testing set, and a random state of 42 is used to ensure that the same random split is reproduced every time.
A Random Forest classifier is then trained on the training data and used to predict on the test data.
Model evaluation is done with the confusion matrix and the classification report from scikit-learn. The output shows that about 1,082 of the 1,417 fraudulent transactions in the test set were detected correctly, i.e. a recall of roughly 76 percent for the fraud class, and the confusion matrix is plotted as a graph.
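Since only about 1.2% of the transactions in the test set are fraud, one hedged variation (not part of the submitted code) is to set class_weight='balanced' on the Random Forest, which re-weights the minority class during training and may trade some precision for higher recall on class 1; it assumes the X_train/X_test split and imports from the code above.

balanced_model = RandomForestClassifier(class_weight='balanced', random_state=42)
balanced_model.fit(X_train, y_train)
print(classification_report(y_test, balanced_model.predict(X_test)))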
