DATA SCIENCE LAB
KARAMADAI, COIMBATORE
DEPARTMENT OF
Name :
Register Number :

EX. NO. | DATE | NAME OF THE EXPERIMENT | PAGE NO. | MARK | SIGN
5 | | Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the IRIS dataset | | |
AIM:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and pandas packages.
ALGORITHM:
Step 1: Go to the command prompt.
Step 2: Type pip install numpy.
Step 3: The NumPy package gets installed.
Step 4: Type pip install scipy; the SciPy package gets installed.
Step 5: Type pip install jupyter; the Jupyter packages get installed.
Step 6: Type pip install statsmodels; the statsmodels package gets installed.
Step 7: Type pip install pandas; the pandas package gets installed.
INSTALLATION PROCESS:
NumPy installation: pip install numpy
SciPy installation: pip install scipy
Jupyter installation: pip install jupyter
Statsmodels installation: pip install statsmodels
Pandas installation: pip install pandas
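A simple way to explore whether the installations worked is to import each package and print its version (Jupyter can be checked separately with jupyter --version at the command prompt). A minimal check of this kind is shown below; the version numbers printed will depend on your environment.

import numpy, scipy, pandas, statsmodels

# print the installed version of each package
print("NumPy:", numpy.__version__)
print("SciPy:", scipy.__version__)
print("pandas:", pandas.__version__)
print("statsmodels:", statsmodels.__version__)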
RESULT:
Thus the NumPy, SciPy, Jupyter, Statsmodels and pandas packages were downloaded, installed and explored successfully.
AIM:
To write a Python code to implement the concept of NumPy arrays.
ALGORITHM:
Step 1: Import the numpy library.
Step 2: Create 1-D and 2-D NumPy arrays and an array of random values.
Step 3: Perform element-wise multiplication and matrix multiplication.
Step 4: Compute aggregate values (sum, mean, maximum, minimum) of an array.
Step 5: Print the results.
PROGRAM:
import numpy as np
# create a 1-D NumPy array
a = np.array([1, 2, 3, 4, 5])
# create a 2-D NumPy array
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# create an array of 5 random values in [0, 1)
c = np.random.rand(5)
# element-wise product and matrix product
g = np.array([1, 2, 3]) * np.array([4, 5, 6])
h = np.array([[1, 2], [3, 4]]) @ np.array([[5, 6], [7, 8]])
# aggregates of a 1-D array
i = np.array([1, 2, 3, 4, 5, 6])
# print the results
print("1-D Array:"); print(a)
print("2-D Array:"); print(b)
print("Random Array:"); print(c)
print("Element-wise multiplication:"); print(g)
print("Matrix multiplication:"); print(h)
print(i.sum()); print(i.mean()); print(i.max()); print(i.min())
OUTPUT:
1-D Array:
[1 2 3 4 5]
2-D Array:
[[1 2 3]
[4 5 6]
[7 8 9]]
Random Array:
[0.21995867 0.92288075 0.69384057 0.7043604 0.80637838]
Element-wise multiplication:
[ 4 10 18]
Matrix multiplication:
[[19 22]
 [43 50]]
21
3.5
6
1
RESULT:
Thus the program working with NumPy arrays was executed successfully.
AIM:
To write a Python code to work with pandas DataFrames.
ALGORITHM:
Step 1: Import the pandas library.
Step 2: Load data into a DataFrame.
Step 3: Explore the DataFrame.
Step 4: Select data.
Step 5: Manipulate data.
Step 6: Clean data.
Step 7: Save the modified DataFrame.
PROGRAM:
import pandas as pd
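Only the import statement of this program survives in the record. The following is a minimal sketch reconstructed from the OUTPUT shown below: the DataFrame contents are taken from that output, while the specific calls (set_index, sort_index, to_csv and the file name 'output.csv') are assumptions.

# create a DataFrame (values taken from the output shown below)
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary'],
                   'Age': [30, 25, 40, 35]})
print(df)  # explore the DataFrame

# manipulate data: add a Salary column, then select and display data
df['Salary'] = [50000, 60000, 70000, 80000]
print(df.set_index('Name').sort_index())

# a second DataFrame used later in the output
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                       'Age': [25, 30, 35, 40, 45],
                       'Gender': ['Female', 'Male', 'Male', 'Male', 'Female']})
print(people)

# save the modified DataFrame (file name assumed)
df.to_csv('output.csv', index=False)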
OUTPUT:
Name Age
0 John 30
1 Jane 25
2 Bob 40
3 Mary 35
Age Salary
Name
Bob 40 70000
Jane 25 60000
John 30 50000
Mary 35 80000
Name Age Gender
0 Alice 25 Female
1 Bob 30 Male
2 Charlie 35 Male
3 David 40 Male
4 Eve 45 Female
RESULT:
Thus the program working with the pandas DataFrame was executed successfully.
AIM:
To write a Python code to implement data sampling methods.
ALGORITHM:
Step 1: Import pandas, numpy and random.
Step 2: Generate a sample dataset with two columns, 'A' and 'B'.
Step 3: Implement four different sampling methods (as in the program below):
Simple random sampling: use the sample() method from pandas to randomly select 30 rows from the dataset.
Systematic sampling: select every 10th row of the dataset using Python's slicing syntax.
Stratified sampling: group the dataset by quantiles of the 'B' column using the groupby() method from pandas, then select a random sample of 20% of the rows from each group using the sample() method.
Cluster sampling: randomly select 5 clusters of values from the 'A' column and include all rows with those values (using isin() in the program below).
Step 4: Output the results of each sampling method using the print() function.
PROGRAM:
import random
import numpy as np
import pandas as pd
# generate a sample dataset with two columns, 'A' and 'B'
data = pd.DataFrame({'A': range(1, 101), 'B': np.random.rand(100)})
# Simple random sampling: 30 randomly chosen rows
random_sample = data.sample(n=30)
# Systematic sampling: every 10th row
systematic_sample = data.iloc[::10, :]
# Stratified sampling: 20% of the rows from each tercile of 'B'
strata = data.groupby(pd.qcut(data['B'], 3))
stratified_sample = strata.apply(lambda x: x.sample(frac=0.2))
# Cluster sampling: all rows whose 'A' value falls in 5 randomly chosen clusters
clusters = random.sample(list(data['A'].unique()), 5)
cluster_sample = data[data['A'].isin(clusters)]
print("Simple random sample:\n", random_sample)
print("Systematic sample:\n", systematic_sample)
print("Stratified sample:\n", stratified_sample)
print("Cluster sample:\n", cluster_sample)
OUTPUT:
Simple random sample:
A B
79 80 0.878277
40 41 0.639264
57 58 0.897447
58 59 0.600354
13 14 0.661578
95 96 0.246993
7 8 0.934867
94 95 0.812213
24 25 0.837017
49 50 0.186842
0 1 0.940231
42 43 0.394464
33 34 0.793838
60 61 0.181043
54 55 0.190086
56 57 0.773640
74 75 0.228341
4 5 0.514767
34 35 0.640982
87 88 0.102709
53 54 0.594242
23 24 0.689938
72 73 0.800255
52 53 0.898425
65 66 0.530389
61 62 0.322569
77 78 0.029112
80 81 0.596407
35 36 0.699136
99 100 0.637643
Systematic sample:
A B
0 1 0.940231
10 11 0.721191
20 21 0.242574
30 31 0.275564
40 41 0.639264
50 51 0.663985
60 61 0.181043
70 71 0.409256
80 81 0.596407
90 91 0.133356
Stratified sample:
A B
B
(0.012400000000000001, 0.334] 60 61 0.181043
69 70 0.177240
96 97 0.124787
87 88 0.102709
19 20 0.122518
31 32 0.118424
93 94 0.152851
(0.334, 0.641] 42 43 0.394464
80 81 0.596407
14 15 0.444850
83 84 0.633580
75 76 0.475987
82 83 0.416136
66 67 0.340407
(0.641, 0.967] 44 45 0.814840
28 29 0.836442
46 47 0.680723
32 33 0.653128
57 58 0.897447
86 87 0.837541
10 11 0.721191
Cluster sample:
A B
2 3 0.621991
3 4 0.576675
4 5 0.514767
6 7 0.308789
9 10 0.013366
RESULT:
Thus the implementation of the sampling methods was executed successfully.
EX: NO: 5
DATE:
READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET
AIM:
To read data from text files, Excel and the web and to explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
Step 1: Import the pandas library as pd and the requests library.
Step 2: From the io library, import the BytesIO function.
Step 3: Read data from a text file called iris.txt using the pd.read_csv() function. Assign
the resulting DataFrame to iris_txt. The file has no header row, so header=None is passed
as an argument. The column names are specified as a list of strings using the names
argument.
Step 4: Read data from an Excel file called iris.xlsx using the pd.read_excel()
function. Assign the resulting DataFrame to iris_excel.
Step 5: Read data from a CSV file from the web using the requests.get() function to
retrieve the file contents, and then pass the contents to the pd.read_csv() function using
BytesIO to create a file-like object. Assign the resulting DataFrame to iris_web. The file
has no header row, so header=None is passed as an argument. The column names are
specified as a list of strings using the names argument.
Step 6: Concatenate the three DataFrames using pd.concat(), and assign the result to
iris. ignore_index=True is passed as an argument to reset the index of the concatenated
DataFrame.
Step 7: Display the descriptive statistics of the entire dataset using iris.describe().
Step 8: Group the data by species and display the mean values for each species
using iris.groupby('species').mean().
Step 9: Create a box plot for each variable by species using
iris.boxplot(by='species', figsize=(10, 8)).
PROGRAM:
import pandas as pd
import requests
from io import BytesIO
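The statements after the imports did not survive in this record. The sketch below follows Steps 3-9 of the algorithm; the column names and the web URL (the UCI Iris data file) are assumptions, as is the use of matplotlib to display the box plot.

import matplotlib.pyplot as plt

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Step 3: read the local text file (no header row)
iris_txt = pd.read_csv('iris.txt', header=None, names=cols)

# Step 4: read the local Excel file
iris_excel = pd.read_excel('iris.xlsx')

# Step 5: read a CSV file from the web (URL assumed)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
response = requests.get(url)
iris_web = pd.read_csv(BytesIO(response.content), header=None, names=cols)

# Step 6: concatenate the three DataFrames
iris = pd.concat([iris_txt, iris_excel, iris_web], ignore_index=True)

# Step 7: descriptive statistics of the entire dataset
print(iris.describe())

# Step 8: mean values of each variable for each species
print(iris.groupby('species').mean())

# Step 9: box plot of each variable by species
iris.boxplot(by='species', figsize=(10, 8))
plt.show()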
OUTPUT:
AIM:
ALGORITHM:
PROGRAM:
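No program is recorded for this exercise; only the output below remains. A minimal sketch that produces output in this format, assuming scikit-learn's built-in Iris data, is:

import pandas as pd
from sklearn import datasets

# load the built-in Iris data and wrap it in a DataFrame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# display the first few rows (column names match the output below)
print(iris_df.head())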
OUTPUT:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
AIM:
To use the diabetes data set from UCI and the Pima Indians diabetes data set and perform the following:
a) Implement Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis from UCI dataset.
b) Bivariate analysis: Linear and Logistic Regression Modeling.
c) Multiple Regression Analysis.
ALGORITHM:
Step 1: Download the Pima Indians Diabetes dataset.
Link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download
Step 2: Install the packages.
Step 3: Open PyCharm and type the following commands.
Step 4: The output will be displayed.
PROGRAM:
a) Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis
import statistics as st

# load the Pima Indians Diabetes dataset (file name assumed)
diabetes_df = pd.read_csv('diabetes.csv')

# compute univariate statistics for every column
for col in diabetes_df.columns:
    print("Column:", col)
    # Frequency
    freq = diabetes_df[col].value_counts()
    # Mean
    mean = diabetes_df[col].mean()
    # Median
    median = diabetes_df[col].median()
    # Mode
    mode = diabetes_df[col].mode()
    # Variance
    variance = diabetes_df[col].var()
    # Standard deviation
    std_dev = diabetes_df[col].std()
    # Skewness
    skewness = skew(diabetes_df[col])
    # Kurtosis
    kurt = kurtosis(diabetes_df[col])
    print("Frequency:", freq)
    print("Mean:", mean)
    print("Median:", median)
    print("Mode:", mode)
    print("Variance:", variance)
    print("Standard Deviation:", std_dev)
    print("Skewness:", skewness)
    print("Kurtosis:", kurt)
Output:
Column: Outcome
Frequency: Outcome
0 500
1 268
Name: count, dtype: int64
Mean: 0.3489583333333333
Median: 0.0
Mode: 0 0
Name: Outcome, dtype: int64
Variance: 0.22748261625380273
Standard Deviation: 0.47695137724279896
Skewness: 0.6337757030614577
Kurtosis: -1.5983283582089547
(768, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
Pregnancies 3.845052
Glucose 120.894531
BloodPressure 69.105469
SkinThickness 20.536458
Insulin 79.799479
BMI 31.992578
DiabetesPedigreeFunction 0.471876
Age 33.240885
Outcome 0.348958
dtype: float64
Pregnancies 3.0000
Glucose 117.0000
BloodPressure 72.0000
SkinThickness 23.0000
Insulin 30.5000
BMI 32.0000
DiabetesPedigreeFunction 0.3725
Age 29.0000
Outcome 0.0000
dtype: float64
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 1.0 99 70.0 0.0 0.0 32.0
1 NaN 100 NaN NaN NaN NaN
LINEAR REGRESSION:
PROGRAM:
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split  # to split our data into training and testing sets
diabetes = datasets.load_diabetes()
diabetes.keys()
df = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])
x = df
y = diabetes['target']
# splitting our data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
# fit a linear regression model so that the weights and intercept can be printed
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
print("\nWeights :", model.coef_)
print("\nIntercept", model.intercept_)
Output:
LOGISTIC REGRESSION:
PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets,linear_model
from sklearn.metrics import mean_squared_error
diabetes=datasets.load_diabetes()
diabetes.keys()
df=pd.DataFrame(diabetes['data'],columns=diabetes['feature_names'])
x=df
y=diabetes['target']
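The statements that fit the model and compute the figures in the output below were not captured in this record. A minimal sketch, assuming the continuous target is binarized at its median so that a logistic regression can be fitted, and that RMSE and r^2 are then reported with sklearn.metrics, is given here; it will not reproduce the exact numbers shown below.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# binarize the target at its median (assumption) so classification applies
y_bin = (y > np.median(y)).astype(int)
x_train, x_test, y_train, y_test = train_test_split(x, y_bin, test_size=0.3, random_state=101)

# fit the logistic regression model and predict on the test set
log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_train, y_train)
pred = log_model.predict(x_test)

# report r^2 and RMSE of the predictions
print("r^2 :", r2_score(y_test, pred))
print("RMSE :", np.sqrt(mean_squared_error(y_test, pred)))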
Output:
r^2 : -0.44401265478624397
RMSE : 94.65723681369009
RMSE : 58.00932552866432
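The AIM also lists c) Multiple Regression Analysis, but no program for it appears in this record. A minimal sketch, assuming the Pima Indians diabetes CSV (file name diabetes.csv) and an ordinary least squares fit of Glucose on several of the other columns with statsmodels, could look like this:

import pandas as pd
import statsmodels.api as sm

# load the Pima Indians Diabetes dataset (file name assumed)
diabetes_df = pd.read_csv('diabetes.csv')

# multiple regression: model Glucose as a function of several predictors
X = sm.add_constant(diabetes_df[['BMI', 'Age', 'Insulin', 'BloodPressure']])
y = diabetes_df['Glucose']

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, R-squared, p-values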
AIM:
To apply and explore various plotting functions on UCI data sets.
a) Normal Curves.
b) Density and Contour Plots.
c) Correlation and Scatter Plots.
d) Histograms.
e) Three Dimensional Plotting.
ALGORITHM:
Step 1: Download the Heart dataset from Kaggle.
Link: https://www.kaggle.com/datasets/zhaoyingzhu/heartcsv
Step 2: Save it in Downloads or any other folder and install the packages.
Step 3: Apply the following commands on the dataset.
Step 4: The output will be displayed.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt

# load the dataset (file name assumed)
dataset = pd.read_csv('iris.csv')

# Plot a bar chart of the mean petal width for each class
class_means = dataset.groupby('class')['petal-width'].mean()
class_means.plot(kind='bar')
plt.title('Mean Petal Width for Each Class')
plt.xlabel('Class')
plt.ylabel('Mean Petal Width')
plt.show()
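The recorded program only covers the bar chart above, while the AIM also asks for normal curves, density and contour plots, correlation and scatter plots, histograms and three-dimensional plotting. A minimal sketch against the Heart dataset is given below; the file name Heart.csv and the column names Age, Chol and Thalach are assumptions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm, gaussian_kde

heart = pd.read_csv('Heart.csv').dropna()   # file and column names assumed
age, chol, thalach = heart['Age'], heart['Chol'], heart['Thalach']

# a) Normal curve fitted to Age
grid = np.linspace(age.min(), age.max(), 200)
plt.plot(grid, norm.pdf(grid, age.mean(), age.std()))
plt.title('Normal Curve of Age')
plt.show()

# b) Density plot of Age and a contour plot of the joint density of Age and Chol
age.plot(kind='density', title='Density of Age')
plt.show()
xi, yi = np.mgrid[age.min():age.max():100j, chol.min():chol.max():100j]
zi = gaussian_kde(np.vstack([age, chol]))(np.vstack([xi.ravel(), yi.ravel()]))
plt.contour(xi, yi, zi.reshape(xi.shape))
plt.title('Contour of the joint density of Age and Chol')
plt.show()

# c) Correlation matrix and a scatter plot
print(heart.select_dtypes(include='number').corr())
plt.scatter(age, chol)
plt.xlabel('Age'); plt.ylabel('Chol')
plt.show()

# d) Histogram of Chol
plt.hist(chol, bins=20)
plt.title('Histogram of Chol')
plt.show()

# e) Three-dimensional scatter plot
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(age, chol, thalach)
ax.set_xlabel('Age'); ax.set_ylabel('Chol'); ax.set_zlabel('Thalach')
plt.show()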
OUTPUT:
AIM:
To create an insight Geographic Data with Basemap.
ALGORITHM:
Step 1: Install Basemap; if it downloads as a zip file, extract it.
Step 2: Import the packages.
Step 3: Save the files in Downloads or any other folder.
Step 4: Apply the following commands.
Step 5: The output will be displayed.
PROGRAM:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5)
m.etopo(scale=0.5, alpha=0.5)
RESULT:
Thus the insight into geographic data was created with Basemap successfully.
AIM
To write a Python program to perform arithmetic operations between two pandas Series.
ALGORITHM
STEP 1: Start
STEP 2: Import pandas package
STEP 3: Initialise ds1 and ds2
STEP 4: For addition, calculate ds1+ds2
STEP 5: For subtraction, calculate ds1-ds2
STEP 6: For multiplication, calculate ds1*ds2
STEP 7: For division, calculate ds1/ds2
STEP 8: Print the desired results
STEP 9: Stop
PROGRAM
import pandas as pd
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])
print("Add two series")
ds = ds1 + ds2
print(ds)
print("Subtract two series")
ds = ds1 - ds2
print(ds)
print("Multiply two series")
ds = ds1 * ds2
print(ds)
print("Divide two series")
ds = ds1 / ds2
print(ds)
OUTPUT:
Add two series
0 3
1 7
2 11
3 15
4 19
dtype: int64
Subtract two series
0 1
1 1
2 1
3 1
4 1
dtype: int64
Multiply two series
0 2
1 12
2 30
3 56
4 90
dtype: int64
Divide two series
0 2.000000
1 1.333333
2 1.200000
3 1.142857
4 1.111111
dtype: float64
RESULT
Thus the program to perform arithmetic operations between two pandas Series has been executed successfully.
EX: NO:
DATE:
SCATTER PLOTS USING MATPLOTLIB AND SEABORN WITH POKEMON DATASET
AIM
To perform scatter plots in Python using the Matplotlib and Seaborn libraries with the Pokemon dataset.
ALGORITHM:
Step 1: Download the Pokemon dataset from Kaggle.
Link: https://www.kaggle.com/datasets/rounakbanik/pokemon
Step 2: Save it in Downloads or any other folder and install the packages.
Step 3: Apply the following commands on the dataset.
Step 4: The output will be displayed.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("pokemon.csv")
data.shape
data.head()
g1 = data.loc[data.generation==1,:]
# dataframe.plot.scatter() method
g1.plot.scatter('attack', 'defense');
# The ';' is to avoid showing a message before showing the plot
# plt.scatter() function
plt.scatter('attack', 'defense', data=g1);
g1.plot.scatter('attack', 'defense', s = 40, c = 'orange', marker = 's', figsize=(8,5.5));
plt.figure(figsize=(10,7)) # Specify size of the chart
plt.scatter('attack', 'defense', data=data[data.is_legendary==1], marker = 'x', c = 'magenta')
plt.scatter('attack', 'defense', data=data[data.is_legendary==0], marker = 'o', c = 'blue')
plt.legend(('Yes', 'No'), title='Is legendary?')
plt.show()
plt.figure(figsize=(10,7))
sns.scatterplot(x = 'attack', y = 'defense', s = 70, hue ='is_legendary', data=data);
# hue represents color
plt.figure(figsize=(10,7))
sns.scatterplot(x = 'attack', y = 'defense', s = 50, hue = 'is_legendary', style ='is_legendary',
data=data);
# style represents marker
plt.figure(figsize=(11,7))
sns.scatterplot(x = 'attack', y = 'defense', s = 50, hue = 'type1', data=data)
plt.legend(bbox_to_anchor=(1.02, 1))
# move legend to outside of the chart
plt.title('Defense vs Attack for All Pokemons', fontsize=16)
plt.xlabel('Attack', fontsize=12)
plt.ylabel('Defense', fontsize=12)
plt.show()
water = data[data.type1 == 'water']
water.plot.scatter('height_m', 'weight_kg', figsize=(10,6))
plt.grid(True) # add gridlines
plt.show()
water.plot.scatter('height_m', 'weight_kg', figsize=(10,6))
plt.grid(True)
for index, row in water.nlargest(5, 'height_m').iterrows():
    plt.annotate(row['name'],                                     # text to show
                 xy=(row['height_m'], row['weight_kg']),          # the point to annotate
                 xytext=(row['height_m']+0.2, row['weight_kg']),  # where to show the text
                 fontsize=12)
plt.xlim(0, )  # x-axis has minimum 0
plt.ylim(0, )  # y-axis has minimum 0
plt.show()