FUNDAMENTALS OF DATA SCIENCE LAB MANUAL

NumPy is one of the most fundamental libraries in Python and perhaps the most useful of them all. NumPy handles large datasets effectively and efficiently. I can see your eyes glinting at the prospect of mastering NumPy already. As a data scientist or an aspiring data science professional, you need to have a solid grasp of NumPy and how it works in Python. NumPy stands for Numerical Python and is one of the most useful scientific libraries in Python programming. It provides support for large multi-dimensional array objects and various tools to work with them. Various other libraries like Pandas, Matplotlib, and Scikit-learn are built on top of this amazing library.

Arrays are a collection of elements/values that can have one or more dimensions. An array of one dimension is called a Vector, while an array of two dimensions is called a Matrix. NumPy arrays are called ndarrays, or N-dimensional arrays, and they store elements of the same type and size. NumPy is known for its high performance and provides efficient storage and data operations as arrays grow in size. NumPy comes pre-installed when you download Anaconda. But if you want to install NumPy separately on your machine, just type the below command in your terminal: pip install numpy

Now you need to import the library: import numpy as np

np is the de facto abbreviation for NumPy used by the data science community.

Python Lists vs NumPy Arrays –What’s the Difference?

If you’re familiar with Python, you might be wondering why we use NumPy arrays when we already have Python lists. After all, these Python lists act as an array that can store elements of various types. This is a perfectly valid question, and the answer to it is hidden in the way Python stores an object in memory.

A Python object is actually a pointer to a memory location that stores all the details about the object, like its bytes and value. Although this extra information is what makes Python a dynamically typed language, it also comes at a cost, which becomes apparent when storing a large collection of objects, like in an array.

Python lists are essentially an array of pointers, each pointing to a location that contains the information related to the element. This adds a lot of overhead in terms of memory and computation. And most of this information is rendered redundant when all the objects stored in the list are of the same type!

To overcome this problem, we use NumPy arrays that contain only homogeneous elements, i.e. elements having the
same datatype.This makes it more efficient at storing and manipulating the array. This difference becomes apparent
when the array has a large number of elements, say thousands or millions. Also, with NumPy arrays, you can perform
element-wise operations, something which is not possible using Python lists!

This is the reason why NumPy arrays are preferred over Python lists when performing mathematical operations on a large
amount of data.
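If you want to see this for yourself, here is a minimal sketch (not part of the original manual) that times the same element-wise operation on a Python list and a NumPy array. The exact numbers depend on your machine, but the ndarray version is typically much faster:

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.perf_counter()
doubled_list = [x * 2 for x in py_list]   # element by element, through object pointers
list_time = time.perf_counter() - start

start = time.perf_counter()
doubled_arr = np_arr * 2                  # one vectorized, element-wise operation
arr_time = time.perf_counter() - start

print('list: %.4fs, ndarray: %.4fs' % (list_time, arr_time))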

Experiment 1. Creating a NumPy Array

1(a) Basic nd-array


NumPy arrays are very easy to create given the complex problems they solve. To create a very basic nd-array, you use the np.array() method. All you have to pass are the values of the array as a list:
In [1]:
import numpy as np
np.array([1,2,3,4])

Out[1]:

array([1, 2, 3, 4])

This array contains integer values. You can specify the type of data in the data type argument:

In [2]:

np.array([1,2,3,4],dtype=np.float32)

Out[2]:

array([1., 2., 3., 4.], dtype=float32)

Since NumPy arrays can contain only homogeneous datatypes, values will be upcast if the types do not match:

In the following example, NumPy has upcast integer values to float values.

In [3]:

np.array([1,2.0,3,4])

Out[3]:

array([1., 2., 3., 4.])

NumPy arrays can be multi-dimensional too. A matrix is just a rectangular array of numbers with shape N x M, where N is the number of rows and M is the number of columns in the matrix.

The example below is a 2 x 4 matrix:


In [4]:

np.array([[1,2,3,4],[5,6,7,8]])

Out[4]:

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

1(b) Arrays of Zeros:

NumPy lets you create an array of all zeros using the np.zeros() method. All you have to do is pass the shape of the
desired array:

In [5]:

np.zeros(5)

Out[5]:

array([0., 0., 0., 0., 0.])

The one above is a 1-D array while the one below is a 2-D array of all zeros:

In [6]:

np.zeros((2,3))

Out[6]:

array([[0., 0., 0.],


[0., 0., 0.]])

1(c) Array of ones:

You could also create an array of all 1s using the np.ones() method:

In [7]:

np.ones(5,dtype=np.int32)

Out[7]:

array([1, 1, 1, 1, 1])

1(d) Random numbers in an ndarray

Another very commonly used method to create ndarrays is the np.random.rand() method. It creates an array of a given shape with random values from [0,1):

In [8]:

np.random.rand(2,3)

Out[8]:

array([[0.01952557, 0.24637561, 0.37780528],


[0.32267058, 0.51446159, 0.85915703]])

1(e) An array of your choice

In fact, you can create an array filled with any given value using the np.full() method. Just pass in the shape of the desired array and the value you want:
In [9]:

np.full((2,2),7)

Out[9]:

array([[7, 7],
[7, 7]])

1(f) Identity matrix in NumPy


Another great method is np.eye(), which returns an array with 1s along its diagonal and 0s everywhere else. An Identity matrix is a square matrix that has 1s along its main diagonal and 0s everywhere else.

Below is an Identity matrix of shape 3 x 3.


Note: A square matrix has an N x N shape. This means it has the same number of rows and columns.

Note: A matrix is called the Identity matrix only when the 1s are along the main diagonal and not any other diagonal

In [10]:

np.eye(3)

Out[10]:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
However, NumPy gives you the flexibility to change the diagonal along which the values have to be 1s. You can either move it above the main diagonal (note that the result is then no longer an identity matrix):

In [11]:

np.eye(3,k=1)

Out[11]:
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])

Or move it below the main diagonal:

In [12]:

np.eye(3,k=-2)

Out[12]:

array([[0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.]])

1(g) Evenly spaced nd-array

You can quickly get an evenly spaced array of numbers using the np.arange() method:

In [13]:

np.arange(5)

Out[13]:

array([0, 1, 2, 3, 4])

The start, end, and step size of the interval of values can be explicitly defined by passing in three numbers as arguments for these values respectively. A point to be noted here is that the interval is defined as [start, end), where the last number will not be included in the array:

In [14]:

np.arange(2,10,2)

Out[14]:

array([2, 4, 6, 8])

Alternate elements were printed because the step size was defined as 2. Notice that 10 was not printed, as the end of the interval is excluded.

Another similar function is np.linspace(), but instead of a step size, it takes in the number of samples that need to be retrieved from the interval.

A point to note here is that the last number is included in the values returned, unlike in the case of np.arange().

In [15]:

np.linspace(0,1,5)

Out[15]:

array([0., 0.25, 0.5 , 0.75, 1.])
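A related point (a small sketch, not from the manual): np.linspace() accepts an endpoint parameter, so you can make it exclude the last value just like np.arange() does:

np.linspace(0,1,5,endpoint=False)
# array([0. , 0.2, 0.4, 0.6, 0.8])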

Experiment 2. The Shape and Reshaping of NumPy Array

Once you have created your ndarray, the next thing you would want to do is check the number of axes, shape, and size of the ndarray.


2(a) Dimensions of NumPy array

You can easily determine the number of dimensions or axes of a NumPy array using the ndim attribute. In the following example, the array has two dimensions: 2 rows and 3 columns.

In [16]:

a=np.array([[5,10,15],[20,25,20]])
print('Array:','\n',a)
print('Dimensions:','\n',a.ndim)

Array :
[[ 5 10 15]
 [20 25 20]]
Dimensions :
2

2(b) Shape of NumPy array

The shape is an attribute of the NumPy array that shows how many elements there are along each dimension.

You can further index the returned shape to get the value along each dimension:

In [17]:

a=np.array([[1,2,3],[4,5,6]])
print('Array:','\n',a)
print('Shape:','\n',a.shape)
print('Rows=',a.shape[0])
print('Columns=',a.shape[1])

Array :
[[1 2 3]
 [4 5 6]]
Shape :
(2, 3)
Rows = 2
Columns = 3

2(c) Size of NumPy array

You can determine how many values there are in the array using the size attribute. It just multiplies the number of rows by the number of columns in the ndarray:

In [18]:

a=np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
Size of array : 6
Manual determination of size of array: 6

2(d) Reshaping a NumPy array

Reshaping an ndarray can be done using the np.reshape() method.

It changes the shape of the ndarray without changing the data within the ndarray. In the following example, we reshape the ndarray from a 1-D to a 2-D ndarray.

In [19]:

a=np.array([3,6,9,12])
np.reshape(a,(2,2))

Out[19]:

array([[ 3,  6],
       [ 9, 12]])

While reshaping, if you are unsure about the shape of any of the axis, just input -1. NumPy automatically calculates
the shape when it sees a -1:

In [20]:

a=np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))

Three rows :
[[ 3  6]
 [ 9 12]
 [18 24]]
Three columns :
[[ 3  6  9]
 [12 18 24]]

2(e) Flattening a NumPy array

Sometimes when you have a multidimensional array and want to collapse it to a single-dimensional array, you can either use the flatten() method or the ravel() method:

But an important difference between flatten() and ravel() is that the former returns a copy of the original array while the
latter returns a reference to the original array. This means any changes made to the array returned from ravel() will also
be reflected in the original array while this will not be the case with flatten().

In [21]:

a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :', b.shape)
print('Array :','\n', b)
print('Shape after ravel :', c.shape)
print('Array :','\n', c)

Original shape : (2, 2)
Array :
[[1. 1.]
 [1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]
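To see the copy-versus-view difference in action, here is a small sketch (not part of the manual) where a write through the ravel() result changes the original array while the flatten() copy stays untouched:

import numpy as np

a = np.ones((2,2))
flat = a.flatten()   # independent copy of the data
rav = a.ravel()      # view on the same data (where possible)

rav[0] = 99          # write through the view...
print(a)             # ...and the original changes: [[99.  1.] [ 1.  1.]]
print(flat)          # the flattened copy is untouched: [1. 1. 1. 1.]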

2(f) Transpose of a NumPy array

Another very interesting reshaping method of NumPy is the transpose() method. It takes the input array and swaps the rows with the column values, and the column values with the values of the rows:

On transposing a 2 x 3 array, we get a 3 x 2 array. Transpose has a lot of significance in linear algebra.

In [22]:

a=np.array([[1,2,3],
            [4,5,6]])
b=np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Expand along columns :','\n','Shape',b.shape,'\n',b)

Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Expand along columns :
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]

Experiment 3. Expanding and Squeezing a NumPy Array

3(a) Expanding a NumPy array

You can add a new axis to an array using the expand_dims() method by providing the array and the axis along which to
expand:

In [23]:

a=np.array([1,2,3])
b=np.expand_dims(a,axis=0)
c=np.expand_dims(a,axis=1)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
print('Expand along rows:','\n','Shape',c.shape,'\n',c)
Original:
Shape (3,)
[1 2 3]
Expand along columns:
Shape (1, 3)
[[1 2 3]]
Expand along rows:
Shape (3, 1)
[[1]
 [2]
 [3]]

3(b) Squeezing a NumPy array

On the other hand, if you instead want to reduce the axes of the array, use the squeeze() method. It removes any axis that has a single entry. This means if you have created a 2 x 2 x 1 matrix, squeeze() will remove the third dimension from the matrix:

In [24]:

a=np.array([[[1,2,3],
[4,5,6]]])
b=np.squeeze(a,axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)

Original
Shape (1, 2, 3)
[[[1 2 3]
[4 5 6]]]
Squeeze array:
Shape (2, 3)
[[1 2 3]
[4 5 6]]

3(c) Sorting in NumPy Arrays

Sorting is an important and very basic operation that you might well use on a daily basis as a data scientist. So, it is important to use a good sorting algorithm with minimum time complexity.

The NumPy library is a legend when it comes to sorting elements of an array. It has a range of sorting functions that you can use to sort your array elements. It has implemented quicksort, heapsort, mergesort, and timsort for you under the hood when you use the sort() method:

a = np.array([1,4,2,5,3,6,8,7,9])
np.sort(a, kind='quicksort')

In [25]:

a=np.array([[5,6,7,4],
            [9,2,3,7]])
# sort along the column
print('Sort along column :','\n',np.sort(a,kind='mergesort',axis=1))

Sort along column :
[[4 5 6 7]
 [2 3 7 9]]

In [26]:

# sort along the row
print('Sort along row :','\n',np.sort(a,kind='mergesort',axis=0))

Sort along row :
[[5 2 3 4]
 [9 6 7 7]]
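Closely related (a small sketch, not in the manual): np.argsort() returns the indices that would sort the array, which is handy when you need to reorder another array in the same way:

import numpy as np

a = np.array([1,4,2,5,3])
order = np.argsort(a)   # indices that would sort a
print(order)            # [0 2 4 1 3]
print(a[order])         # [1 2 3 4 5] -- same result as np.sort(a)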


Experiment 4. Indexing and Slicing of NumPy Array

4(a)Slicing 1-D NumPy arrays

Slicing means retrieving elements from one index to another index. All we have to do is pass the starting and ending points of the slice like this: [start:end].

However, you can take it up another notch by passing a step size. What is that? Well, suppose you wanted to print every other element from the array; you would define your step size as 2, meaning get the element 2 places away from the present index. Incorporating all this into a single index would look something like this: [start:end:step-size].

In [27]:

a = np.array([1,2,3,4,5,6])
print(a[1:5:2])

[2 4]

Notice that the last element did not get considered. This is because slicing includes the start index but excludes the end index. A way around this is to write the next higher index as the final index value you want to retrieve:

In [28]:

a = np.array([1,2,3,4,5,6])
print(a[1:6:2])

[2 4 6]

If you don’t specify the start or end index, they are taken as 0 or the array size, respectively, by default. And the step size by default is 1.

In [29]:

a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])

[1 3 5]
[2 4 6]
[2 3 4 5 6]

4(b) Slicing 2-D NumPy arrays

Now, a 2-D array has rows and columns, so it can get a little tricky to slice 2-D arrays. But once you understand it, you can slice an array of any dimension!

Before learning how to slice a 2-D array, let’s have a look at how to retrieve an element from a 2-D array:


In [30]:

a=np.array([[1,2,3],
[4,5,6]])
print(a[0,0])
print(a[1,2])
print(a[1,0])

1
6
4

Here, we provided the row value and column value to identify the element we wanted to extract. While in a 1-D array, we were only providing the column value, since there was only 1 row.

So, to slice a 2-D array, you need to mention the slices for both the row and the column:
In [31]:

a=np.array([[1,2,3],[4,5,6]])
# print first row values
print('First row values :','\n',a[0:1,:])
# with step-size for columns
print('Alternate values from first row :','\n',a[0:1,::2])
# print second column values
print('Second column values :','\n',a[:,1::2])
print('Arbitrary values :','\n',a[0:1,1:3])

First row values :
[[1 2 3]]
Alternate values from first row :
[[1 3]]
Second column values :
[[2]
 [5]]
Arbitrary values :
[[2 3]]

4(c) Slicing 3-D NumPy arrays


So far we haven’t seen a 3-D array. Let’s first visualize what a 3-D array looks like:

In [32]:

a=np.array([[[1,2],[3,4],[5,6]],        # first axis array
            [[7,8],[9,10],[11,12]],     # second axis array
            [[13,14],[15,16],[17,18]]]) # third axis array
# 3-D array
print(a)
[[[ 1 2]
[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]

[[13 14]
[15 16]
[17 18]]]

In [33]:

# value
print('First array, first row, first column value :','\n',a[0,0,0])
print('First array last column :','\n',a[0,:,1])
print('First two rows for second and third arrays :','\n',a[1:,0:2,0:2])

First array, first row, first column value :
1
First array last column :
[2 4 6]
First two rows for second and third arrays :
[[[ 7  8]
  [ 9 10]]

 [[13 14]
  [15 16]]]

In [34]:

print('Printing as a single array :','\n',a[1:,0:2,0:2].flatten())

Printing as a single array :
[ 7  8  9 10 13 14 15 16]

4(d) Negative slicing of NumPy arrays

An interesting way to slice your array is to use negative slicing. Negative slicing prints elements from the end rather than the beginning. Have a look below:

In [35]:

a=np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])

[ 5 10]

Here, the last values for each row were printed. If, however, we wanted to extract from the end, we would have to explicitly provide a negative step size, otherwise the result would be an empty list.

In [36]:

print(a[:,-1:-3:-1])

[[ 5  4]
 [10  9]]

Having said that, the basic logic of slicing remains the same, i.e. the end index is never included in the output. An interesting use of negative slicing is to reverse the original array.

In [37]:

a=np.array([[1,2,3,4,5],
            [6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array :','\n',a[::-1,::-1])

Original array :
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Reversed array :
[[10  9  8  7  6]
 [ 5  4  3  2  1]]

In [38]:

a=np.array([[1,2,3,4,5],[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array vertically :','\n',np.flip(a,axis=1))
print('Reversed array horizontally :','\n',np.flip(a,axis=0))

Original array :
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Reversed array vertically :
[[ 5  4  3  2  1]
 [10  9  8  7  6]]
Reversed array horizontally :
[[ 6  7  8  9 10]
 [ 1  2  3  4  5]]

Experiment 5: Stacking and Concatenating Numpy Arrays

5(a) Stacking ndarrays

You can create a new array by combining existing arrays. This you can do in two ways:

• Either combine the arrays vertically (i.e. along the rows) using the vstack() method, thereby increasing the number of rows in the resulting array

• Or combine the arrays in a horizontal fashion (i.e. along the columns) using the hstack() method, thereby increasing the number of columns in the resultant array

In [39]:

a=np.arange(0,5)
b=np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))

Array 1 :
[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
 [5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]

Another interesting way to combine arrays is using the dstack() method. It combines array elements index by index and stacks them along the depth axis:

In [40]:

a=[[1,2],[3,4]]
b=[[5,6],[7,8]]
c=np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)

Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]

Dstack :
[[[1 5]
  [2 6]]

 [[3 7]
  [4 8]]]
(2, 2, 2)

5(b) Concatenating ndarrays


While stacking arrays is one way of combining old arrays to get a new one, you could also use the concatenate() method, where the passed arrays are joined along an existing axis:

In [41]:

a=np.arange(0,5).reshape(1,5)
b=np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))

Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
 [5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]

The drawback of this method is that the original arrays must have the axis along which you want to combine. Otherwise, get ready to be greeted by an error.

Another very useful function is the append() method, which adds new elements to the end of an ndarray. This is obviously useful when you already have an existing ndarray but want to add new values to it.

In [42]:

# append values to ndarray
a=np.array([[1,2],
            [3,4]])
np.append(a,[[5,6]],axis=0)

Out[42]:

array([[1, 2],
       [3, 4],
       [5, 6]])
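To make the error mentioned above concrete, here is a small sketch (not from the manual) of a concatenation that fails because the arrays disagree along the non-concatenation axis:

import numpy as np

a = np.ones((2,3))
b = np.ones((3,2))

try:
    np.concatenate((a,b),axis=0)   # columns differ (3 vs 2), so this cannot work
except ValueError as e:
    print('ValueError:', e)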

5(c) Broadcasting in Numpy Arrays

Broadcasting is one of the best features of ndarrays. It lets you perform arithmetic operations between ndarrays of different sizes or between an ndarray and a simple number!

Broadcasting essentially stretches the smaller ndarray so that it matches the shape of the larger ndarray:

In [43]:

a=np.arange(10,20,2)
b=np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)
print('Multiplying an ndarray and a number :',a*2)

Adding two different size arrays :
[[12 14 16 18 20]
 [12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]

Its working can be thought of as stretching or making copies of the scalar, the number, [2, 2, 2], to match the shape of the ndarray, and then performing the operation element-wise. But no such copies are actually being made; it is just a way of thinking about how broadcasting works. This is very useful because it is more efficient to multiply an array with a scalar value than with another array! It is important to note that two ndarrays can broadcast together only when they are compatible. Ndarrays are compatible when:

1. Both have the same dimensions.

2. Either of the ndarrays has a dimension of 1. The one having a dimension of 1 is broadcast to meet the size requirements of the larger ndarray.

In case the arrays are not compatible, you will get a ValueError.

In the following example, the second ndarray is stretched, hypothetically, to a 3 x 3 shape, and then the result is calculated.

In [44]:

a=np.ones((3,3))
b=np.array([2])
a+b

Out[44]:

array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])
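And for completeness, a small sketch (not from the manual) of the incompatible case, where the trailing dimension is neither equal to the other array's nor 1:

import numpy as np

a = np.ones((3,3))
b = np.arange(2)    # shape (2,): neither 3 nor 1

try:
    a + b
except ValueError as e:
    print('ValueError:', e)   # operands could not be broadcast together ...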

Experiment 6. Perform following operations using pandas

Pandas is one of the most popular and powerful data science libraries in Python. It can be considered as the stepping stone for any aspiring data scientist who prefers to code in Python. Even though the library is easy to get started with, it can certainly do a wide variety of data manipulation. This makes Pandas one of the handiest data science libraries in the developer community. Pandas basically allows the manipulation of large datasets and data frames. It can also be considered one of the most efficient statistical tools for mathematical computations on tabular data.

Today, we'll cover some of the most important and recurring operations that we perform in Pandas. Make no mistake, there are tons of implementations and prospects of Pandas. Here we'll try to cover some notable aspects only. We'll use the analogy of Euro Cup 2020 in this tutorial. We'll start off by creating our own minimal dataset.

6(a) Creating dataframe


Let’s start off by creating a small sample dataset to try out various operations with Pandas. In this tutorial, we shall create a Football data frame that stores the record of 4 players each from Euro Cup 2020’s finalists – England and Italy.
In [45]:

import pandas as pd
# Create team data
data_england={'Name':['Kane','Sterling','Saka','Maguire'],'Age':[27,26,19,28]}
data_italy={'Name':['Immobile','Insigne','Chiellini','Chiesa'],'Age':[31,30,36,23]}

# Create Dataframe
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)
print(df_england)
print(df_italy)
       Name  Age
0      Kane   27
1  Sterling   26
2      Saka   19
3   Maguire   28
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
6(b) concat()
Let’s start by concatenating our two data frames. The word “concatenate” means to “link together in series”. Now that we have created two data frames, let’s try and “concat” them.
We do this by implementing the concat() function.

In [46]:

frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams

Out[46]:
Name  Age

0 Kane 27

1 Sterling 26

2 Saka 19

3 Maguire 28

0 Immobile 31
1 Insigne 30

2 Chiellini 36

3 Chiesa 23

A similar operation could also be done using the append() function.


In [47]:

df_england.append(df_italy)

Out[47]:
Name  Age

0 Kane 27

1 Sterling 26

2 Saka 19

3 Maguire 28

0 Immobile 31

1 Insigne 30

2 Chiellini 36

3 Chiesa 23

Now, imagine you wanted to label your original data frames with the associated countries of these players. You can do this by setting specific keys on your data frames.

In [48]:

pd.concat(frames,keys=["England","Italy"])

Out[48]:
Name Age

England 0      Kane   27

1 Sterling 26

2 Saka 19

3 Maguire 28

Italy   0  Immobile   31

1 Insigne 30

2 Chiellini 36

3 Chiesa 23
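Because the keys create a MultiIndex, one team can be pulled back out of the combined frame with .loc. A small self-contained sketch (not part of the manual):

import pandas as pd

df_england = pd.DataFrame({'Name': ['Kane','Sterling'], 'Age': [27,26]})
df_italy = pd.DataFrame({'Name': ['Immobile','Insigne'], 'Age': [31,30]})

labelled = pd.concat([df_england, df_italy], keys=["England","Italy"])
print(labelled.loc["England"])   # only the England rows come back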

6(c) Setting conditions


Conditional statements basically define conditions for data frame columns. There may be situations where you have to filter out various data by applying certain column conditions (numeric or non-numeric). For example, in an Employee data frame, you might have to list out a bunch of people whose salary is more than Rs. 50000. Also, you might want to filter the people who live in New Delhi, or whose name starts with “A”. Let’s see a hands-on example.

Imagine we want to filter experienced players from our squad. Let’s say we want to filter those players whose age is greater than or equal to 30. In that case, try doing:
In [49]:

both_teams[both_teams["Age"]>=30]

Out[49]:

Name Age

0  Immobile   31

1 Insigne 30

2 Chiellini 36
Now, let’s try to do some string filtration. We want to filter those players whose name starts with “S”. This implementation can be done with pandas’ startswith() function. Let’s try:

In [50]:

both_teams[both_teams["Name"].str.startswith('S')]

Out[50]:
Name  Age

1 Sterling 26

2 Saka 19

6(d) Adding a new column

Let’s try adding more data to our df_england data frame.


In [51]:

club=['Tottenham', 'Man City', 'Arsenal', 'Man Utd']

# 'Associated Clubs' is our new column name
df_england['Associated Clubs'] = club
df_england

Out[51]:
Name Age Associated Clubs

0 Kane 27 Tottenham

1 Sterling 26 Man City

2 Saka 19 Arsenal

3  Maguire   28          Man Utd


Let’s try repeating the concat operation after updating the data for England.

In [52]:

frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams

Out[52]:

Name  Age  Associated Clubs

0 Kane 27 Tottenham

1 Sterling 26 Man City

2 Saka 19 Arsenal

3 Maguire 28 Man Utd

0 Immobile 31 NaN

1 Insigne 30 NaN

2 Chiellini 36 NaN

3 Chiesa 23 NaN

Experiment 7. Perform following operations using pandas

7(a) Filling NaN with string


Now, what if, instead of NaN, we want to include some other text? Let’s try adding “No Data Found” instead of the NaN values.
In [53]:

both_teams['Associated Clubs'].fillna('No Data Found', inplace=True)
both_teams

Out[53]:
Name  Age  Associated Clubs

0 Kane 27 Tottenham

1 Sterling 26 Man City

2 Saka 19 Arsenal

3 Maguire 28 Man Utd

0 Immobile 31 No Data Found

1 Insigne 30 No Data Found

2 Chiellini 36 No Data Found

3 Chiesa 23 No Data Found


7(b) Sorting based on column values
Sorting operation is straightforward in Pandas. Sorting basically allows the data frame to be ordered by numbers or alphabets (in either increasing or decreasing order). Let’s try and sort the players according to their names.
In [54]:

both_teams.sort_values('Name')

Out[54]:
Name Age Associated Clubs

2 Chiellini 36 No Data Found

3 Chiesa 23 No Data Found

0 Immobile 31 No Data Found

1 Insigne 30 No Data Found

0 Kane 27 Tottenham

3 Maguire 28 Man Utd

2 Saka 19 Arsenal

1 Sterling 26 Man City


Fair enough, we sorted the data frame according to the names of the players. We did this by implementing the sort_values() function.

Let’s sort them by ages:
In [55]:

both_teams.sort_values('Age')
Out[55]:
Name Age Associated Clubs

2 Saka 19 Arsenal

3 Chiesa 23 No Data Found

1 Sterling 26 Man City

0 Kane 27 Tottenham

3 Maguire 28 Man Utd

1 Insigne 30 No Data Found

0 Immobile 31 No Data Found

2 Chiellini 36 No Data Found

Can we also sort by the oldest players? Absolutely!


In [56]:

both_teams.sort_values('Age',ascending=False)

Out[56]:
Name Age Associated Clubs

2 Chiellini 36 No Data Found

0 Immobile 31 No Data Found

1 Insigne 30 No Data Found

3 Maguire 28 Man Utd

0 Kane 27 Tottenham

1 Sterling 26 Man City

3 Chiesa 23 No Data Found

2 Saka 19 Arsenal
7(c) groupby()

Grouping is arguably the most important feature of Pandas. A groupby() function simply groups by a particular column. Let’s see a simple example by creating a new data frame.
In [57]:

a = {
    'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
    'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a.groupby('UserID').sum()

Out[57]:
Transaction

UserID

U1001 1000

U1002 300

U1003 700
Notice, we have two columns – UserID and Transaction. You can also see a repeating UserID (U1001). Let’s apply a groupby() function to it.

The function grouped the similar UserIDs and took the sum of those IDs.
If you want to unravel a particular UserID, just try mentioning the value name through get_group().

In [58]:

df_a.groupby('UserID').get_group('U1001')

Out[58]:

  UserID  Transaction
0  U1001          500
2  U1001          200
3  U1001          300
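groupby() is not limited to sum(); you can request several aggregates at once with agg(). A small sketch (not from the manual) on the same data:

import pandas as pd

df_a = pd.DataFrame({
    'UserID': ['U1001','U1002','U1001','U1001','U1003'],
    'Transaction': [500, 300, 200, 300, 700]
})
print(df_a.groupby('UserID')['Transaction'].agg(['sum','mean','count']))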

Experiment 8: Read the following file formats using pandas

a. Text files  b. CSV files  c. Excel files  d. JSON files

I have recently come across a lot of aspiring data scientists wondering why it’s so difficult to import different file formats in Python. Most of you might be familiar with the read_csv() function in Pandas, but things get tricky from there.

How do you read a JSON file in Python? How about an image file? How about multiple files all at once? These are questions you should know the answer to – but might find difficult to grasp initially.

And mastering these file formats is critical to your success in the data science industry. You’ll be working with all sorts of file formats collected from multiple data sources – that’s the reality of the modern digital age we live in.
8(a) Reading Text files

In [59]:

import pandas as pd
txtdata = pd.read_table('c:/experiment8/employee.txt')
txtdata

Out[59]:
123234877,Michael,Rogers,140

0  152934485,Anand,Manikutty,14

1 222364883,Carol,Smith,37

2 326587417,Joe,Stevens,37

3 332154719,Mary-Anne,Foster,14

4 332569843,George,ODonnell,77

5 546523478,John,Doe,59

6 631231482,David,Smith,77

7 654873219,Zacary,Efron,59

8 745685214,Eric,Goldsmith,59

9 845657245,Elizabeth,Doe,14

10 845657246,Kumar,Swamy,14
8(b) Reading CSV files
Ah, the good old CSV format. A CSV (or Comma Separated Values) file is the most common type of file that a data scientist will ever work with. These files use a “,” as a delimiter to separate the values, and each row in a CSV file is a data record.

These are useful to transfer data from one application to another, and that is probably the reason why they are so commonplace in the world of data science.

If you look at them in Notepad, you will notice that the values are separated by commas:


In [60]:

import pandas as pd
csvdata = pd.read_csv('C:/experiment8/products.csv')
csvdata
Out[60]:
    productCode                            productName   productLine productScale              productVendor                                 productDescription
0      S10_1678  1969 Harley Davidson Ultimate Chopper   Motorcycles        01:10            Min Lin Diecast  This replica features working kickstand, front...
1      S10_1949               1952 Alpine Renault 1300  Classic Cars        01:10    Classic Metal Creations  Turnable front wheels; steering function; deta...
2      S10_2016                  1996 Moto Guzzi 1100i   Motorcycles        01:10   Highway 66 Mini Classics  Official Moto Guzzi logos and insignias, saddl...
3      S10_4698   2003 Harley-Davidson Eagle Drag Bike   Motorcycles        01:10          Red Start Diecast  Model features, official Harley Davidson logos...
4      S10_4757                    1972 Alfa Romeo GTA  Classic Cars        01:10    Motor City Art Classics  Features include: Turnable front wheels; steer...
..          ...                                    ...           ...          ...                        ...                                                ...
105   S700_3505                            The Titanic         Ships  0.527777778   Carousel DieCast Legends  Completed model measures 19 1/2 inches long, 9...
106   S700_3962                         The Queen Mary         Ships  0.527777778  Welly Diecast Productions  Exact replica. Wood and Metal. Many extras inc...
107   S700_4002              American Airlines: MD-11S        Planes  0.527777778        Second Gear Diecast  Polished finish. Exact replia with official lo...
108    S72_1253                       Boeing X-32A JSF        Planes  0.091666667    Motor City Art Classics  10" Wingspan with retractable landing gears. Co...
109    S72_3212                             Pont Yacht         Ships  0.091666667       Unimax Art Galleries  Measures 38 inches Long x 33 3/4 inches High. ...

[110 rows × 9 columns]

8(c) Reading Excel files

In [61]:

import pandas as pd
exceldata = pd.read_excel('c:/experiment8/EmployeeSalaries.xlsx')
exceldata

Out[61]:
EMPNO ENAME JOB SAL DOJ

0 7788 SMITH MANAGER 55000 2021-06-12

1 8899 JONES ANALYST 50000 2021-01-14

2 9900 ADAMS CLERK 45000 2021-03-10

8(d) Reading JSON files

JSON (JavaScript Object Notation) files are lightweight and human-readable files used to store and exchange data. They are easy for machines to parse and generate, and they are based on the JavaScript programming language.

JSON files store data within {}, similar to how a dictionary stores it in Python. But their major benefit is that they are language-independent, meaning they can be used with any programming language – be it Python, C or even Java!

This is how a JSON file looks:

In [62]:

import pandas as pd
jsondata = pd.read_table('c:/experiment8/names.json')
jsondata

Out[62]:

{"BD": "Bangladesh", "BE": "Belgium", "BF": "Burkina Faso", "BG": "Bulgaria", "BA": "Bosnia
andHerzegovina", "BB": "Barbados", "WF": "Wallis and Futuna", "BL": "Saint Barthelemy",
"BM":"Bermuda","BN":"Brunei","BO":"Bolivia","BH":"Bahrain","BI":"Burundi","BJ":"Benin","BT":
"Bhutan", "JM": "Jamaica", "BV": "Bouvet Island", "BW": "Botswana", "WS": "Samoa", "BQ":
"Bonaire, Saint Eustatius and Saba ", "BR": "Brazil", "BS": "Bahamas", "JE": "Jersey", "BY":
"Belarus","BZ":"Belize","RU": "Russia","RW":"Rwanda","RS": "Serbia","TL":"East Timor",
"RE":"Reunion","TM":"Turkmenistan","TJ":"Tajikistan","RO":"Romania","TK":"Tokelau","G
W": "Guinea-Bissau", "GU": "Guam", "GT": "Guatemala", "GS": "South Georgia and the
SouthSandwich Islands", "GR": "Greece", "GQ": "Equatorial Guinea", "GP": "Guadeloupe",
"JP":"Japan", "GY": "Guyana", "GG": "Guernsey", "GF": "French Guiana", "GE": "Georgia",
"GD":
"Grenada", "GB": "United Kingdom", "GA": "Gabon", "SV": "El Salvador", "GN": "Guinea", "GM":
"Gambia","GL":"Greenland","GI":"Gibraltar","GH":"Ghana","OM":"Oman","TN":"Tunisia",
"JO": "Jordan", "HR": "Croatia", "HT": "Haiti", "HU": "Hungary", "HK": "Hong Kong",
"HN":"Honduras","HM":"HeardIslandandMcDonaldIslands","VE":"Venezuela","PR":"PuertoRico","P
S": "Palestinian Territory", "PW": "Palau", "PT": "Portugal", "SJ": "Svalbard and Jan
Mayen","PY":"Paraguay","IQ":"Iraq", "PA":"Panama","PF":"French Polynesia","PG":"PapuaNew
Guinea", "PE": "Peru", "PK": "Pakistan", "PH": "Philippines", "PN": "Pitcairn", "PL":
"Poland","PM": "Saint Pierre and Miquelon", "ZM": "Zambia", "EH": "Western Sahara", "EE":
"Estonia","EG":"Egypt", "ZA": "South Africa","EC": "Ecuador", "IT": "Italy","VN": "Vietnam",
"SB":
"Solomon Islands", "ET": "Ethiopia", "SO": "Somalia", "ZW": "Zimbabwe", "SA": "Saudi Arabia",
"ES": "Spain", "ER": "Eritrea", "ME": "Montenegro", "MD": "Moldova", "MG": "Madagascar", "MF":

26/5
FUNDAMENTALSOFDATASCIENCELAB
"Saint Martin", "MA": "Morocco", "MC": "Monaco", "UZ": "Uzbekistan", "MM": "Myanmar", "ML":
"Mali", "MO": "Macao", "MN": "Mongolia", "MH": "Marshall Islands", "MK": "Macedonia", "MU":
"Mauritius", "MT": "Malta", "MW": "Malawi", "MV": "Maldives", "MQ": "Martinique",
"MP":"Northern Mariana Islands", "MS": "Montserrat", "MR": "Mauritania", "IM": "Isle of Man",
"UG":"Uganda","TZ":"Tanzania","MY":"Malaysia","MX":"Mexico","IL":"Israel","FR":"France","IO
":"British Indian Ocean Territory", "SH": "Saint Helena", "FI": "Finland", "FJ": "Fiji", "FK":
"FalklandIslands","FM":"Micronesia","FO":"FaroeIslands","NI":"Nicaragua","NL":"Netherlands","NO
":
"Norway","NA":"Namibia","VU":"Vanuatu","NC":"NewCaledonia","NE":"Niger","NF":
"Norfolk Island", "NG": "Nigeria", "NZ": "New Zealand", "NP": "Nepal", "NR": "Nauru", "NU":
"Niue", "CK": "Cook Islands", "XK": "Kosovo", "CI": "Ivory Coast", "CH": "Switzerland", "CO":
"Colombia", "CN": "China", "CM": "Cameroon", "CL": "Chile", "CC": "Cocos Islands",
"CA":"Canada","CG":"RepublicoftheCongo","CF":"CentralAfricanRepublic","CD":"Democratic
Republic of the Congo", "CZ": "Czech Republic", "CY": "Cyprus", "CX": "Christmas Island",
"CR":"CostaRica","CW":"Curacao","CV":"CapeVerde","CU":"Cuba","SZ":"Swaziland","SY":
"Syria", "SX": "Sint Maarten", "KG": "Kyrgyzstan", "KE": "Kenya", "SS": "South Sudan", "SR":
"Suriname", "KI": "Kiribati", "KH": "Cambodia", "KN": "Saint Kitts and Nevis", "KM": "Comoros",
"ST":"SaoTomeandPrincipe","SK":"Slovakia","KR":"SouthKorea","SI":"Slovenia","KP":
"North Korea", "KW": "Kuwait", "SN": "Senegal", "SM": "San Marino", "SL": "Sierra Leone", "SC":
"Seychelles", "KZ": "Kazakhstan", "KY": "Cayman Islands", "SG": "Singapore", "SE": "Sweden",
"SD": "Sudan", "DO": "Dominican Republic", "DM": "Dominica", "DJ": "Djibouti", "DK":
"Denmark","VG":"BritishVirginIslands","DE":"Germany","YE":"Yemen","DZ":"Algeria","US":"Unite
d States", "UY": "Uruguay", "YT": "Mayotte", "UM": "United States Minor Outlying
Islands","LB":"Lebanon","LC":"SaintLucia","LA":"Laos","TV":"Tuvalu","TW":"Taiwan","TT":
"TrinidadandTobago","TR":"Turkey","LK":"SriLanka","LI":"Liechtenstein","LV":"Latvia",
"TO":"Tonga","LT":"Lithuania","LU":"Luxembourg","LR":"Liberia","LS":"Lesotho","TH":
"Thailand","TF":"FrenchSouthernTerritories","TG":"Togo","TD":"Chad","TC":"TurksandCaicos
Islands", "LY": "Libya", "VA": "Vatican", "VC": "Saint Vincent and the Grenadines", "AE":"United
Arab Emirates", "AD": "Andorra", "AG": "Antigua and Barbuda", "AF": "Afghanistan",
"AI":"Anguilla","VI": "U.S. Virgin Islands", "IS": "Iceland","IR": "Iran", "AM": "Armenia", "AL":
"Albania", "AO": "Angola", "AQ": "Antarctica", "AS": "American Samoa", "AR": "Argentina", "AU":
"Australia","AT":"Austria","AW":"Aruba","IN":"India","AX":"AlandIslands","AZ":
"Azerbaijan", "IE": "Ireland", "ID": "Indonesia", "UA": "Ukraine", "QA": "Qatar", "MZ":
"Mozambique"}


Experiment 9 Read the following file formats

a. Pickle files  b. Image files using PIL  c. Multiple files using Glob  d. Importing data from a database

9(a) Reading pickle files


What is pickling? Pickle is used for serializing and de-serializing Python object structures, also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this character stream can be retrieved and de-serialized back to a Python object.

Pickling is useful for applications where you need some degree of persistency in your data. Your program's state data can be saved to disk, so you can continue working on it later on. It can also be used to send data over a Transmission Control Protocol (TCP) or socket connection, or to store Python objects in a database. Pickle is very useful when you're working with machine learning algorithms, where you want to save a trained model to be able to make new predictions at a later time, without having to rewrite everything or train the model all over again.

If you want to use data across different programming languages, pickle is not recommended. Its protocol is specific to Python, thus cross-language compatibility is not guaranteed. The same holds for different versions of Python itself. Unpickling a file that was pickled in a different version of Python may not always work properly.

What can be pickled? You can pickle objects of the following data types: Booleans, Integers, Floats, Complex numbers, (normal and Unicode) Strings, Tuples, Lists, Sets, and Dictionaries that contain picklable objects.

Pickle vs JSON: JSON stands for JavaScript Object Notation. It's a lightweight format for data interchange that is easily readable by humans. Although it was derived from JavaScript, JSON is standardized and language-independent. This is a serious advantage over pickle. It's also more secure and much faster than pickle.

However, if you only need to use Python, then the pickle module is still a good choice for its ease of use and ability to reconstruct complete Python objects.

An alternative is cPickle. It is nearly identical to pickle, but written in C, which makes it up to 1000 times faster. For small files, however, you won't notice the difference in speed. Both produce the same data streams, which means that pickle and cPickle can use the same files.

In [63]:

#!/usr/bin/env python3

# How to use the Python pickle module to store arbitrary python data as a file.
# Cleaned up from the demo at: https://pythontips.com/2013/08/02/what-is-pickle-in-python/

import pickle

# Create something to store
a = ['test value1', 'test value 2', 'test value 3']
# And something to compare against later
b = []

# Dump a into a pickle file as bytes
with open("testPickleFile", 'wb') as f:
    pickle.dump(a, f)

# Load from the previous pickle file as bytes
with open("testPickleFile", 'rb') as f:
    b = pickle.load(f)

# Now we can compare the original data vs the loaded pickle data
print(b)      # ['test value1', 'test value 2', 'test value 3']
print(a == b) # True
# And it's equivalent!
# That's storing and loading data via pickle!

['test value1', 'test value 2', 'test value 3']
True

9(b) Reading Image files using PIL



The advent of Convolutional Neural Networks (CNNs) has opened the flood gates to working in the computer vision domain and solving problems like object detection, object classification, generating new images and whatnot!

But before you jump into working on these problems, you need to know how to open your images in Python. Let’s see how we can do that by retrieving an image from the web page that we stored in our local folder.

You will need the Python PIL (Python Imaging Library) package for this job.

Simply call the open() function in the Image module of PIL and pass in the path to your image:

In [64]:

from PIL import Image
Image.open('c:/experiment9/college_pic.jpeg')

Out[64]:

9(c) Reading multiple image files using Glob

And now, what if you want to read multiple files in one go? That’s quite a common challenge in data science projects.

Python’s Glob module lets you traverse through multiple files in the same location. Using glob.glob(), we can import all the files from our local folder that match a special pattern.

These filename patterns can be made using different wildcards like “*” (for matching multiple characters), “?” (for matching any single character), or ‘[0-9]’ (for matching any number). Let’s see glob in action below.


When importing multiple files from the same directory as your Python script, we can use the “*” wildcard:

In [65]:

import cv2
import glob
import matplotlib.pyplot as plt

for img in glob.glob('c:/experiment9/glob_pics/*.*'):
    cv_img = cv2.imread(img)
    plt.imshow(cv_img)
    plt.show()


9(d) Importing data from database

When you are working on a real-world project, you would need to connect your program to a database to retrieve data. There is no way around it (that’s why learning SQL is an important part of your data science journey). Data in databases is stored in the form of tables, and these systems are known as Relational Database Management Systems (RDBMS). However, connecting to an RDBMS and retrieving data from it can prove to be quite a challenging task. Here’s the good news – we can easily do this using Python’s built-in modules! One of the most popular RDBMS is SQLite. It has many plus points:

1. Lightweight database and hence easy to use in embedded software
2. 35% faster reading and writing compared to the File System
3. No intermediary server required; reading and writing are done directly from the database files on the disk
4. Cross-platform database file format. This means a file written on one machine can be copied to and used on a different machine with a different architecture.

There are many more reasons for its popularity. But for now, let’s connect with an SQLite database and retrieve our data!

You will need to import the sqlite3 module to use SQLite. Then, you need to work through the following steps to access your data:

1. Create a connection with the database using connect(). You need to pass the name of your database to access it. It returns a Connection object.
2. Once you have done that, you need to create a cursor object using the cursor() function. This will allow you to implement SQL commands with which you can manipulate your data.
3. You can execute the commands in SQL by calling the execute() function on the cursor object. Since we are retrieving data from the database, we will use the SELECT statement and store the query result in an object.
4. Store the data from the object into a dataframe by calling either fetchone(), for one row, or fetchall(), for all the rows, on the object.

And just like that, you have retrieved the data from the database into a Pandas dataframe!

A good practice is to save/commit your transactions using the commit() function even if you are only reading the data.


In [66]:

import pandas as pd
import sqlite3

# open engine connection
con = sqlite3.connect('c:/experiment9/chinook.db')

# create a cursor object
cur = con.cursor()

# Perform query: rs
rs = cur.execute('select * from playlists')

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Commit the transaction
con.commit()

# Print head of DataFrame df
df

Out[66]:

0 1

0 1 Music

1 2 Movies

2 3 TV Shows

3 4 Audiobooks

4 5 90’sMusic

5 6 Audiobooks

6 7 Movies

7 8 Music

8 9 MusicVideos

9 10 TV Shows

10 11 Brazilian Music

11 12 Classical

12 13 Classical 101 - Deep Cuts

13 14 Classical 101 - Next Steps

14 15 Classical 101 - The Basics

15 16 Grunge

16 17 Heavy Metal Classic

17 18 On-The-Go 1
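The cursor steps above can also be condensed: pandas offers read_sql_query(), which runs the query and builds the DataFrame in one call. A small sketch (not part of the manual), assuming the same example database:

import pandas as pd
import sqlite3

con = sqlite3.connect('c:/experiment9/chinook.db')
df = pd.read_sql_query('select * from playlists', con)
con.close()
print(df.head())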


Experiment 10. Demonstrate web scraping using python

Web Scraping refers to extracting large amounts of data from the web. This is important for a data scientist who has to analyze large amounts of data. Python provides a very handy module called requests to retrieve data from any website. The requests.get() function takes a URL as its parameter and returns the HTML response as its output. The way it works is summarized in the following steps:

1. It packages the GET request to retrieve data from the webpage
2. It sends the request to the server
3. It receives the HTML response and stores it in a response object
For this example, I want to show you a bit about my city – Delhi. So, I will retrieve data from the Wikipedia pageon Delhi:

In [67]:

import requests

# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27f
url = "https://en.wikipedia.org/wiki/Delhi"

# response object
resp = requests.get(url)

# using the text attribute of the response object, return the HTML of the webpage as a string
text = resp.text
print(text)
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Delhi - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!
1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":
["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","Februar
y","March","April","May","June","July","August","September","October","November","December"]
,"wgRequestId":"8a49cc42-95b2-4c25-a1a1-8805d16417bd","wgCSPNonce":!
1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!
1,"wgNamespaceNumber":0,"wgPageName":"Delhi","wgTitle":"Delhi","wgCurRevisionId":105347
4153,"wgRevisionId":1053474153,"wgArticleId":37756,"wgIsArticl
e":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroup
s":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","CS1errors: missing
periodical","CS1 maint: archived copy as title","Webarchive template wayback links","CS1 maint:
numeric names: authors list","Articles with short description","Short description is different from
Wikidata","Wikipediaindefinitelysemi-protectedpages","UseIndianEnglishfromOct

But as you can see, the data is not very readable. The tree-like structure of the HTML content retrieved by our request is not very comprehensible. To improve this readability, Python has another wonderful library called BeautifulSoup.

BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting data from the HTML document.

Right, let’s see the wonder of BeautifulSoup.

To make it work, we need to pass the text response from the request object to BeautifulSoup(), which creates its own object – “soup” in this case. Calling prettify() on the BeautifulSoup object parses the tree-like structure of the HTML document:

In [68]:

import requests
from bs4 import BeautifulSoup

# url
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27f
url = "https://en.wikipedia.org/wiki/Delhi"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Print the response
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
Delhi - Wikipedia
</title>
<script>
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!
1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":
["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","Februar
y","March","April","May","June","July","August","September","October","November","December"]
,"wgRequestId":"8a49cc42-95b2-4c25-a1a1-8805d16417bd","wgCSPNonce":!
1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!
1,"wgNamespaceNumber":0,"wgPageName":"Delhi","wgTitle":"Delhi","wgCurRevisionId":105347
4153,"wgRevisionId":1053474153,"wgArticleId":37756,"wgIsArticl
e":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroup
s":["*"],"wgCategories":["Pages with non-numeric formatnum
arguments","CS1errors:missingperiodical","CS1maint:archivedcopyastitle","Webarchiv

Experiment 11. Perform following preprocessing techniques on loan prediction dataset

In simple words, pre-processing refers to the transformations applied to your data before feeding it to the algorithm.

In Python, the scikit-learn library has pre-built functionality under sklearn.preprocessing. There are many more options for pre-processing, which we'll explore.
In [69]:

# Importing pandas
import pandas as pd
# Importing training data set
X_train = pd.read_csv('C:\\loan_prediction-1\\X_train.csv')
Y_train = pd.read_csv('C:\\loan_prediction-1\\Y_train.csv')
# Importing testing data set
X_test = pd.read_csv('C:\\loan_prediction-1\\X_test.csv')
Y_test = pd.read_csv('C:\\loan_prediction-1\\Y_test.csv')

In [70]:

print(X_train.head())

    Loan_ID Gender Married Dependents Education Self_Employed  \
0  LP001032   Male      No          0  Graduate            No
1  LP001824   Male     Yes          1  Graduate            No
2  LP002928   Male     Yes          0  Graduate            No
3  LP001814   Male     Yes          2  Graduate            No
4  LP002244   Male     Yes          0  Graduate            No

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             4950                0.0         125               360
1             2882             1843.0         123               480
2             3000             3416.0          56               180
3             9703                0.0         112               360
4             2333             2417.0         136               360

Credit_History Property_Area
0 1 Urban
1 1 Semiurban
2 1 Semiurban
3 1 Urban
4 1 Urban

11(a) Feature Scaling

Feature scaling is the method to limit the range of variables so that they can be compared on common grounds. It is performed on continuous variables. Let's plot the distribution of all the continuous variables in the data set.
In [71]:

import matplotlib.pyplot as plt
X_train[X_train.dtypes[(X_train.dtypes == "float64") | (X_train.dtypes == "int64")]
        .index.values].hist(figsize=[11,11])

c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\pandas\plotting\_matplotlib\
tools.py:331: MatplotlibDeprecationWarning:
The is_first_col function was deprecated in Matplotlib 3.4 and will be removed two minor releases later.
Use ax.get_subplotspec().is_first_col() instead.
  if ax.is_first_col():

Out[71]:

array([[<AxesSubplot:title={'center':'ApplicantIncome'}>,
<AxesSubplot:title={'center':'CoapplicantIncome'}>],
[<AxesSubplot:title={'center':'LoanAmount'}>,
<AxesSubplot:title={'center':'Loan_Amount_Term'}>],
[<AxesSubplot:title={'center':'Credit_History'}>, <AxesSubplot:>]],dtype=object)

In [72]:

# Importing MinMaxScaler and initializing it
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
# Scaling down both train and test data set
# (note: strictly speaking, the scaler should be fit on the training data only
# and then reused to transform the test data)
X_train_minmax = min_max.fit_transform(X_train[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
X_test_minmax = min_max.fit_transform(X_test[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
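For intuition, MinMaxScaler applies the simple rule x_scaled = (x - min) / (max - min) to each column. A small sketch (not part of the manual) of the same arithmetic by hand:

import numpy as np

x = np.array([1., 5., 9.])
x_scaled = (x - x.min()) / (x.max() - x.min())   # the rule MinMaxScaler applies per column
print(x_scaled)   # [0.  0.5 1. ]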

In [73]:

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
# Fitting k-NN on our scaled data
knn = KNeighborsClassifier()
knn.fit(X_train_minmax, Y_train)
# Checking the model's accuracy
accuracy_score(Y_test, knn.predict(X_test_minmax))

<ipython-input-73-19507733bb8b>:5: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  knn.fit(X_train_minmax,Y_train)

Out[73]:

0.75

11(b) Feature Standardization

In [74]:

# Standardizing the train and test data
from sklearn.preprocessing import scale
from sklearn.metrics import accuracy_score
X_train_scale = scale(X_train[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
X_test_scale = scale(X_test[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
# Fitting logistic regression on our standardized data set
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', C=.01)
log.fit(X_train_scale, Y_train)
# Checking the model's accuracy
accuracy_score(Y_test, log.predict(X_test_scale))

c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

Out[74]:

0.75

11(c) Label Encoding

In previous sections, we did the pre-processing for continuous numeric features. But our data set has other features too, such as Gender, Married, Dependents, Self_Employed and Education. All these categorical features have string values. For example, Gender has two levels, either Male or Female. Let's feed these features into our logistic regression model.
In [75]:

# Importing LabelEncoder and initializing it
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Iterating over all the common columns in train and test
for col in X_test.columns.values:
    # Encoding only categorical variables
    if X_test[col].dtypes == 'object':
        # Using whole data to form an exhaustive list of levels
        data = X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
X_train.head()

c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

Out[76]:

0.7395833333333334

11(d) One Hot Encoding

In [77]:

# We are using the scaled variables, as we saw in the previous section that
# scaling will affect the algo with an l1 or l2 regularizer
X_train_scale = scale(X_train)
X_test_scale = scale(X_test)
# Fitting a logistic regression model
log = LogisticRegression(penalty='l2', C=1)
log.fit(X_train_scale, Y_train)
# Checking the model's accuracy
accuracy_score(Y_test, log.predict(X_test_scale))

c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

Out[77]:

0.7395833333333334

In [78]:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False)

X_train_1 = X_train
X_test_1 = X_test

columns = ['Gender', 'Married', 'Dependents', 'Education',
           'Self_Employed', 'Credit_History', 'Property_Area']

for col in columns:
    # creating an exhaustive list of all possible categorical values
    data = X_train[[col]].append(X_test[[col]])
    enc.fit(data)

    # Fitting One Hot Encoding on train data
    temp = enc.transform(X_train[[col]])
    # Changing the encoded features into a data frame with new column names
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    # In side-by-side concatenation the index values should be the same,
    # so set the index values similar to the X_train data frame
    temp = temp.set_index(X_train.index.values)
    # adding the new One Hot Encoded variables to the train data frame
    X_train_1 = pd.concat([X_train_1, temp], axis=1)

    # fitting One Hot Encoding on test data
    temp = enc.transform(X_test[[col]])
    # changing it into a data frame and adding column names
    temp = pd.DataFrame(temp, columns=[(col + "_" + str(i)) for i in data[col].value_counts().index])
    # Setting the index for proper concatenation
    temp = temp.set_index(X_test.index.values)
    # adding the new One Hot Encoded variables to the test data frame
    X_test_1 = pd.concat([X_test_1, temp], axis=1)
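For intuition, here is a minimal sketch of what the encoder produces on a toy column: each level becomes its own 0/1 indicator column, ordered as in enc.categories_. The toy values are illustrative, and note that recent scikit-learn releases rename the sparse argument to sparse_output:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

toy = np.array([['Urban'], ['Rural'], ['Semiurban'], ['Urban']])

enc = OneHotEncoder(sparse=False)  # use sparse_output=False on scikit-learn >= 1.2
onehot = enc.fit_transform(toy)

print(enc.categories_)  # [array(['Rural', 'Semiurban', 'Urban'], dtype='<U9')]
print(onehot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]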

In [79]:

# Standardizing the data set
X_train_scale = scale(X_train_1)
X_test_scale = scale(X_test_1)
c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)

Out[79]:

0.75

Experiment 12. Perform the following visualizations using matplotlib


In [80]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn')

In [81]:

df_meal = pd.read_csv('C:\\experiment12\\meal_info.csv')
df_meal.head()

Out[81]:

meal_id category cuisine

0 1885 Beverages Thai

1 1993 Beverages Thai

2 2539 Beverages Thai

3 1248 Beverages Indian

4 2631 Beverages Indian

In [82]:

df_center = pd.read_csv('C:\\experiment12\\fulfilment_center_info.csv')
df_center.head()
Out[82]:

center_id city_code region_code center_type op_area

0 11 679 56 TYPE_A 3.7

1 13 590 56 TYPE_B 6.7

2 124 590 56 TYPE_C 4.0

3 66 648 34 TYPE_A 4.1

4 94 632 34 TYPE_C 3.6

In [83]:

df_food = pd.read_csv('C:\\experiment12\\train.csv')
df_food.head()

Out[83]:

   id       week  center_id  meal_id  checkout_price  base_price  emailer_for_promotion  hom...

0  1379560  1     55         1885     136.83          152.29      0
1  1466964  1     55         1993     136.83          135.83      0
2  1346989  1     55         2539     134.86          135.86      0
3  1338232  1     55         2139     339.50          437.53      0
4  1448490  1     55         2631     243.50          242.50      0

In [84]:

df = pd.merge(df_food, df_center, on='center_id')
df = pd.merge(df, df_meal, on='meal_id')
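pd.merge joins two data frames on a key column, much like a SQL join; the two calls above first attach centre details to every order, then meal details. A tiny sketch with made-up values:

import pandas as pd

orders = pd.DataFrame({'meal_id': [1885, 1993], 'num_orders': [120, 80]})
meals = pd.DataFrame({'meal_id': [1885, 1993], 'category': ['Beverages', 'Starters']})

# Inner join on meal_id: every order row picks up its meal's category
print(pd.merge(orders, meals, on='meal_id'))
#    meal_id  num_orders   category
# 0     1885         120  Beverages
# 1     1993          80   Starters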

12(a) Bar Graph

In [85]:

table = pd.pivot_table(data=df, index='category', values='num_orders', aggfunc=np.sum)
table

Out[85]:

              num_orders
category
Beverages       40480525
Biryani           631848
Desert           1940754
Extras           3984979
Fish              871959
Other Snacks     4766293
Pasta            1637744
Pizza            7383720
Rice Bowl       20874063
Salad           10944336
Sandwich        17636782
Seafood          2715714
Soup             1039646
Starters         4649122
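Since pivot tables drive most of the plots in this experiment, it may help to see pivot_table in miniature: it groups the rows by the index column and applies aggfunc to the values column within each group. Toy numbers, unrelated to the lab data:

import pandas as pd
import numpy as np

toy = pd.DataFrame({'category': ['Pizza', 'Salad', 'Pizza', 'Salad'],
                    'num_orders': [10, 5, 30, 15]})

# One row per category, summing num_orders within each group
print(pd.pivot_table(data=toy, index='category', values='num_orders', aggfunc=np.sum))
#           num_orders
# category
# Pizza             40
# Salad             20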
In [86]:

#bar graph
plt.bar(table.index, table['num_orders'])

#xticks
plt.xticks(rotation=70)

#x-axis labels
plt.xlabel('Food item')

#y-axis labels
plt.ylabel('Quantity sold')

#plot title
plt.title('Most popular food')

#save plot
# plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_6.png', dpi=...)

#display plot
plt.show()


In [87]:

#dictionary for meals per food item
item_count = {}

for i in range(table.index.nunique()):
    item_count[table.index[i]] = table.num_orders[i] / df_meal[df_meal['category'] == table.index[i]].shape[0]

#bar plot
plt.bar([x for x in item_count.keys()], [x for x in item_count.values()], color='orange')

#adjust xticks
plt.xticks(rotation=70)

#label x-axis
plt.xlabel('Food item')

#label y-axis
plt.ylabel('No. of meals')

#label the plot
plt.title('Meals per food item')

#save plot
# plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_7.png', dpi=...)

#display plot
plt.show();


12(b) Pie Chart

In [88]:

import matplotlib.pyplot as plt

# Data to plot
labels = 'Python', 'C++', 'Ruby', 'Java'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()
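A few notes on the arguments, from matplotlib's pie API: autopct='%1.1f%%' prints each wedge's share as a percentage with one decimal place (the doubled %% escapes the percent sign), explode=(0.1, 0, 0, 0) offsets only the first slice from the centre, startangle=140 rotates the whole pie counter-clockwise, and plt.axis('equal') forces equal axis scaling so the pie is drawn as a circle rather than an ellipse.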

12(c) Box Plot

In [89]:

#dictionary for base price per cuisine
c_price = {}

for i in df['cuisine'].unique():
    c_price[i] = df[df['cuisine'] == i].base_price
In [90]:

#plotting boxplot
plt.boxplot([x for x in c_price.values()], labels=[x for x in c_price.keys()])

#x and y-axis labels
plt.xlabel('Cuisine')
plt.ylabel('Price')

#plot title
plt.title('Analysing cuisine price')

#save and display
# plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_9.png', dpi=...)
plt.show()

12(d) Histogram

In [91]:

#plotting histogram
plt.hist(df['base_price'], rwidth=0.9, alpha=0.3, color='blue', bins=15, edgecolor='red')

#x and y-axis labels
plt.xlabel('Base price range')
plt.ylabel('Distinct order')

#plot title
plt.title('Inspecting price effect')

#save and display the plot
#plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_10.png', dpi=...)
plt.show()


12(e) Line Chart and Subplots

In [92]:

#new revenue column
df['revenue'] = df.apply(lambda x: x.checkout_price * x.num_orders, axis=1)

#new month column
df['month'] = df['week'].apply(lambda x: x // 4)

#list to store month-wise revenue
month = []
month_order = []

for i in range(max(df['month'])):
    month.append(i)
    month_order.append(df[df['month'] == i].revenue.sum())

#list to store week-wise revenue
week = []
week_order = []

for i in range(max(df['week'])):
    week.append(i)
    week_order.append(df[df['week'] == i].revenue.sum())

#subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))

#manipulating the first Axes
ax[0].plot(week, week_order)
ax[0].set_xlabel('Week')
ax[0].set_ylabel('Revenue')
ax[0].set_title('Weekly income')

#manipulating the second Axes
ax[1].plot(month, month_order)
ax[1].set_xlabel('Month')
ax[1].set_ylabel('Revenue')
ax[1].set_title('Monthly income')

plt.show()
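As the comment above notes, plt.subplots(nrows=1, ncols=2) returns a Figure plus a NumPy array of Axes, one per panel, which is why the code indexes ax[0] and ax[1]. A minimal self-contained sketch:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# ax is an array of Axes objects, one per panel; each is styled independently
ax[0].plot([0, 1, 2], [0, 1, 4])
ax[0].set_title('left panel')

ax[1].plot([0, 1, 2], [0, 2, 1])
ax[1].set_title('right panel')

plt.tight_layout()
plt.show()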
12(f) Scatter Plot
In [93]:

center_type_name = ['TYPE_A', 'TYPE_B', 'TYPE_C']

#relation between op area and number of orders
op_table = pd.pivot_table(df, index='op_area', values='num_orders', aggfunc=np.sum)

#relation between center type and op area
c_type = {}
for i in center_type_name:
    c_type[i] = df[df['center_type'] == i].op_area

#relation between center type and num of orders
center_table = pd.pivot_table(df, index='center_type', values='num_orders', aggfunc=np.sum)

#subplots
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(8, 12))

#scatter plots
ax[0].scatter(op_table.index, op_table['num_orders'], color='pink')
ax[0].set_xlabel('Operation area')
ax[0].set_ylabel('Number of orders')
ax[0].set_title('Does operation area affect num of orders?')
ax[0].annotate('optimum operation area of 4 km^2', xy=(4.2, 1.1*10**7), xytext=(7, 1.1*10**7),
               arrowprops=dict(facecolor='black', shrink=0.05))

#boxplot
ax[1].boxplot([x for x in c_type.values()], labels=[x for x in c_type.keys()])
ax[1].set_xlabel('Center type')
ax[1].set_ylabel('Operation area')
ax[1].set_title('Which center type had the optimum operation area?')

#bar graph
ax[2].bar(center_table.index, center_table['num_orders'], alpha=0.7, color='orange', width=0.5)
ax[2].set_xlabel('Center type')
ax[2].set_ylabel('Number of orders')
ax[2].set_title('Orders per center type')

#show figure
plt.tight_layout()
#plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png', dpi=...)
plt.show();


In [ ]:

