FUNDAMENTALS OF DATA SCIENCE LAB - Jupyter Notebook
NumPy is one of the most fundamental libraries in Python and perhaps the most useful of them all. NumPy handles large datasets effectively and efficiently. I can see your eyes glinting at the prospect of mastering NumPy already. As a data scientist, or as an aspiring data science professional, we need to have a solid grasp of NumPy and how it works in Python.
NumPy stands for Numerical Python and is one of the most useful scientific libraries in Python programming. It provides support for large multi-dimensional array objects and various tools to work with them. Various other libraries like Pandas, Matplotlib, and Scikit-learn are built on top of this amazing library.
Arrays are a collection of elements/values that can have one or more dimensions. An array of one dimension is called a vector, while an array of two dimensions is called a matrix. NumPy arrays are called ndarrays, or N-dimensional arrays, and they store elements of the same type and size. NumPy is known for its high performance and provides efficient storage and data operations as arrays grow in size. NumPy comes pre-installed when you download Anaconda. But if you want to install NumPy separately on your machine, just type the below command on your terminal:
pip install numpy
np is the de facto abbreviation for NumPy used by the data science community.
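In practice, this just means importing NumPy under that alias, which all the examples below assume:

import numpy as np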
If you’re familiar with Python, you might be wondering why we use NumPy arrays when we already have Python lists. After all, these Python lists act as an array that can store elements of various types. This is a perfectly valid question, and the answer to this is hidden in the way Python stores an object in memory.
A Python object is actually a pointer to a memory location that stores all the details about the object, like bytes and the value. Although this extra information is what makes Python a dynamically typed language, it also comes at a cost which becomes apparent when storing a large collection of objects, like in an array.
Python lists are essentially an array of pointers, each pointing to a location that contains the information related to the element. This adds a lot of overhead in terms of memory and computation. And most of this information is rendered redundant when all the objects stored in the list are of the same type!
To overcome this problem, we use NumPy arrays that contain only homogeneous elements, i.e. elements having the
same datatype. This makes it more efficient at storing and manipulating the array. This difference becomes apparent
when the array has a large number of elements, say thousands or millions. Also, with NumPy arrays, you can perform
element-wise operations, something which is not possible using Python lists!
This is the reason why NumPy arrays are preferred over Python lists when performing mathematical operations on a large
amount of data.
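To make this concrete, here is a minimal sketch of the difference (the numbers are only illustrative):

import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

print(py_list * 2)   # [1, 2, 3, 1, 2, 3] -- the list is repeated, not multiplied
print(np_arr * 2)    # [2 4 6] -- every element is multiplied
# py_list + 1 would raise a TypeError, while np_arr + 1 gives [2 3 4]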
Experiment 1. Creating a NumPy Array
In [1]:
import numpy as np
np.array([1,2,3,4])
Out[1]:
array([1, 2, 3, 4])
This array contains integer values. You can specify the type of data in the data type argument:
In [2]:
np.array([1,2,3,4],dtype=np.float32)
Out[2]:
array([1., 2., 3., 4.], dtype=float32)
Since NumPy arrays can contain only homogeneous datatypes, values will be upcast if the types do not match.
In the following example, NumPy has upcast integer values to float values.
In [3]:
np.array([1,2.0,3,4])
Out[3]:
array([1., 2., 3., 4.])
NumPy arrays can be multi-dimensional too. A matrix is just a rectangular array of numbers with shape N x M, where N is the number of rows and M is the number of columns in the matrix.
In [4]:
np.array([[1,2,3,4],[5,6,7,8]])
Out[4]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
NumPy lets you create an array of all zeros using the np.zeros() method. All you have to do is pass the shape of the
desired array:
In [5]:
np.zeros(5)
Out[5]:
array([0., 0., 0., 0., 0.])
The one above is a 1-D array while the one below is a 2-D array of all zeros:
In [6]:
np.zeros((2,3))
Out[6]:
array([[0., 0., 0.],
       [0., 0., 0.]])
You could also create an array of all 1s using the np.ones() method:
In [7]:
np.ones(5,dtype=np.int32)
Out[7]:
array([1, 1, 1, 1, 1])
Another very commonly used method to create ndarrays is the np.random.rand() method. It creates an array of a given shape with random values from [0, 1):
In [8]:
np.random.rand(2,3)
Out[8]:
(the values are random, so the output differs on every run)
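If you need reproducible random numbers, you can seed the generator first; a minimal sketch:

import numpy as np
np.random.seed(42)   # fix the seed so the same values are produced each run
print(np.random.rand(2,3))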
In fact, you can create an array filled with any given value using the np.full() method. Just pass in the shape of the
desired array and the value you want:
In [9]:
np.full((2,2),7)
Out[9]:
array([[7, 7],
[7, 7]])
An identity matrix is a square matrix that has 1s along its main diagonal and 0s everywhere else.
Note: A square matrix has an N x N shape. This means it has the same number of rows and columns.
Note: A matrix is called the Identity matrix only when the 1s are along the main diagonal and not any other diagonal.
In [10]:
np.eye(3)
Out[10]:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
However, NumPy gives you the flexibility to change the diagonal along which the values have to be 1s. You can either move it above the main diagonal or below it; the result is then no longer an identity matrix:
In [11]:
np.eye(3,k=1)
Out[11]:
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
In [12]:
np.eye(3,k=-2)
Out[12]:
array([[0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.]])
You can quickly get an evenly spaced array of numbers using the np.arange() method:
In [13]:
np.arange(5)
Out[13]:
array([0, 1, 2, 3, 4])
The start, end, and step size of the interval of values can be explicitly defined by passing in three numbers as arguments for these values respectively. A point to be noted here is that the interval is defined as [start, end), where the last number will not be included in the array:
In [14]:
np.arange(2,10,2)
Out[14]:
array([2, 4, 6, 8])
Alternate elements were printed because the step size was defined as 2. Notice that 10 was not printed as it was the last element.
Another similar function is np.linspace(), but instead of a step size, it takes in the number of samples that need to be generated within the interval. A point to note here is that the last number is included in the values returned, unlike in the case of np.arange().
In [15]:
np.linspace(0,1,5)
Out[15]:
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
Once you have created your ndarray, the next thing you would want to do is check the number of axes, the shape, and the size of the ndarray.
You can easily determine the number of dimensions or axes of a NumPy array using the ndim attribute:
In [16]:
a=np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
Array :
 [[ 5 10 15]
 [20 25 20]]
Dimensions :
 2
The shape is an attribute of the NumPy array that shows how many rows of elements there are along each dimension. You can further index the shape returned by the array to get the value along each dimension:
In [17]:
a=np.array([[1,2,3],[4,5,6]])
print('Array :','\n',a)
print('Shape :','\n',a.shape)
print('Rows =',a.shape[0])
print('Columns =',a.shape[1])
Array :
[[1 2 3]
[4 5 6]]
Shape :
(2, 3)
Rows = 2
Columns = 3
In [18]:
a=np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
Size of array : 6
Manual determination of size of array: 6
The reshape() method changes the shape of the ndarray without changing the data within it:
In [19]:
a=np.array([3,6,9,12])
np.reshape(a,(2,2))
Out[19]:
array([[ 3,  6],
       [ 9, 12]])
While reshaping, if you are unsure about the shape of any of the axes, just input -1. NumPy automatically calculates the shape when it sees a -1:
In [20]:
a=np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
Three rows :
 [[ 3  6]
 [ 9 12]
 [18 24]]
Three columns :
 [[ 3  6  9]
 [12 18 24]]
Sometimes when you have a multidimensional array and want to collapse it to a single-dimensional array, you can either use the flatten() method or the ravel() method:
But an important difference between flatten() and ravel() is that the former returns a copy of the original array while the latter returns a reference to the original array. This means any changes made to the array returned from ravel() will also be reflected in the original array, while this will not be the case with flatten().
In [21]:
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :', b.shape)
print('Array :','\n', b)
print('Shape after ravel :', c.shape)
print('Array :','\n', c)
Original shape : (2, 2)
Array :
 [[1. 1.]
 [1. 1.]]
Shape after flatten : (4,)
Array :
 [1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
 [1. 1. 1. 1.]
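A minimal sketch of the copy-versus-view behaviour described above:

import numpy as np
a = np.ones((2,2))
r = a.ravel()    # a view of a (when the memory layout allows it)
f = a.flatten()  # always a copy
r[0] = 99        # shows up in a
f[1] = 55        # does not affect a
print(a)         # [[99.  1.]
                 #  [ 1.  1.]]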
f. Transpose of a NumPy array
Another very interesting reshaping method of NumPy is the transpose() method. It takes the input array and swaps its rows with its columns, and its columns with its rows:
In [22]:
a=np.array([[1,2,3],
[4,5,6]])
b=np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Transpose :','\n','Shape',b.shape,'\n',b)
Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Transpose :
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]
You can add a new axis to an array using the expand_dims() method by providing the array and the axis along which to
expand:
In [23]:
a=np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(a,axis=1)
print('Original :','\n','Shape',a.shape,'\n',a)
print('Expand along columns :','\n','Shape',b.shape,'\n',b)
print('Expand along rows :','\n','Shape',c.shape,'\n',c)
Original :
Shape (3,)
 [1 2 3]
Expand along columns :
Shape (1, 3)
 [[1 2 3]]
Expand along rows :
Shape (3, 1)
 [[1]
 [2]
 [3]]
On the other hand, if you instead want to reduce the axes of the array, use the squeeze() method. It removes any axis that has a single entry. This means if you have created a 2 x 2 x 1 matrix, squeeze() will remove the third dimension from the matrix:
In [24]:
a=np.array([[[1,2,3],
[4,5,6]]])
b=np.squeeze(a,axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array :','\n','Shape',b.shape,'\n',b)
Original
Shape (1, 2, 3)
[[[1 2 3]
[4 5 6]]]
Squeeze array:
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Sorting is an important and very basic operation that you might well use on a daily basis as a data scientist. So, it is important to use a good sorting algorithm with minimum time complexity.
The NumPy library is a legend when it comes to sorting elements of an array. It has a range of sorting functions that you can use to sort your array elements. It has implemented quicksort, heapsort, mergesort, and timsort for you under the hood when you use the sort() method:
In [25]:
a=np.array([[5,6,7,4],
            [9,2,3,7]])
# sort along the column
print('Sort along column :','\n',np.sort(a,kind='mergesort',axis=1))
Sort along column :
 [[4 5 6 7]
 [2 3 7 9]]
Slicing means retrieving elements from one index to another index. All we have to do is pass the starting and ending point in the index like this: [start : end].
In [27]:
a = np.array([1,2,3,4,5,6])
print(a[1:5:2])
[2 4]
Notice that the last element did not get considered. This is because slicing includes the start index but excludes the end index. A way around this is to write the next higher index than the final index value you want to retrieve:
In [28]:
a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
[2 4 6]
If you don’t specify the start or end index, it is taken as 0 or the array size, respectively, by default. And the step size by default is 1.
In [29]:
a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])
[1 3 5]
[2 4 6]
[2 3 4 5 6]
In [30]:
a=np.array([[1,2,3],
[4,5,6]])
print(a[0,0])
print(a[1,2])
print(a[1,0])
1
6
4
Here, we provided the row value and column value to identify the element we wanted to extract. While in a 1-D array, we were only providing the column value, since there was only one row.
So, to slice a 2-D array, you need to mention the slices for both the row and the column:
In [31]:
a=np.array([[1,2,3],[4,5,6]])
# print first row values
print('First row values :','\n',a[0:1,:])
# with step-size for columns
print('Alternate values from first row :','\n',a[0:1,::2])
# second column values
print('Second column values :','\n',a[:,1::2])
# arbitrary values
print('Arbitrary values :','\n',a[0:1,1:3])
First row values :
 [[1 2 3]]
Alternate values from first row :
 [[1 3]]
Second column values :
 [[2]
 [5]]
Arbitrary values :
 [[2 3]]
In [32]:
a=np.array([[[1,2],[3,4],[5,6]],        # first axis array
            [[7,8],[9,10],[11,12]],     # second axis array
            [[13,14],[15,16],[17,18]]]) # third axis array
# 3-D array
print(a)
[[[ 1  2]
  [ 3  4]
  [ 5  6]]
 [[ 7  8]
  [ 9 10]
  [11 12]]
 [[13 14]
  [15 16]
  [17 18]]]
In [33]:
# print values
print('First array, first row, first column value :','\n',a[0,0,0])
print('First array last column :','\n',a[0,:,1])
print('First two rows for second and third arrays :','\n',a[1:,0:2,0:2])
First array, first row, first column value :
 1
First array last column :
 [2 4 6]
First two rows for second and third arrays :
 [[[ 7  8]
  [ 9 10]]
 [[13 14]
  [15 16]]]
In [34]:
print('Printing as a single array :','\n',a[1:,0:2,0:2].flatten())
Printing as a single array :
 [ 7  8  9 10 13 14 15 16]
An interesting way to slice your array is to use negative slicing. Negative slicing prints elements from the end rather than the beginning. Have a look below:
In [35]:
a=np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])
[ 5 10]
In [36]:
print(a[:,-1:-3:-1])
[[ 5  4]
 [10  9]]
Having said that, the basic logic of slicing remains the same, i.e. the end index is never included in the output. An interesting use of negative slicing is to reverse the original array.
In [37]:
a=np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array :','\n',a[::-1,::-1])
Original array :
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Reversed array :
 [[10  9  8  7  6]
 [ 5  4  3  2  1]]
In [38]:
a=np.array([[1,2,3,4,5],[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array vertically :','\n',np.flip(a,axis=1))
print('Reversed array horizontally :','\n',np.flip(a,axis=0))
Original array :
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Reversed array vertically :
 [[ 5  4  3  2  1]
 [10  9  8  7  6]]
Reversed array horizontally :
 [[ 6  7  8  9 10]
 [ 1  2  3  4  5]]
You can create a new array by combining existing arrays. This you can do in two ways:
•Either combine the arrays vertically (i.e. along the rows) using the vstack() method, thereby increasing the number of rows in the resulting array
•Or combine the arrays in a horizontal fashion (i.e. along the columns) using the hstack() method, thereby increasing the number of columns in the resultant array
In [39]:
a=np.arange(0,5)
b=np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
Array 1 :
 [0 1 2 3 4]
Array 2 :
 [5 6 7 8 9]
Vertical stacking :
 [[0 1 2 3 4]
 [5 6 7 8 9]]
Horizontal stacking :
 [0 1 2 3 4 5 6 7 8 9]
Another interesting way to combine arrays is using the dstack() method. It combines array elements index by index and stacks them along the depth axis:
In [40]:
a=[[1,2],[3,4]]
b=[[5,6],[7,8]]
c=np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
 [[[1 5]
  [2 6]]
 [[3 7]
  [4 8]]]
(2, 2, 2)
NumPy also has the concatenate() method, where the passed arrays are joined along an existing axis:
In [41]:
a=np.arange(0,5).reshape(1,5)
b=np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
Array 1 :
 [[0 1 2 3 4]]
Array 2 :
 [[5 6 7 8 9]]
Concatenate along rows :
 [[0 1 2 3 4]
 [5 6 7 8 9]]
Concatenate along columns :
 [[0 1 2 3 4 5 6 7 8 9]]
The drawback of this method is that the original arrays must have the axis along which you want to combine them. Otherwise, get ready to be greeted by an error.
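For instance, two 1-D arrays have no second axis to join along, so asking for axis=1 fails; a small sketch:

import numpy as np
a = np.arange(0,5)
b = np.arange(5,10)
print(np.concatenate((a,b), axis=0))   # works: [0 1 2 3 4 5 6 7 8 9]
# np.concatenate((a,b), axis=1)        # raises an axis-out-of-bounds error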
In [42]:
# append values to ndarray
a=np.array([[1,2],
            [3,4]])
np.append(a,[[5,6]],axis=0)
Out[42]:
array([[1, 2],
       [3, 4],
       [5, 6]])
Broadcasting is one of the best features of ndarrays. It lets you perform arithmetic operations between ndarrays of different sizes, or between an ndarray and a simple number!
Broadcasting essentially stretches the smaller ndarray so that it matches the shape of the larger ndarray:
In [43]:
a=np.arange(10,20,2)
b=np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)
print('Multiplying an ndarray and a number :',a*2)
Adding two different size arrays :
 [[12 14 16 18 20]
 [12 14 16 18 20]]
Multiplying an ndarray and a number : [20 24 28 32 36]
Broadcasting works when the two ndarrays are compatible, meaning:
1. Both ndarrays have the same dimensions, or
2. Either of the ndarrays has a dimension of 1. The one having a dimension of 1 is broadcast to meet the size requirements of the larger ndarray.
In case the arrays are not compatible, you will get a ValueError.
In [44]:
a=np.ones((3,3))
b=np.array([2])
a+b
Out[44]:
array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])
Experiment 6. Perform the following operations using Pandas
Pandas is one of the most popular and powerful data science libraries in Python. It can be considered the stepping stone for any aspiring data scientist who prefers to code in Python. Even though the library is easy to get started with, it can certainly do a wide variety of data manipulation. This makes Pandas one of the handiest data science libraries in the developer's community. Pandas basically allows the manipulation of large datasets and data frames. It can also be considered one of the most efficient statistical tools for mathematical computations on tabular data.
Today, we'll cover some of the most important and recurring operations that we perform in Pandas. Make no mistake, there are tons of implementations and prospects of Pandas. Here we'll try to cover some notable aspects only. We'll use the analogy of Euro Cup 2020 in this tutorial. We'll start off by creating our own minimal dataset.
In [45]:
import pandas as pd
# Create team data
data_england={'Name':['Kane','Sterling','Saka','Maguire'],'Age':[27,26,19,28]}
data_italy={'Name':['Immobile','Insigne','Chiellini','Chiesa'],'Age':[31,30,36,23]}
# Create Dataframe
df_england = pd.DataFrame(data_england)
df_italy = pd.DataFrame(data_italy)
print(df_england)
print(df_italy)
       Name  Age
0      Kane   27
1  Sterling   26
2      Saka   19
3   Maguire   28
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
6(b) concat()
Let’s start by concatenating our two data frames. The word “concatenate” means to “link together in series”. Now that we have created two data frames, let’s try and “concat” them.
We do this by implementing the concat() function.
In [46]:
frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams
Out[46]:
        Name  Age
0       Kane   27
1   Sterling   26
2       Saka   19
3    Maguire   28
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
You can achieve a similar result with append():
In [47]:
df_england.append(df_italy)
Out[47]:
        Name  Age
0       Kane   27
1   Sterling   26
2       Saka   19
3    Maguire   28
0   Immobile   31
1    Insigne   30
2  Chiellini   36
3     Chiesa   23
Now, imagine you wanted to label your original data frames with the associated countries of these players. You can do this by setting specific keys to your data frames.
In [48]:
pd.concat(frames,keys=["England","Italy"])
Out[48]:
                Name  Age
England 0       Kane   27
        1   Sterling   26
        2       Saka   19
        3    Maguire   28
Italy   0   Immobile   31
        1    Insigne   30
        2  Chiellini   36
        3     Chiesa   23
You can also filter rows based on a condition. For example, let's find the players aged 30 or above:
In [49]:
both_teams[both_teams["Age"]>=30]
Out[49]:
        Name  Age
0   Immobile   31
1    Insigne   30
2  Chiellini   36
Now, let’s try to do some string filtration. We want to filter those players whose names start with “S”. This implementation can be done by pandas’ startswith() function. Let’s try:
In [50]:
both_teams[both_teams["Name"].str.startswith('S')]
Out[50]:
       Name  Age
1  Sterling   26
2      Saka   19
Out[51]:
   Name  Age Associated Clubs
0  Kane   27        Tottenham
2  Saka   19          Arsenal
In [52]:
frames = [df_england, df_italy]
both_teams = pd.concat(frames)
both_teams
Out[52]:
        Name  Age Associated Clubs
0       Kane   27        Tottenham
2       Saka   19          Arsenal
0   Immobile   31              NaN
1    Insigne   30              NaN
2  Chiellini   36              NaN
3     Chiesa   23              NaN
Out[53]:
   Name  Age Associated Clubs
0  Kane   27        Tottenham
2  Saka   19          Arsenal
Let's sort the combined data frame by name:
In [54]:
both_teams.sort_values('Name')
Out[54]:
   Name  Age Associated Clubs
0  Kane   27        Tottenham
2  Saka   19          Arsenal
And now by age:
In [55]:
both_teams.sort_values('Age')
Out[55]:
   Name  Age Associated Clubs
2  Saka   19          Arsenal
0  Kane   27        Tottenham
In [56]:
both_teams.sort_values('Age',ascending=False)
Out[56]:
   Name  Age Associated Clubs
0  Kane   27        Tottenham
2  Saka   19          Arsenal
7(c) groupby()
Grouping is arguably the most important feature of Pandas. A groupby() function simply groups by a particular column. Let's see a simple example by creating a new data frame.
In [57]:
a={
 'UserID': ['U1001', 'U1002', 'U1001', 'U1001', 'U1003'],
 'Transaction': [500, 300, 200, 300, 700]
}
df_a = pd.DataFrame(a)
df_a
df_a.groupby('UserID').sum()
Out[57]:
        Transaction
UserID
U1001          1000
U1002           300
U1003           700
Notice, we have two columns, UserID and Transaction. You can also see a repeating UserID (U1001). Let's apply a groupby() function to it.
The function grouped the similar UserIDs and took the sum of those IDs.
If you want to unravel a particular UserID, just try mentioning the value name through get_group().
In [58]:
df_a.groupby('UserID').get_group('U1001')
Out[58]:
  UserID  Transaction
0  U1001          500
2  U1001          200
3  U1001          300
Experiment 8. Reading data from different file formats: a. Text files b. CSV files c. Excel files d. JSON files
I have recently come across a lot of aspiring data scientists wondering why it's so difficult to import different file formats in Python. Most of you might be familiar with the read_csv() function in Pandas, but things get tricky from there.
How to read a JSON file in Python? How about an image file? How about multiple files all at once? These are questions you should know the answer to, but might find difficult to grasp initially.
And mastering these file formats is critical to your success in the data science industry. You'll be working with all sorts of file formats collected from multiple data sources; that's the reality of the modern digital age we live in.
8(a) Reading Text files
In [ ]:
import pandas as pd
txtdata = pd.read_table('c:/experiment8/employee.txt')
txtdata
Out[59]:
    123234877,Michael,Rogers,140
0   152934485,Anand,Manikutty,14
1   222364883,Carol,Smith,37
2   326587417,Joe,Stevens,37
3   332154719,Mary-Anne,Foster,14
4   332569843,George,ODonnell,77
5   546523478,John,Doe,59
6   631231482,David,Smith,77
7   654873219,Zacary,Efron,59
8   745685214,Eric,Goldsmith,59
9   845657245,Elizabeth,Doe,14
10  845657246,Kumar,Swamy,14
8(b) Reading CSV files
Ah, the good old CSV format. A CSV (or Comma Separated Value) file is the most common type of file that a data scientist will ever work with. These files use a "," as a delimiter to separate the values, and each row in a CSV file is a data record.
These files are useful for transferring data from one application to another, which is probably the reason why they are so commonplace in the world of data science.
If you look at them in Notepad, you will notice that the values are separated by commas:
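For illustration, a file holding our earlier England squad would look something like this when opened in a text editor (a hypothetical example):

Name,Age
Kane,27
Sterling,26
Saka,19
Maguire,28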
In [60]:
import pandas as pd
csvdata = pd.read_csv('C:/experiment8/products.csv')
csvdata
Out[60]:
    productCode                            productName   productLine productScale              productVendor                                 productDescription ...
0      S10_1678  1969 Harley Davidson Ultimate Chopper   Motorcycles        01:10            Min Lin Diecast  This replica features working kickstand, front... ...
1      S10_1949               1952 Alpine Renault 1300  Classic Cars        01:10    Classic Metal Creations  Turnable front wheels; steering function; deta... ...
2      S10_2016                  1996 Moto Guzzi 1100i   Motorcycles        01:10   Highway 66 Mini Classics  Official Moto Guzzi logos and insignias, saddl... ...
3      S10_4698   2003 Harley-Davidson Eagle Drag Bike   Motorcycles        01:10          Red Start Diecast  Model features, official Harley Davidson logos... ...
4      S10_4757                    1972 Alfa Romeo GTA  Classic Cars        01:10    Motor City Art Classics  Features include: Turnable front wheels; steer... ...
..          ...                                    ...           ...          ...                        ...                                                ... ...
105   S700_3505                            The Titanic         Ships  0.527777778   Carousel DieCast Legends  Completed model measures 19 1/2 inches long, 9... ...
106   S700_3962                         The Queen Mary         Ships  0.527777778  Welly Diecast Productions  Exact replica. Wood and Metal. Many extras inc... ...
107   S700_4002              American Airlines: MD-11S        Planes  0.527777778        Second Gear Diecast  Polished finish. Exact replia with official lo... ...
108    S72_1253                       Boeing X-32A JSF        Planes  0.091666667    Motor City Art Classics  10" Wingspan with retractable landing gears.Co... ...
109    S72_3212                             Pont Yacht         Ships  0.091666667       Unimax Art Galleries  Measures 38 inches Long x 33 3/4 inches High. ... ...
110 rows × 9 columns
8(c) Reading Excel files
In [61]:
import pandas as pd
exceldata = pd.read_excel('c:/experiment8/EmployeeSalaries.xlsx')
exceldata
Out[61]:
EMPNO  ENAME  JOB  SAL  DOJ
8(d) Reading JSON files
JSON (JavaScript Object Notation) files are lightweight and human-readable, and are used to store and exchange data. It is easy for machines to parse and generate these files, and they are based on the JavaScript programming language.
JSON files store data within {}, similar to how a dictionary stores it in Python. But their major benefit is that they are language-independent, meaning they can be used with any programming language, be it Python, C or even Java!
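As a side note, if you want the JSON parsed into a Python dictionary rather than read as raw text, the standard json module can do that; a minimal sketch assuming the same file path as in the cell below:

import json
with open('c:/experiment8/names.json') as f:
    names = json.load(f)
print(names['BD'])   # Bangladesh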
In [62]:
import pandas as pd
jsondata = pd.read_table('c:/experiment8/names.json')
jsondata
Out[62]:
{"BD": "Bangladesh", "BE": "Belgium", "BF": "Burkina Faso", "BG": "Bulgaria", "BA": "Bosnia
andHerzegovina", "BB": "Barbados", "WF": "Wallis and Futuna", "BL": "Saint Barthelemy",
"BM":"Bermuda","BN":"Brunei","BO":"Bolivia","BH":"Bahrain","BI":"Burundi","BJ":"Benin","BT":
"Bhutan", "JM": "Jamaica", "BV": "Bouvet Island", "BW": "Botswana", "WS": "Samoa", "BQ":
"Bonaire, Saint Eustatius and Saba ", "BR": "Brazil", "BS": "Bahamas", "JE": "Jersey", "BY":
"Belarus","BZ":"Belize","RU": "Russia","RW":"Rwanda","RS": "Serbia","TL":"East Timor",
"RE":"Reunion","TM":"Turkmenistan","TJ":"Tajikistan","RO":"Romania","TK":"Tokelau","G
W": "Guinea-Bissau", "GU": "Guam", "GT": "Guatemala", "GS": "South Georgia and the
SouthSandwich Islands", "GR": "Greece", "GQ": "Equatorial Guinea", "GP": "Guadeloupe",
"JP":"Japan", "GY": "Guyana", "GG": "Guernsey", "GF": "French Guiana", "GE": "Georgia",
"GD":
"Grenada", "GB": "United Kingdom", "GA": "Gabon", "SV": "El Salvador", "GN": "Guinea", "GM":
"Gambia","GL":"Greenland","GI":"Gibraltar","GH":"Ghana","OM":"Oman","TN":"Tunisia",
"JO": "Jordan", "HR": "Croatia", "HT": "Haiti", "HU": "Hungary", "HK": "Hong Kong",
"HN":"Honduras","HM":"HeardIslandandMcDonaldIslands","VE":"Venezuela","PR":"PuertoRico","P
S": "Palestinian Territory", "PW": "Palau", "PT": "Portugal", "SJ": "Svalbard and Jan
Mayen","PY":"Paraguay","IQ":"Iraq", "PA":"Panama","PF":"French Polynesia","PG":"PapuaNew
Guinea", "PE": "Peru", "PK": "Pakistan", "PH": "Philippines", "PN": "Pitcairn", "PL":
"Poland","PM": "Saint Pierre and Miquelon", "ZM": "Zambia", "EH": "Western Sahara", "EE":
"Estonia","EG":"Egypt", "ZA": "South Africa","EC": "Ecuador", "IT": "Italy","VN": "Vietnam",
"SB":
"Solomon Islands", "ET": "Ethiopia", "SO": "Somalia", "ZW": "Zimbabwe", "SA": "Saudi Arabia",
"ES": "Spain", "ER": "Eritrea", "ME": "Montenegro", "MD": "Moldova", "MG": "Madagascar", "MF":
"Saint Martin", "MA": "Morocco", "MC": "Monaco", "UZ": "Uzbekistan", "MM": "Myanmar", "ML":
"Mali", "MO": "Macao", "MN": "Mongolia", "MH": "Marshall Islands", "MK": "Macedonia", "MU":
"Mauritius", "MT": "Malta", "MW": "Malawi", "MV": "Maldives", "MQ": "Martinique",
"MP":"Northern Mariana Islands", "MS": "Montserrat", "MR": "Mauritania", "IM": "Isle of Man",
"UG":"Uganda","TZ":"Tanzania","MY":"Malaysia","MX":"Mexico","IL":"Israel","FR":"France","IO
":"British Indian Ocean Territory", "SH": "Saint Helena", "FI": "Finland", "FJ": "Fiji", "FK":
"FalklandIslands","FM":"Micronesia","FO":"FaroeIslands","NI":"Nicaragua","NL":"Netherlands","NO
":
"Norway","NA":"Namibia","VU":"Vanuatu","NC":"NewCaledonia","NE":"Niger","NF":
"Norfolk Island", "NG": "Nigeria", "NZ": "New Zealand", "NP": "Nepal", "NR": "Nauru", "NU":
"Niue", "CK": "Cook Islands", "XK": "Kosovo", "CI": "Ivory Coast", "CH": "Switzerland", "CO":
"Colombia", "CN": "China", "CM": "Cameroon", "CL": "Chile", "CC": "Cocos Islands",
"CA":"Canada","CG":"RepublicoftheCongo","CF":"CentralAfricanRepublic","CD":"Democratic
Republic of the Congo", "CZ": "Czech Republic", "CY": "Cyprus", "CX": "Christmas Island",
"CR":"CostaRica","CW":"Curacao","CV":"CapeVerde","CU":"Cuba","SZ":"Swaziland","SY":
"Syria", "SX": "Sint Maarten", "KG": "Kyrgyzstan", "KE": "Kenya", "SS": "South Sudan", "SR":
"Suriname", "KI": "Kiribati", "KH": "Cambodia", "KN": "Saint Kitts and Nevis", "KM": "Comoros",
"ST":"SaoTomeandPrincipe","SK":"Slovakia","KR":"SouthKorea","SI":"Slovenia","KP":
"North Korea", "KW": "Kuwait", "SN": "Senegal", "SM": "San Marino", "SL": "Sierra Leone", "SC":
"Seychelles", "KZ": "Kazakhstan", "KY": "Cayman Islands", "SG": "Singapore", "SE": "Sweden",
"SD": "Sudan", "DO": "Dominican Republic", "DM": "Dominica", "DJ": "Djibouti", "DK":
"Denmark","VG":"BritishVirginIslands","DE":"Germany","YE":"Yemen","DZ":"Algeria","US":"Unite
d States", "UY": "Uruguay", "YT": "Mayotte", "UM": "United States Minor Outlying
Islands","LB":"Lebanon","LC":"SaintLucia","LA":"Laos","TV":"Tuvalu","TW":"Taiwan","TT":
"TrinidadandTobago","TR":"Turkey","LK":"SriLanka","LI":"Liechtenstein","LV":"Latvia",
"TO":"Tonga","LT":"Lithuania","LU":"Luxembourg","LR":"Liberia","LS":"Lesotho","TH":
"Thailand","TF":"FrenchSouthernTerritories","TG":"Togo","TD":"Chad","TC":"TurksandCaicos
Islands", "LY": "Libya", "VA": "Vatican", "VC": "Saint Vincent and the Grenadines", "AE":"United
Arab Emirates", "AD": "Andorra", "AG": "Antigua and Barbuda", "AF": "Afghanistan",
"AI":"Anguilla","VI": "U.S. Virgin Islands", "IS": "Iceland","IR": "Iran", "AM": "Armenia", "AL":
"Albania", "AO": "Angola", "AQ": "Antarctica", "AS": "American Samoa", "AR": "Argentina", "AU":
"Australia","AT":"Austria","AW":"Aruba","IN":"India","AX":"AlandIslands","AZ":
"Azerbaijan", "IE": "Ireland", "ID": "Indonesia", "UA": "Ukraine", "QA": "Qatar", "MZ":
"Mozambique"}
Experiment 9. a. Pickle files b. Image files using PIL c. Multiple files using Glob d. Importing data from database
#!/usr/bin/env python3
# How to use the Python pickle module to store arbitrary python data as a file.
# Cleaned up from the demo at: https://pythontips.com/2013/08/02/what-is-pickle-in-python/
import pickle

a = ['test value', 'test value 2', 'test value 3']
# Store (pickle) the list to a file
with open('data.pickle', 'wb') as f:
    pickle.dump(a, f)
# Load the pickled data back
with open('data.pickle', 'rb') as f:
    b = pickle.load(f)

# Now we can compare the original data vs the loaded pickle data
print(b)        # ['test value', 'test value 2', 'test value 3']
print(a == b)   # True
# And it's equivalent!
# That's storing and loading data via pickle!
The advent of Convolutional Neural Networks (CNNs) has opened the flood gates to working in the computer vision domain and solving problems like object detection, object classification, generating new images and what not!
But before you jump on to working with these problems, you need to know how to open your images in Python. Let's see how we can do that by retrieving images from the web page that we stored in our local folder.
You will need the Python PIL (Python Image Library) for this job.
Simply call the open() function in the Image module of PIL and pass in the path to your image:
In [64]:
from PIL import Image
Image.open('c:/experiment9/college_pic.jpeg')
Out[64]:
(the opened image is displayed here)
And now, what if you want to read multiple files in one go? That's quite a common challenge in data science projects.
Python's Glob module lets you traverse through multiple files in the same location. Using glob.glob(), we can import all the files from our local folder that match a special pattern.
These filename patterns can be made using different wildcards like "*" (for matching multiple characters), "?" (for matching any single character), or '[0-9]' (for matching any number). Let's see glob in action below.
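For instance, a hedged sketch of the three wildcard styles (the folder and file names are assumptions):

import glob
print(glob.glob('c:/experiment9/glob_pics/*.jpg'))        # '*' matches any run of characters
print(glob.glob('c:/experiment9/glob_pics/pic?.jpg'))     # '?' matches exactly one character
print(glob.glob('c:/experiment9/glob_pics/pic[0-9].jpg')) # '[0-9]' matches a single digit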
When importing multiple files from the same directory as your Python script, we can use the "*" wildcard:
In [65]:
import cv2
import glob
import matplotlib.pyplot as plt
for img in glob.glob('c:/experiment9/glob_pics/*.*'):
    cv_img = cv2.imread(img)
    plt.imshow(cv_img)
    plt.show()
When you are working on a real-world project, you will often need to connect your program to a database to retrieve data. There is no way around it (that's why learning SQL is an important part of your data science journey). Data in databases is stored in the form of tables, and these systems are known as Relational Database Management Systems (RDBMS). However, connecting to an RDBMS and retrieving the data from it can prove to be quite a challenging task. Here's the good news: we can easily do this using Python's built-in modules! One of the most popular RDBMSs is SQLite. It has many plus points:
There are many more reasons for its popularity. But for now, let's connect with an SQLite database and retrieve our data!
1. Create a connection with the database using connect(). You need to pass the name of your database to access it. It returns a Connection object.
2. Once you have done that, you need to create a cursor object using the cursor() function. This will allow you to implement SQL commands with which you can manipulate your data.
3. You can execute the commands in SQL by calling the execute() function on the cursor object. Since we are retrieving data from the database, we will use the SELECT statement and store the query result in an object.
4. Store the data from the object into a dataframe by calling either fetchone(), for one row, or fetchall(), for all the rows, on the object.
And just like that, you have retrieved the data from the database into a Pandas dataframe!
A good practice is to save/commit your transactions using the commit() function, even if you are only reading the data. The sketch below puts all four steps together.
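A minimal sketch of the four steps (the database file name is an assumption):

import sqlite3
import pandas as pd

con = sqlite3.connect('mydatabase.db')       # step 1: Connection object (file name assumed)
cur = con.cursor()                           # step 2: cursor object
rs = cur.execute('select * from playlists')  # step 3: execute the SELECT query
df = pd.DataFrame(rs.fetchall())             # step 4: rows into a dataframe
con.commit()                                 # good practice: commit the transaction
print(df.head())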
In [66]:
import pandas as pd
import sqlite3
# con and cur as in the sketch above (the database file name is an assumption)
con = sqlite3.connect('c:/experiment9/chinook.db')
cur = con.cursor()
# Perform query: rs
rs = cur.execute('select * from playlists')
# Store the rows in a dataframe
df = pd.DataFrame(rs.fetchall())
# Save/commit the transaction
con.commit()
df
Out[66]:
     0                1
0    1            Music
1    2           Movies
2    3         TV Shows
3    4       Audiobooks
4    5       90's Music
5    6       Audiobooks
6    7           Movies
7    8            Music
8    9     Music Videos
9   10         TV Shows
10  11  Brazilian Music
11  12        Classical
15  16           Grunge
17  18      On-The-Go 1
Web scraping refers to extracting large amounts of data from the web. This is important for a data scientist who has to analyze large amounts of data. Python provides a very handy module called requests to retrieve data from any website. The requests.get() function takes in a URL as its parameter and returns the HTML response as its output. The way it works is demonstrated in the example below.
For this example, I want to show you a bit about my city, Delhi. So, I will retrieve data from the Wikipedia page on Delhi:
In [67]:
import requests
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27f
url = "https://en.wikipedia.org/wiki/Delhi"
# response object
resp = requests.get(url)
# using text attribute of the response object, return the HTML of webpage as string
text = resp.text
print(text)
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Delhi - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!
1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":
["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","Februar
y","March","April","May","June","July","August","September","October","November","December"]
,"wgRequestId":"8a49cc42-95b2-4c25-a1a1-8805d16417bd","wgCSPNonce":!
1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!
1,"wgNamespaceNumber":0,"wgPageName":"Delhi","wgTitle":"Delhi","wgCurRevisionId":105347
4153,"wgRevisionId":1053474153,"wgArticleId":37756,"wgIsArticl
e":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroup
s":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","CS1errors: missing
periodical","CS1 maint: archived copy as title","Webarchive template wayback links","CS1 maint:
numeric names: authors list","Articles with short description","Short description is different from
Wikidata","Wikipediaindefinitelysemi-protectedpages","UseIndianEnglishfromOct
But as you can see, the data is not very readable. The tree-like structure of the HTML content retrieved by our request is not very comprehensible. To improve this readability, Python has another wonderful library called BeautifulSoup.
BeautifulSoup is a Python library for parsing the tree-like structure of HTML and extracting data from the HTML document.
Right, let's see the wonder of BeautifulSoup.
In [68]:
import requests
from bs4 import BeautifulSoup

# url
# url = "https://weather.com/en-IN/weather/tenday/l/aff9460b9160c73ff01769fd83ae82cf37cb27f
url = "https://en.wikipedia.org/wiki/Delhi"
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Create a BeautifulSoup object from the HTML and pretty-print it
# (these two lines did not survive intact and are reconstructed)
soup = BeautifulSoup(r.text)
print(soup.prettify())
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
Delhi - Wikipedia
</title>
<script>
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!
1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":
["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","Februar
y","March","April","May","June","July","August","September","October","November","December"]
,"wgRequestId":"8a49cc42-95b2-4c25-a1a1-8805d16417bd","wgCSPNonce":!
1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!
1,"wgNamespaceNumber":0,"wgPageName":"Delhi","wgTitle":"Delhi","wgCurRevisionId":105347
4153,"wgRevisionId":1053474153,"wgArticleId":37756,"wgIsArticl
e":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroup
s":["*"],"wgCategories":["Pages with non-numeric formatnum
arguments","CS1errors:missingperiodical","CS1maint:archivedcopyastitle","Webarchiv
Experiment 11. Perform following preprocessing techniques on loan prediction dataset
In simple words, pre-processing refers to the transformations applied to your data before feeding it to the algorithm.
In Python, the scikit-learn library has pre-built functionality under sklearn.preprocessing. There are many more options for pre-processing, which we'll explore.
In [69]:
# Importing pandas
import pandas as pd
# Importing training data set
X_train = pd.read_csv('C:\loan_prediction-1\X_train.csv')
Y_train = pd.read_csv('C:\loan_prediction-1\Y_train.csv')
# Importing testing data set
X_test = pd.read_csv('C:\loan_prediction-1\X_test.csv')
Y_test = pd.read_csv('C:\loan_prediction-1\Y_test.csv')
In [70]:
print(X_train.head())
   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             4950                0.0         125               360
1             2882             1843.0         123               480
2             3000             3416.0          56               180
3             9703                0.0         112               360
4             2333             2417.0         136               360
Credit_History Property_Area
0 1 Urban
1 1 Semiurban
2 1 Semiurban
3 1 Urban
4 1 Urban
11(a) Feature Scaling
Feature scaling is the method to limit the range of variables so that they can be compared on common grounds. It is performed on continuous variables. Let's plot the distribution of all the continuous variables in the data set.
In [71]:
import matplotlib.pyplot as plt
X_train[X_train.dtypes[(X_train.dtypes=="float64")|(X_train.dtypes=="int64")]
        .index.values].hist(figsize=[11,11])
c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\pandas\plotting\_matplotlib\
tools.py:331: MatplotlibDeprecationWarning:
The is_first_col function was deprecated in Matplotlib 3.4 and will be removed two minor releases later.
Use ax.get_subplotspec().is_first_col() instead.
  if ax.is_first_col():
Out[71]:
array([[<AxesSubplot:title={'center':'ApplicantIncome'}>,
<AxesSubplot:title={'center':'CoapplicantIncome'}>],
[<AxesSubplot:title={'center':'LoanAmount'}>,
<AxesSubplot:title={'center':'Loan_Amount_Term'}>],
[<AxesSubplot:title={'center':'Credit_History'}>, <AxesSubplot:>]],dtype=object)
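Min-max scaling maps every value into the [0, 1] range via x' = (x - min) / (max - min). As a quick hand computation, a sketch using the five LoanAmount values shown in X_train.head() above:

import numpy as np
x = np.array([125., 123., 56., 112., 136.])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # [0.8625 0.8375 0.     0.7    1.    ]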
In [72]:
# Importing MinMaxScaler and initializing it (this cell's code was lost;
# reconstructed so that X_train_minmax / X_test_minmax exist below)
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
# Scaling down both train and test data sets
X_train_minmax = min_max.fit_transform(X_train[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
X_test_minmax = min_max.transform(X_test[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
In [73]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
# Fitting k-NN on our scaled data set (the classifier instantiation was lost; k=5 is an assumption)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_minmax, Y_train)
# Checking the model's accuracy
accuracy_score(Y_test, knn.predict(X_test_minmax))
<ipython-input-73-19507733bb8b>:5: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  knn.fit(X_train_minmax,Y_train)
Out[73]:
0.75
11(b) Feature Standardization
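Standardization rescales each feature to zero mean and unit variance: z = (x - mean) / std. A minimal sketch of what sklearn's scale() computes under the hood:

import numpy as np
x = np.array([1., 2., 3., 4.])
z = (x - x.mean()) / x.std()
print(z)                    # [-1.34164079 -0.4472136   0.4472136   1.34164079]
print(z.mean(), z.std())    # approximately 0.0 and 1.0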
In [74]:
from sklearn.preprocessing import scale
from sklearn.metrics import accuracy_score
X_train_scale = scale(X_train[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
X_test_scale = scale(X_test[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History']])
# Fitting logistic regression on our standardized data set
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(penalty='l2', C=.01)
log.fit(X_train_scale, Y_train)
# Checking the model's accuracy
accuracy_score(Y_test, log.predict(X_test_scale))
c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change
the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Out[74]:
0.75
11(c) Label Encoding
In the previous sections, we did the pre-processing for continuous numeric features. But our data set has other features too, such as Gender, Married, Dependents, Self_Employed and Education. All these categorical features have string values. For example, Gender has two levels, either Male or Female. Let's feed these features into our logistic regression model.
In [75]:
# Importing LabelEncoder and initializing it
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Iterating over all the common columns in train and test
for col in X_test.columns.values:
    # Encoding only categorical variables
    if X_test[col].dtypes=='object':
        # Using whole data to form an exhaustive list of levels
        data = X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
X_train.head()
c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Out[76]:
0.7395833333333334
11(d) One Hot Encoding
In [77]:
  return f(**kwargs)
Out[77]:
0.7395833333333334
In [79]:
c:\users\91770\appdata\local\programs\python\python38\lib\site-packages\sklearn\utils\validation.py:72: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(**kwargs)
Out[79]:
0.75
In [81]:
df_meal = pd.read_csv('C:\\experiment12\\
meal_info.csv')df_meal.head()
Out[81]:
In [82]:
df_food = pd.read_csv('C:\\experiment12\\train.csv')
df_food.head()
Out[82]:
   id  week  center_id  meal_id  checkout_price  base_price  emailer_for_promotion  hom
In [84]:
# df_center is assumed to have been read from a center info CSV in an earlier cell
df = pd.merge(df_food, df_center, on='center_id')
df = pd.merge(df, df_meal, on='meal_id')
12(a) Bar Graph
In [85]:
import numpy as np
table = pd.pivot_table(data=df, index='category', values='num_orders', aggfunc=np.sum)
table
Out[85]:
           num_orders
category
Beverages    40480525
Biryani        631848
Desert        1940754
Extras       3984979
Fish           871959
Pasta         1637744
Pizza         7383720
Salad        10944336
Sandwich     17636782
Seafood       2715714
Soup          1039646
Starters      4649122
In [86]:
#bar graph
plt.bar(table.index, table['num_orders'])
#xticks
plt.xticks(rotation=70)
#x-axis labels
plt.xlabel('Food item')
#y-axis labels
plt.ylabel('Quantity sold')
#plot title
plt.title('Most popular food')
#save plot
# plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_6.png',dpi
#display plot
plt.show()
In [87]:
# meals sold per category, normalised by the number of meals in that category
item_count = {}
for i in range(table.index.nunique()):
    # the end of this line was truncated in the original; completion assumed
    item_count[table.index[i]] = table.num_orders[i]/df_meal[df_meal['category']==table.index[i]].shape[0]

#bar plot
plt.bar([x for x in item_count.keys()], [x for x in item_count.values()], color='orange')
#adjust xticks
plt.xticks(rotation=70)
#label x-axis
plt.xlabel('Food item')
#label y-axis
plt.ylabel('No. of meals')
#save plot
# plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_7.png',dpi
#display plot
plt.show();
12(b) Pie Chart
In [88]:
import matplotlib.pyplot as plt
# Data to plot
labels = 'Python', 'C++', 'Ruby', 'Java'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0)  # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
12(c) Box Plot
In [90]:
# c_price (prices per cuisine) is assumed to have been built in an earlier cell
#plotting boxplot
plt.boxplot([x for x in c_price.values()], labels=[x for x in c_price.keys()])
#x and y-axis labels
plt.xlabel('Cuisine')
plt.ylabel('Price')
#plot title
plt.title('Analysing cuisine price')
12(d) Histogram
In [91]:
#plotting histogram
plt.hist(df['base_price'], rwidth=0.9, alpha=0.3, color='blue', bins=15, edgecolor='red')
#plot title
plt.title('Inspecting price effect')
In [92]:
# month/week revenue lists (initialisations reconstructed; df is assumed to
# already carry 'month' and 'revenue' columns built in earlier cells)
month = []
month_order = []
week = []
week_order = []

for i in range(max(df['month'])):
    month.append(i)
    month_order.append(df[df['month']==i].revenue.sum())

for i in range(max(df['week'])):
    week.append(i)
    week_order.append(df[df['week']==i].revenue.sum())

#subplots returns a Figure and an Axes object
fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(20,5))

center_type_name = ['TYPE_A','TYPE_B','TYPE_C']

#subplots (op_table, c_type and center_table are assumed to come from earlier cells)
fig,ax = plt.subplots(nrows=3,ncols=1,figsize=(8,12))

#scatter plots
ax[0].scatter(op_table.index,op_table['num_orders'],color='pink')
ax[0].set_xlabel('Operation area')
ax[0].set_ylabel('Number of orders')
ax[0].set_title('Does operation area affect num of orders?')
ax[0].annotate('optimum operation area of 4 km^2',xy=(4.2,1.1*10**7),xytext=(7,1.1*10**7),
               arrowprops=dict(facecolor='black'))  # arrowprops truncated in the original; completion assumed

#boxplot
ax[1].boxplot([x for x in c_type.values()],labels=[x for x in c_type.keys()])
ax[1].set_xlabel('Center type')
ax[1].set_ylabel('Operation area')
ax[1].set_title('Which center type had the optimum operation area?')

#bar graph
ax[2].bar(center_table.index,center_table['num_orders'],alpha=0.7,color='orange',width=0.5)
ax[2].set_xlabel('Center type')
ax[2].set_ylabel('Number of orders')
ax[2].set_title('Orders per center type')

#show figure
plt.tight_layout()
#plt.savefig('C:\\Users\\Dell\\Desktop\\AV Plotting images\\matplotlib_plotting_12.png',dpi
plt.show();