Grokking Data Science
As a data scientist or machine learning engineer, you can use a quick implementation in Python to validate your ideas about some complex mathy concepts in a fast and hassle-free manner. It's easily understandable for you and others.
As a Data Scientist, your life revolves around data. Outside the playground, you stumble upon reality. Data in real life is oftentimes raw, unstructured, incomplete, and large. Python comes with the promise of knowing how to handle these issues. But how does it do that? What is so special about Python?
All hail the mighty packages! What's so special about Python are the great open source code repositories that are continuously being updated. These open source contributions give Python its superpowers and an edge over other languages. The best thing about using these packages is that they have a minimal learning curve. Once you have a basic understanding of Python, you can very easily import, use, and benefit from all of the packages out there without having to understand everything that is going on under the hood. Last but not least, these packages are completely free to use as well!
Since we have our data scientist hat on, let's talk about data. If you still have doubts, I'll let this survey from IBM convince you why we should learn data science in Python and not in R or any other language.
[Chart (Oct 27, 2016): percentage of matching job postings. Python and ("machine learning" or "data science"): 0.180%; R and ("machine learning" or "data science"): 0.081%]
The Jupyter Notebook is an incredibly powerful and sleek tool for developing and presenting data science projects. It can integrate code and its output into a single document, combining visualizations, narrative text, mathematical equations, and other rich media. It's simply awesome.
As if all these features weren't enough, Jupyter Notebook can handle many other languages, like R, as well. Its intuitive workflows, ease of use, and zero cost have made it THE tool at the heart of any data science project.
Essentially, Jupyter is a great interface to the Python language and a must-have for all data science projects.
The easiest and most straightforward way to get started with Jupyter Notebooks is by installing Anaconda. Anaconda is the most widely used Python distribution for data science. It comes pre-loaded with the most popular libraries and tools (e.g., NumPy, Pandas, and Matplotlib). What this means is that you can immediately get to real work, skipping the pain of managing tons of installations, dependencies, and OS-specific installation issues.
2. Getting Anaconda
Note: There are other ways of running the Jupyter Notebook as well, e.g., via pip and Docker, but we are going to keep it sweet and simple. We will stick with the most hassle-free approach.
jupyter notebook
A Jupyter server is now running in your terminal, listening on port 8888, and it will automatically launch the application in your default web browser at http://localhost:8888. If that doesn't happen automatically, you can use the URL to launch it yourself. You should see your workspace directory, like in the screenshot below:
[Screenshot: the Jupyter file browser listing the folders of the workspace directory]
You can create a new Python notebook by clicking on the New [1] button (screenshot below) and selecting the appropriate Python version (Python 3) [2].
[Screenshot: the New menu, with options such as Text File, Folder, Terminal, Python 2, and Python 3]
A notebook contains a list of cells. Each cell can contain executable code or formatted text (Markdown). Right now, the notebook contains only one empty code cell. Try typing print("Hello Data Science!") in the cell [2], then click on the run button [3] (or press Shift-Enter). Hitting the run button sends the current cell to this notebook's Python kernel, which runs it and returns the output. The result is displayed below the code cell:
[Screenshot: a notebook cell containing print("Hello Data Science!") with the output "Hello Data Science!" displayed below it]
Congratulations! Now you have your first Jupyter Notebook up and running. You will need it for the IMDB and end-to-end ML projects that are discussed later in this course.
Bonus Tip
When working with other people on a data science project, it is a good practice to:
As we learned in the previous lesson, one of the greatest assets of Python is its extensive set of libraries. These are what make the life of a data scientist easy: the start of the love affair between Python and data scientists!
[xkcd comic "Python": a stick figure learns Python overnight and flies after typing "import antigravity". Image credit: https://xkcd.com]
NumPy
* NumPy (Numerical Python) is a powerful, and extensively used, library for storage and calculations. It is designed for dealing with numerical data. It allows data storage and calculations by providing data structures, algorithms, and other useful utilities. For example, this library contains basic linear algebra functions, Fourier transforms, and advanced random number capabilities. It can also be used to load data into Python and export it from it.
Pandas
* Pandas is a library that you can't avoid when working with Python on a data science project. It is a powerful tool for data wrangling, a process required to prepare your data so that it can actually be consumed for analysis and model building. Pandas contains a large variety of functions for data import, export, indexing, and data manipulation. It also provides handy data structures like DataFrames (tables of rows and columns) and Series (1-dimensional arrays), and efficient methods for handling them. For example, it allows us to reshape, merge, split, and aggregate data.
Scikit-Learn
* Scikit-Learn is an easy-to-use library for machine learning. It comes with a variety of efficient tools for machine learning and statistical modeling: it provides classification models (e.g., Support Vector Machines, Random Forests, Decision Trees), regression analysis (e.g., Linear Regression, Ridge Regression, Logistic Regression), clustering methods (e.g., k-means), data reduction methods (e.g., Principal Component Analysis, feature selection), and model tuning and selection with features like grid search and cross-validation. It also allows for pre-processing of data. If these terms sound foreign to you right now, don't worry, we will get back to all of this in detail in the section on machine learning.
Matplotlib
* Matplotlib is widely used for data visualization, e.g., for plotting histograms, line plots, and heat maps.
[Example Matplotlib output: "Histogram of IQ: μ=100, σ=15"]
Seaborn
* Seaborn is another great library for creating attractive and information-rich graphics. Its goal is to make data exploration and understanding easier, and it does it very well. Seaborn is built on top of Matplotlib.
Note: Learning to use Python well means using a lot of libraries and functions, which can be intimidating. But no need to panic: you don't have to remember them all by heart! Learning how to Google these things efficiently is among the top skills of a good data scientist!
Learning NumPy - An Introduction
* Why NumPy
* Lessons Overview
Why NumPy
Data comes in all shapes and sizes. We can have image data, audio data, text data, numerical data, etc. We have all these heterogeneous sources of data, but computers understand only 0's and 1's. At its core, data can be thought of as arrays of numbers. In fact, the prerequisite for performing any data analysis is to convert the data into numerical form. This means it is important to be able to store and manipulate arrays efficiently, and this is where Python's NumPy package comes into the picture.
Now, you might be questioning, "When I can use Python's built-in lists to do all sorts of computations and manipulations through list comprehensions, for-loops, etc., why should I bother with NumPy arrays?" You are right in thinking so because, in some aspects, NumPy arrays are like Python's lists. Their advantage is that they provide more efficient storage and data operations as the arrays grow larger in size. This is the reason NumPy arrays are at the core of nearly all data science tools in Python. This, in turn, implies that it is essential to know NumPy well!
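To get a feel for this difference, here is a small sketch (not part of the original lesson) that times an element-wise multiplication on a plain Python list and on a NumPy array; the exact numbers vary by machine, but the array version is typically much faster for large inputs:

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n)

start = time.perf_counter()
doubled_list = [value * 2 for value in py_list]   # Python loops over every element
list_time = time.perf_counter() - start

start = time.perf_counter()
doubled_array = np_array * 2                      # a single vectorized operation
array_time = time.perf_counter() - start

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {array_time:.4f} s")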
Lessons Overview
In this Learning NumPy series, we will start by understanding the basics of array manipulations in NumPy. We will then proceed to learn about computations, comparisons, and other more advanced tricks.
It is also important to recall and apply whatever we learn by practicing, so we will end with some exercises to hit refresh on the concepts learned. The reason we have the exercises at the end of the entire course rather than at the end of each lesson is that recalling information after some time is a better way of learning.
Enough talking! Without further ado, let's dive into the world of NumPy.
NumPy Basics - Creating NumPy Arrays and Array Attributes
* 1. Creating Arrays
  * a. Arrays From Lists
  * b. Arrays From Scratch
* 2. Array Attributes
1. Creating Arrays
There are two ways to create arrays in NumPy: from Python lists, or from scratch using routines built into NumPy.
The first step when working with packages is to define the right "imports". We can import NumPy like so:
import numpy as np
There are multiple ways to create arrays. Let's start by creating an array from a list using the np.array function:

np.array([1, 2, 3, 4])
[Diagram: a 1D array has a single axis (axis 0); a 2D array has rows and columns (axis 0 and axis 1); a 3D array adds a third axis (axis 2)]
Run the code in the widget below and inspect the output values. In particular, observe the type of the created array from the result of the print statement.
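The code widget referenced here did not survive extraction cleanly; a minimal sketch of what it runs, assuming the same list as in the np.array example above:

import numpy as np

# Create an array from a Python list and inspect its type
arr = np.array([1, 2, 3, 4])
print(arr)
print(type(arr))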
[1 2 3 4]
<class 'numpy.ndarray'>
If we want to explicitly set the data type of the resulting array, we can use the dtype keyword. Some of the most commonly used NumPy dtypes are: 'float', 'int', 'bool', 'str', and 'object'. Say we want to create an array of floats; we can define it like so:
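This widget was also garbled in extraction; a minimal sketch of it, assuming the same input list as before:

import numpy as np

# Explicitly set the data type using the dtype keyword
arr_float = np.array([1, 2, 3, 4], dtype='float')
print(type(arr_float))
print(arr_float)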
<class 'numpy.ndarray'>
[1. 2. 3. 4.]
In the examples above, we have seen one-dimensional arrays. We can also define two- and three-dimensional arrays.
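A minimal sketch of a two-dimensional array definition, matching the output shown below (the three-dimensional case works the same way, with one more level of nesting):

import numpy as np

# A 2D array is created from a nested (list-of-lists) structure
arr_2d = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(arr_2d)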
[[0 1 2]
 [3 4 5]
 [6 7 8]]
A key difference between an array and a list is that arrays allow you to perform vectorized operations, while a list is not designed to handle vector operations. A vector operation means a function gets applied to every item in the array.
Say we have a list and we want to multiply each item by 2. We cannot do element-wise operations by simply saying "my_list * 2". However, we can do so on a NumPy array. Let's see some code examples:
import numpy as np

arr1 = np.array([1, 2, 3, 4])
print(arr1)

# Vector (element-wise) operations
print(arr1 * 2)
print(arr1 + 2)
print(arr1 * arr1)
[1 2 3 4]
[2 4 6 8]
[3 4 5 6]
[ 1  4  9 16]
* Array size cannot be changed after creation; you will have to create a new array or overwrite the existing one to change its size.
* Unlike lists, all items in the array must be of the same dtype (see the short sketch below).
* An equivalent NumPy array occupies much less space than a Python list of lists.
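As a quick illustration of the single-dtype rule, here is a small sketch (not from the original lesson) showing how NumPy upcasts mixed input to one common type:

import numpy as np

# Mixing ints and floats: everything is upcast to a single dtype (float64)
mixed = np.array([1, 2.5, 3])
print(mixed)        # [1.  2.5 3. ]
print(mixed.dtype)  # float64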
Now, instead of using lists as a starting point, let's learn to create arrays from scratch. For large arrays, it is more efficient to create them using routines already built into NumPy. Here are several examples:
import numpy as np

# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

# Create a 3x3 floating-point array filled with 1s
np.ones((3, 3), dtype=float)

# Create an array filled with a linear sequence
# starting at 0, ending at 20, stepping by 3
# (this is similar to the built-in range() function)
np.arange(0, 20, 3)

# Create an array of a hundred values evenly spaced between 0 and 1
np.linspace(0, 1, 100)

# Create a 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3, 3))

# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

np.random.randint(10, size=6)          # One-dimensional array of random integers
np.random.randint(10, size=(3, 3))     # Two-dimensional array of random integers
np.random.randint(10, size=(3, 3, 3))  # Three-dimensional array of random integers
2. Array Attributes
Each array has the following attributes: ndim (the number of dimensions), shape (the size of each dimension), size (the total number of elements), dtype (the data type of the elements), itemsize (the size in bytes of each element), and nbytes (the total size of the array in bytes).
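The code for this widget was garbled in extraction; here is a minimal sketch consistent with the output shown below (a 3x3 integer array has 9 elements, and on most 64-bit platforms the dtype is int64 with an itemsize of 8 bytes, giving nbytes = 72):

import numpy as np

x = np.random.randint(10, size=(3, 3))   # a 3x3 array of random integers

print("dtype:", x.dtype)
print("itemsize:", x.itemsize, "bytes")
print("nbytes:", x.nbytes, "bytes")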
dtype: int64
itemsize: 8 bytes
nbytes: 72 bytes
NumPy Basics - Array Indexing and Slicing
Observe the inputs and outputs (indicated by the "#>" start marker) in the examples given below.
Note: When running the code, if you want to view the output in the console, you can add print statements if they are not already there, like at the end of this first code widget. I have omitted them to remove noise from the code so that you can fully focus on the important bits.
# Input array
import numpy as np
x1 = np.array([1, 3, 4, 4, 6, 4])

# Access the first value of x1
x1[0]
#> 1

# Access the third value of x1
x1[2]
#> 4

# To view the output, you can add print statements
print(x1[0])

# Get the last value of x1
x1[-1]
#> 4

# Get the second last value of x1
x1[-2]
#> 6
If we have a multidimensional array and want to access items based on both column and row, we can pass the row and column indices at the same time using a comma-separated tuple, as shown in the examples below.
# In a multidimensional array, we need to specify the row and column index. Given input array x2:
x2 = np.array([[3, 2, 5, 5], [0, 1, 5, 8], [3, 0, 5, 0]])
x2
#> array([[3, 2, 5, 5],
#>        [0, 1, 5, 8],
#>        [3, 0, 5, 0]])

# Value in 3rd row and 4th column of x2
x2[2, 3]
#> 0

# 3rd row and last value from the 3rd column of x2
x2[2, -1]
#> 0

# Replace value in 1st row and 1st column of x2 with 1
x2[0, 0] = 1
x2
#> array([[1, 2, 5, 5],
#>        [0, 1, 5, 8],
#>        [3, 0, 5, 0]])
4. Array Slicing
Slicing an array is a way to access subarrays, i.e., accessing multiple elements or a range of elements from an array instead of individual items. In other words, when you slice arrays you get and set smaller subsets of items within larger arrays.
Again, we need to use square brackets to access individual elements. But this time, we also need the slice notation, ":", to access a slice or a range of elements of a given array, x:
x[start:stop:step]
If we do not specify anything for start, stop, or step, NumPy uses the default values for these parameters: start=0, stop=size of dimension, and step=1.
Carefully go through all of the examples given below, and observe the output values for the different combinations of slices. As an exercise, play with the indices and observe the outputs.
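The first part of this code widget did not survive extraction; a minimal sketch of the kind of basic slices it walks through, using x1 = np.arange(10), which also matches the surviving fragment and its outputs below:

import numpy as np

x1 = np.arange(10)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x1[:5]     # first five elements
#> array([0, 1, 2, 3, 4])

x1[5:]     # elements from index 5 onwards
#> array([5, 6, 7, 8, 9])

x1[4:7]    # a middle subarray
#> array([4, 5, 6])

x1[::2]    # every other element
#> array([0, 2, 4, 6, 8])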
# Return elements from the 1st position, stepping by 2 (every other element starting at index 1)
x1[1::2]
#> array([1, 3, 5, 7, 9])
What do you think would happen if we specify a negative step value? In this case, the defaults for start and stop are swapped, which is a handy way to easily reverse an array!
x1[::-1]    # all elements, reversed
#> array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

# Reverse every other element starting from index 5
x1[5::-2]
#> array([5, 3, 1])
We can use this same approach with multi-dimensional slices. We can define multiple slices separated by commas:
x2 = np.arange(9).reshape((3, 3))
x2
#> array([[0, 1, 2],
#>        [3, 4, 5],
#>        [6, 7, 8]])

x2[:2, :2]   # Extract the first two rows and two columns
#> array([[0, 1],
#>        [3, 4]])

x2[:3, ::2]  # all rows, every other column
#> array([[0, 2],
#>        [3, 5],
#>        [6, 8]])
Again, try modifying the values and play with multi-dimensional arrays before moving ahead.
We can also perform reverse operations on subarrays:
Note: Array slices are not copies of the arrays. This means that if we want to modify the array obtained from the slicing operation without changing the original array, we have to use the copy() method:
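Here is a small sketch (not from the original lesson) of why copy() matters: modifying a plain slice writes through to the original array, while modifying a copy does not.

import numpy as np

x2 = np.arange(9).reshape((3, 3))

# A plain slice is a view into the original array
sub = x2[:2, :2]
sub[0, 0] = 99
print(x2[0, 0])   # 99 -- the original array changed as well

# copy() gives an independent subarray
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = 42
print(x2[0, 0])   # still 99 -- the original array is untouched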
* 5. Reshaping of Arrays
* 6. Concatenation and Splitting of Arrays
  * a. Concatenation
  * b. Splitting
5. Reshaping of Arrays
Reshaping is about changing the way items are arranged within the array so that the shape of the array changes but the overall number of elements stays the same, e.g., you can use it to convert a 1D array into a 2D array.
Reshaping is a very useful operation and it can easily be done using the reshape() method. Since a picture speaks a thousand words, let's see the effects of reshaping visually:
[Diagram: np.reshape(x, (3, 2)) rearranges the six elements of x into 3 rows and 2 columns]
How can we do this in code? Say we want to create a 3x3 grid with the numbers from 1 to 9. We first create a 1D array and then convert it to the desired shape, as shown below.
Run the code in the widget below and observe the outputs of the print statements to understand what's going on.
import numpy as np
reshaped = np.arange(1, 10).reshape((3, 3))
print(reshaped)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Similarly, we can use reshaping to convert between row vectors and column vectors by simply specifying the dimensions we want. A row vector has only 1 row. This means that after reshaping, all the values end up as columns. A column vector has only 1 column and all the values end up in rows.
import numpy as np

x = np.array([1, 2, 3])
print(x)

# Row vector via reshape
x_rv = x.reshape((1, 3))
print(x_rv)

# Column vector via reshape
x_cv = x.reshape((3, 1))
print(x_cv)
[1 2 3]
[[1 2 3]]
[[1]
 [2]
 [3]]
a. Concatenation
The concatenate() method allows us to put arrays together. Let's understand this with some examples.
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [11, 11, 11]

np.concatenate([x, y, z])
#> array([ 1,  2,  3,  3,  2,  1, 11, 11, 11])

# We can also concatenate 2-dimensional arrays
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid])
#> array([[1, 2, 3],
#>        [4, 5, 6],
#>        [1, 2, 3],
#>        [4, 5, 6]])
[Diagram: np.hstack((x, y)) stacks arrays horizontally (side by side, adding columns), while np.vstack((x, y)) stacks them vertically (one on top of the other, adding rows)]
Run the code below to better understand how to do this practically. Before looking at the output, try to visualize the solution in your head.
Reminder: You can add print statements around the outputs to see the results in your console as well.
x = np.array([3, 4, 5])
grid = np.array([[1, 2, 3], [9, 10, 11]])

np.vstack([x, grid])   # vertically stack the arrays
#> array([[ 3,  4,  5],
#>        [ 1,  2,  3],
#>        [ 9, 10, 11]])

z = np.array([[19], [19]])
np.hstack([grid, z])   # horizontally stack the arrays
#> array([[ 1,  2,  3, 19],
#>        [ 9, 10, 11, 19]])
b. Splitting
We can do the opposite of concatenation and split arrays based on given positions for the split points.
x = np.arange(10)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x1, x2, x3 = np.split(x, [3, 6])
print(x1, x2, x3)
#> [0 1 2] [3 4 5] [6 7 8 9]
[Diagram: np.hsplit(a, 2) splits an array into two halves along its columns (left and right), while np.vsplit(a, 2) splits it into two halves along its rows (top and bottom)]
Let's see this in action. Run the code below and observe the outputs.
import numpy as np

grid = np.arange(16).reshape((4, 4))
print(grid, "\n")

# Split vertically and print the upper and lower arrays
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower, "\n")

# Split horizontally and print the left and right arrays
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]] 

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
In addition to the functions we have learned so far, there are several other very useful functions available in the NumPy library (sum, divide, abs, power, mod, sin, cos, tan, log, var, min, mean, max, etc.) which can be used to perform mathematical calculations. There are also built-in functions to compute aggregates and functions to perform comparisons. We are going to learn these concepts in the upcoming lessons, so keep going!
NumPy Arithmetic and Statistics - Computations and Aggregations
1. Computations on NumPy Arrays
The reason for NumPy's importance in the Pythonic data science world is its ability to perform computations in a fast and efficient manner. Now that we are familiar with the basic nuts and bolts of NumPy, we are going to dive into using it to perform computations.
NumPy provides so-called universal functions (ufuncs) that can be used to make repeated calculations on array elements in a very efficient manner. These are functions that operate on nD-arrays in an element-by-element fashion. Remember the vectorized operations from earlier.
Mathematical Functions
What are some of the most common and useful mathematical ufuncs available in the NumPy package? Let's explore them with some concrete examples.
The arithmetic operators, as shown in the code widget below, are convenient wrappers around specific functions built into NumPy; for example, the + operator is a wrapper for the add ufunc.
Run the code in the widget below, tweak the inputs, and observe the outputs of the print statements.
import numpy as np

x = np.arange(10)

# Native arithmetic operators
print("x      =", x)
print("x + 5  =", x + 5)
print("x - 5  =", x - 5)
print("x * 5  =", x * 5)
print("x / 5  =", x / 5)
print("x ** 2 =", x ** 2)
print("x % 2  =", x % 2)

# OR we can use the explicit functions (ufuncs), e.g. "add" instead of "+"
print(np.add(x, 5))
print(np.subtract(x, 5))
print(np.multiply(x, 5))
print(np.divide(x, 5))
print(np.power(x, 2))
print(np.mod(x, 2))
x      = [0 1 2 3 4 5 6 7 8 9]
x + 5  = [ 5  6  7  8  9 10 11 12 13 14]
x - 5  = [-5 -4 -3 -2 -1  0  1  2  3  4]
x * 5  = [ 0  5 10 15 20 25 30 35 40 45]
x / 5  = [0.  0.2 0.4 0.6 0.8 1.  1.2 1.4 1.6 1.8]
x ** 2 = [ 0  1  4  9 16 25 36 49 64 81]
x % 2  = [0 1 0 1 0 1 0 1 0 1]
[ 5  6  7  8  9 10 11 12 13 14]
[-5 -4 -3 -2 -1  0  1  2  3  4]
Some of the most useful functions for data scientists are the trigonometric functions. Let's look into these. Let's define an array of angles first and then compute some trigonometric functions based on those values:
import numpy as np

theta = np.linspace(0, np.pi, 4)

print("theta      =", theta)
print("sin(theta) =", np.sin(theta))
print("cos(theta) =", np.cos(theta))
print("tan(theta) =", np.tan(theta))
Note: These might not seem useful to you at the moment, but you will see their direct application in our final project.
import numpy as np

x = [1, 2, 3]

print("x        =", x)
print("e^x      =", np.exp(x))
print("2^x      =", np.exp2(x))
print("3^x      =", np.power(3, x))
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))
x        = [1, 2, 3]
e^x      = [ 2.71828183  7.3890561  20.08553692]
2^x      = [2. 4. 8.]
3^x      = [ 3  9 27]
ln(x)    = [0.         0.69314718 1.09861229]
log2(x)  = [0.        1.        1.5849625]
log10(x) = [0.         0.30103    0.47712125]
Say we want to apply some operation to reduce an array to a single value. We can use the reduce() method for this. This method repeatedly applies the given operation to the elements of an array until only a single result remains. For example, calling reduce on the add ufunc returns the sum of all elements in the array:
import numpy as np

x = np.arange(1, 6)

sum_all = np.add.reduce(x)
print(x)
print(sum_all)
[1 2 3 4 5]
15
Note: add.reduce() is equivalent to calling sum(). In fact, when the argument is a NumPy array, np.sum ultimately calls add.reduce to do the job. The overhead of handling its argument and dispatching to add.reduce can make np.sum slightly slower. For more details, you can refer to this answer on StackOverflow.
If we need to store all the intermediate results of the computation, we can use accumulate() instead:
import numpy as np

x = np.arange(1, 6)

sum_acc = np.add.accumulate(x)
print(x)
print(sum_acc)
[1 2 3 4 5]
[ 1  3  6 10 15]
2. Aggregations
When we have large amounts of data, as a first step, we like to get an understanding of the data by computing its summary statistics, like the mean and standard deviation.
Note: We will look into the theoretical aspects of these statistical concepts in the "Statistics for Data Science" section, so don't worry if you don't remember what standard deviation is, for instance!
import numpy as np

x = np.random.random(100)

# Sum of all the values
print("Sum of values is: ", np.sum(x))

# Mean value
print("Mean value is: ", np.mean(x))

# For min, max, sum, and several other NumPy aggregates,
# a shorter syntax is to use methods of the array object itself,
# i.e. instead of np.sum(x), we can use x.sum()
print("Sum: ", x.sum())
print("Mean: ", x.mean())
print("Max: ", x.max())
print("Min: ", x.min())
a. Comparisons
In this world reigned over by social media, the trap of making comparisons is just about everywhere. So, staying true to the culture of making comparisons, let's talk about comparisons in NumPy.
NumPy provides comparison operators such as less than and greater than as element-wise functions. The result of these comparison operators is always an array with a Boolean data type, i.e., we get a boolean array as output which contains only True and False values depending on whether the element at that index lives up to the comparison or not. Let's see this in action with some examples.
import numpy as np

x = np.array([1, 2, 3, 4, 5])

print(x < 2)   # less than
print(x >= 4)  # greater than or equal
x = np.array([1, 2, 3, 4, 5])

# Elements for which multiplying by two is the same as the square of the value
(2 * x) == (x ** 2)
#> array([False,  True, False, False, False], dtype=bool)
We can also count the entries in the boolean array that we get as output. This can help us perform other related operations, like getting the total count of values less than 6 with np.count_nonzero, or checking whether all or any of the values in the array are less than 10 with np.all and np.any:
import numpy as np

x = np.arange(10)
print(x)

# How many values less than 6?
print(np.count_nonzero(x < 6))

# Are there any values greater than 8?
print(np.any(x > 8))

# Are all values less than 10?
print(np.all(x < 10))
[0 1 2 3 4 5 6 7 8 9]
6
True
True
b. Boolean Masks
A more powerful pattern than just obtaining a boolean output array is to use boolean arrays as masks. This means that we select particular subsets of the array that satisfy some given conditions by indexing with the boolean array. We don't just want to know if an index holds a value less than 10, we want to get all the values less than 10 themselves.
Suppose we have a 3x3 grid with random integers from 0 to 10 and we want an array of all values in the original array that are less than 6. We can achieve this like so:
import numpy as np

# Random integers between [0, 10) of shape 3x3
x = np.random.randint(0, 10, (3, 3))
print(x)

# Boolean array
print(x < 6)

# Boolean mask
print(x[x < 6])
[[7 1 2]
 [8 6 6]
 [0 3 9]]
[[False  True  True]
 [False False False]
 [ True  True False]]
[1 2 0 3]
By combining boolean operations, masking operations, and aggregates, we can very quickly answer a lot of useful questions about our dataset.
Final Thoughts
Congratulations! We are at the end of our lessons on NumPy. Of course, we will keep bumping into it in the upcoming lessons as well, especially in the Projects section.
For a deeper dive into all the goodness NumPy has to offer, here is their official documentation.
Last but not least, before moving on to new concepts, make sure to test your NumPy knowledge and solidify the concepts learned so far by completing the exercises in the next lesson.
Exercises: NumPy
* Time To Test Your Skills!
Q1. Create a null vector (all zeros) of size 10 and set it in the variable called "Z".
Solution
Z = np.zeros(10)
Q2. Create a vector with values from 0 to 9 and set it in the variable called "arr".
Solution
arr = np.arange(10)
Q3. Create a 3x3x3 array with random values and set it in the variable called "arr".
Solution
arr = np.random.random((3, 3, 3))
Q4. Create a 10x10 array with random values called "arr4". Find its minimum and maximum values and set them in the variables called "min_val" and "max_val" respectively.
Solution
arr4 = np.random.random((10, 10))
min_val = arr4.min()
max_val = arr4.max()
Q5. First create a 1D array with numbers from 1 to 9 and then convert it into a 3x3 grid. Store the final answer in the variable called "grid".
Solution
grid = np.arange(1, 10).reshape((3, 3))
Q6. Replace the maximum value in the given vector, "arr6", with -1.
# Input
arr6 = np.arange(10)

# Your solution goes here
Solution
arr6 = np.arange(10)
arr6[arr6.argmax()] = -1
Q7. Reverse the rows of the given 2D array, "arr7".
# Input
arr7 = np.arange(9).reshape(3, 3)

# Your solution goes here
Solution
# Input
arr7 = np.arange(9).reshape(3, 3)

# Solution
arr7 = arr7[::-1]
Q8. Subtract the mean of each row of the given 2D array, "arr8", from the values in the array. Set the updated array in "transformed_arr8".
To get the mean along the row axis, you can use the numpy.mean method: mean(axis=1, keepdims=True).
# Input
arr8 = np.random.rand(3, 10)

# Your solution goes here
Solution
transformed_arr8 = arr8 - arr8.mean(axis=1, keepdims=True)
Pandas is a very powerful and popular package built on top of NumPy. It provides an efficient implementation of data objects built on NumPy arrays and many powerful data operations. These kinds of operations are known as data wrangling: the steps required to prepare the data so that it can actually be consumed for extracting insights and model building.
This might surprise you, but data preparation is what takes the longest in a data science project!
The two primary components of Pandas are the Series and DataFrame objects. A Series is essentially a column. A DataFrame is a multi-dimensional table made up of a collection of Series; it can consist of heterogeneous data types and even contain missing data.
Lessons Overview
Pandas provides many useful tools and methods in addition to the basic data structures. These tools and methods require familiarity with the core data structures though, so we will start by understanding the nuts and bolts of Series and DataFrames. Then we will dive into all the good things that Pandas has to offer by analyzing some real data: be ready to explore the IMDB movies dataset!
Pandas Core Components - The Series Object
Let's start by creating a simple Series from a list of values:
import pandas as pd

series = pd.Series([0, 1, 2, 3])
print(series)
0    0
1    1
2    2
3    3
dtype: int64
From the previous output, we can see that a Series consists of both a sequence of values and a sequence of indices. The values are simply a NumPy array, while the index is an array-like object of type pd.Index. Values can be accessed with the corresponding index using the already familiar square-bracket and slicing notations:
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, 5])

print("values:", series.values)
print("Indices:", series.index, "\n")

print(series[1], "\n")   # Get a single value

print(series[1:4])       # Get a range of values
values: [0 1 2 3 4 5]
Indices: RangeIndex(start=0, stop=6, step=1) 

1 

1    1
2    2
3    3
dtype: int64
Pandas' Series are much more general and flexible than 1D NumPy arrays. The essential difference is the presence of the index; while the values in a NumPy array have an implicitly defined integer index (used to get and set values), the Pandas Series has an explicitly defined index, which gives the Series object additional capabilities.
For example, in Pandas, the index doesn't have to be an integer; it can consist of values of any desired type, e.g., we can use strings as an index and the item access works as expected. Here is an example of a Series based on a non-integer index:
import pandas as pd

data = pd.Series([12, 24, 13, 54],
                 index=['a', 'b', 'c', 'd'])

print(data, "\n")
print("Value at index b:", data['b'])
a    12
b    24
c    13
d    54
dtype: int64 

Value at index b: 24
Let's see how we can create a Series from a dictionary, and then we will perform indexing and slicing on it. Say we have a dictionary whose keys are fruits and whose values correspond to their amounts. We want to use this dictionary to create a Series object and then access values using the names of the fruits:
import pandas as pd

fruits_dict = {'apples': 10,          # placeholder value
               'bananas': 3,
               'oranges': 5,
               'strawberries': 20}

fruits = pd.Series(fruits_dict)
print("Value for apples: ", fruits['apples'], "\n")

# Series also supports array-style operations such as slicing:
print(fruits['bananas':'strawberries'])
bananas          3
oranges          5
strawberries    20
dtype: int64
Pandas Core Components - The DataFrame Object
import pandas as pd

data_s1 = pd.Series([12, 24, 33, 15],
                    index=['apples', 'bananas', 'strawberries', 'oranges'])

# 'quantity' is the name for our column
dataframe1 = pd.DataFrame(data_s1, columns=['quantity'])
print(dataframe1)
              quantity
apples              12
bananas             24
strawberries        33
oranges             15
We can also build a DataFrame from a dictionary that maps column names to their values:

import pandas as pd

fruits = ['apples', 'bananas', 'strawberries', 'oranges']
d = {'price': pd.Series([4.0, 4.5, 8.0, 7.0], index=fruits),
     'quantity': pd.Series([12, 24, 33, 15], index=fruits)}

data = pd.DataFrame(d)
print(data)
              price  quantity
apples          4.0        12
bananas         4.5        24
strawberries    8.0        33
oranges         7.0        15
We have only just scratched the surface and learned how to construct DataFrames. In the next lessons we will go deeper and learn by doing the many methods that we can call on these powerful objects.
Pandas DataFrame Operations - Read, View, and Extract Information
To make this step more engaging and fun, we are going to work with the IMDB Movies Dataset. The IMDB dataset is a publicly available dataset that contains information about 1,000 movies. Each row consists of a movie, and for each movie we have information like the title, year of release, director, number of votes, rating, duration, etc. Sounds fun to explore, right?
Let's put our data scientist's hat on and dive into the world of movies!
Note: Once you have gone through these "IMDB lessons", I highly recommend you download this dataset and play with it. It is really important to get your hands dirty; don't just read through these lessons!
You can also find the Jupyter Notebook with all the code for these "IMDB lessons" on my Git profile, here.
You can find the live execution of the Jupyter Notebook at the end of this lesson.
import pandas as pd

# Reading data from the downloaded CSV:
movies_df = pd.read_csv("IMDB-Movie-Data.csv")
Note that we are creating both a title-indexed and a default DataFrame (we don't need both), so that we can understand the indexing concept better by comparing the two in the following steps.
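The exact line that creates the title-indexed DataFrame is not visible in the extracted text; a minimal sketch of one way to build both DataFrames (assuming the CSV has a "Title" column, which the later examples rely on):

import pandas as pd

movies_df = pd.read_csv("IMDB-Movie-Data.csv")

# Title-indexed copy, used by the examples that follow
movies_df_title_indexed = movies_df.set_index('Title')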
We can use the head() method to visualize the first few rows of our dataset. This method outputs the first 5 rows of the DataFrame by default, but we can pass the number of rows we want as an input parameter.
movies_df.head()

# To output the top ten rows
movies_df.head(10)
[Output: the first five rows of movies_df, with columns Rank, Title, Genre, Description, Director, Actors, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), and Metascore, for movies such as "Guardians of the Galaxy", "Prometheus", "Split", "Sing", and "Suicide Squad"]
Now, if we print the rows of the DataFrame with the explicit index, we can see that the name of the indexed column, "Title", gets printed slightly lower than the rest of the columns, and it is displayed in place of the column which was showing row numbers in the previous case (with the default index):
movies_df_title_indexed.head()
[Output: the first five rows of movies_df_title_indexed, with "Title" as the index column]
After this simple visualization, we are already more familiar with our dataset. Now we know which columns make up our data and what the values in each column look like. We can now see that each row in our dataset consists of a movie, and for each movie we have information like "Rating", "Revenue", "Actors", and "Genre". Each column is also called a feature, attribute, or variable.
We can also see that each movie has an associated rank and that the rows are ordered by the "Rank" feature.
Similarly, it can be useful to observe the last rows of the dataset. We can do this by using the tail() method. Like the head() method, tail() also accepts the number of rows we want to view as an input parameter.
Let's look at the last three rows of our dataset (the worst movies in terms of rank):

movies_df_title_indexed.tail(3)
[Output: the last three rows of movies_df_title_indexed: "Step Up 2: The Streets", "Search Party", and "Nine Lives"]
a. info(): This method allows us to get some essential details about our dataset, like the number of rows and columns, the number of index entries within the index range, the type of data in each column, the number of non-null values, and the memory used by the DataFrame:

# This should be one of the very first commands you run after loading your data:
movies_df_title_indexed.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
As we can see from the snippet above, our dataset consists of 1000 rows and 11 columns, and it is using about 93KB of memory. An important thing to notice is that we have two columns with missing values: "Revenue" and "Metascore". Knowing which columns have missing values is important for the next steps. Handling missing data is an important data preparation step in any data science project; more often than not, the machine learning algorithms and data analysis methods we need to use are not able to handle missing data themselves.
The output of the info() method also shows us if we have any columns that we expected to be integers but are actually strings instead. For example, if the revenue had been recorded as a string type, before doing any numerical analysis on that feature we would have needed to convert the values for revenue from string to float.
.shape: This is a fast and useful attribute which outputs a tuple, (rows, columns), representing the number of rows and columns in the DataFrame. This attribute comes in very handy when cleaning and transforming data. Say we had filtered the rows based on some criteria. We can use shape to quickly check how many rows we are left with in the filtered DataFrame:

movies_df_title_indexed.shape
# Output: (1000, 11)

# Note: .shape has no parentheses and is a simple tuple of format (rows, columns).
# From the output we can see that we have 1000 rows and 11 columns in our movies DataFrame.
b. describe(): This is a great method for doing a quick analysis of the dataset. It computes summary statistics of the integer/double variables and gives us some basic statistical details like percentiles, mean, and standard deviation:
movies_df_title_indexed.describe()
We can see that we have a lot of useful high-level insights about our data now. For example, we can tell that our dataset consists only of movies from 2006 (min Year) to 2016 (max Year). The maximum revenue generated by any movie during that period was 936.63M USD, while the mean revenue was 82.9M USD. We can analyze all the other features as well and extract important information like a breeze!
Note: We will talk about these statistics concepts in detail later, so don't worry if any of these stats sound alien to you.
Pandas DataFrame Operations - Selection, Slicing, and Filtering
4. Data Selection and Slicing
One important thing to remember here is that although many of the methods can be applied to both DataFrame and Series, these two have different attributes. This means we need to know which type of object we are working with. Otherwise, we can end up with errors.
We can extract a column by using its label (column name) and the square-bracket notation:

genre_col = movies_df['Genre']

If we want to extract multiple columns, we can simply add additional column names to the list (see the sketch below).
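A minimal sketch of selecting several columns at once; the column names are taken from the dataset's columns listed earlier:

# A single label returns a Series; a list of labels returns a DataFrame
subset = movies_df[['Genre', 'Rating', 'Revenue (Millions)']]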
Now let's look at how to perform slicing by rows. Here we essentially have the following indexers:
* loc: the loc attribute allows indexing and slicing that always references the explicit index, i.e., it locates by name. For example, in our DataFrame indexed by title, we will use the title of the movie to select the required row.
* iloc: the iloc attribute allows indexing and slicing that always references the implicit Python-style index, i.e., it locates by numerical index. In the case of our DataFrame, we will pass the numerical index of the movie for which we are interested in fetching data.
* ix: this is a hybrid of the other two approaches. We will understand this better by looking at some examples.
# With loc we give the explicit index, in our case the title, "Guardians of the Galaxy":
gog = movies_df_title_indexed.loc["Guardians of the Galaxy"]
gog
Rank                                                                  1
Genre                                           Action,Adventure,Sci-Fi
Description           A group of intergalactic criminals are forced ...
Director                                                     James Gunn
Actors                Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
Year                                                               2014
Runtime (Minutes)                                                   121
Rating                                                              8.1
Votes                                                            787074
Revenue (Millions)                                               333.13
Metascore                                                            76
Name: Guardians of the Galaxy, dtype: object
We can also get slices with multiple rows in the same manner:
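The corresponding code was lost in extraction; a minimal sketch of multi-row slices with loc and iloc (the specific titles and positions are just illustrative):

# Slice by explicit index (rows from "Prometheus" up to and including "Sing")
movies_df_title_indexed.loc["Prometheus":"Sing"]

# Slice by implicit integer position (rows 1 to 3)
movies_df_title_indexed.iloc[1:4]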
If we do not want to select all the columns, we can specify both rows and columns at once; the first index refers to rows while the second one (after the comma) refers to columns:
# Select all rows up until 'Sing' and all columns up until 'Director'
movies_df_title_indexed.loc[:'Sing', :'Director']
movies_df_title_indexed.iloc[:4, :3]
Now let's look at the hybrid approach, ix. It's just like the other two indexing options, except that we can use a mix of explicit and implicit indexes:
# Select all rows up until 'Sing' and all columns up until 'Director'
movies_df_title_indexed.ix[:'Sing', :4]
movies_df_title_indexed.ix[:4, :'Director']
5. Filtering
Say we want to filter our movies DataFrame to show only the movies from 2016, or all the movies that had a rating of more than 8.0. The sketch below shows what these two simple filters look like.
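A minimal sketch (the original lesson's exact code for this step is not shown in the extracted text):

# Movies released in 2016
movies_df_title_indexed[movies_df_title_indexed['Year'] == 2016]

# Movies with a rating above 8.0
movies_df_title_indexed[movies_df_title_indexed['Rating'] > 8.0]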
Now let's look at some more complex filters. We can make our conditions richer with logical operators like "|" and "&".
Say we want to retrieve the latest movies (movies released between 2010 and 2016) that had a very poor rating (score less than 6.0) but were among the highest earners at the box office (revenue above the 75th percentile). We can write our query as follows:
movies_df_title_indexed[
    ((movies_df_title_indexed['Year'] >= 2010) & (movies_df_title_indexed['Year'] <= 2016))
    & (movies_df_title_indexed['Rating'] < 6.0)
    & (movies_df_title_indexed['Revenue (Millions)'] > movies_df_title_indexed['Revenue (Millions)'].quantile(0.75))
]
The results tell us that "Fifty Shades of Grey" tops the list of movies with the worst reviews but the highest revenues! In total there are 12 movies that match these criteria.
Note that the 75th percentile was given to us earlier by the .describe() method (it was 113.715M USD), and these are all movies with revenue above that.
* 6. Grouping
* 7. Sorting
6. Grouping
Things start looking really interesting when we group rows by certain criteria and then aggregate their data.
Say we want to group our dataset by director and see how much revenue (sum) each director earned at the box office, and then also look at the average rating (mean) for each director.
We can do this by using the groupby operation on the column of interest, followed by the appropriate aggregate (sum/mean), like so:
# Let's group our dataset by director and see how much revenue each director has
movies_df.groupby('Director').sum()

# Let's group our dataset by director and see the average rating of each director
movies_df.groupby('Director')[['Rating']].mean()
[Output: two grouped DataFrames, one row per director; the first with the numerical columns (Rank, Year, Runtime (Minutes), Rating, Votes, Revenue (Millions), Metascore) summed, the second with the mean Rating per director]
As we can see, Pandas grouped all the 'Director' rows by name into one. And since we used sum() for aggregation, it added together all the numerical columns. The values in each of the columns now represent the sum of the values in that column for that director.
For example, we can see that the director Aamir Khan has a very high average rating (8.5) but his revenue is much lower compared to many other directors (only 1.20M USD). This can be attributed to the fact that we are looking at a dataset with international movies, and Hollywood directors/movies understandably have much higher revenues compared to movies from international directors.
In addition to sum() and mean(), Pandas provides multiple other aggregation functions like min() and max().
Halt! Can you find a problem in the code when we apply aggregation to get the sum?
This is not the correct approach for all the columns. We do not want to sum all the 'Year' values, for instance. To make Pandas apply the aggregation to some of the columns only, we can specify the names of the columns we are interested in. For example, in the second example, we specifically passed the 'Rating' column, so that the mean did not get applied to all the columns.
Grouping is an easy and extremely powerful data analysis method. We can use it to fold datasets and uncover insights from them, while aggregation is one of the foundational tools of statistics. In fact, learning to use groupby() to its full potential can be one of the greatest uses of the Pandas library.
7. Sorting
Pandas allows easy sorting based on multiple columns. We can apply sorting to the result of the groupby() operation or we can apply it directly to the full DataFrame. Let's see this in action via two examples:
1. Say we want the total revenue per director, and we want our results sorted by earnings, not in alphabetical order like in the previous examples.
We can first do a groupby() followed by sum() (just like before) and then we can call sort_values on the results. To sort by revenue, we need to pass the name of that column as input to the sorting method; we can also specify that we want the results sorted from highest to lowest revenue:
# Let's group our dataset by director and see who earned the most
movies_df.groupby('Director')[['Revenue (Millions)']].sum().sort_values(['Revenue (Millions)'], ascending=False)
                   Revenue (Millions)
Director
J.J. Abrams                   1683.45
David Yates                   1620.51
Christopher Nolan             1515.09
Michael Bay                   1421.32
Francis Lawrence              1290.81
Joss Whedon                   1082.27
Jon Favreau                   1025.60
Zack Snyder                    975.76
Peter Jackson                  260.45
2. Now, say we want to see which movies had both the highest revenue and the highest rating.
Any guesses?!
# Let's sort our movies by revenue and rating and then get the top 10 results
data_sorted = movies_df_title_indexed.sort_values(['Revenue (Millions)', 'Rating'], ascending=False)
data_sorted[['Revenue (Millions)', 'Rating']].head(10)
We now know that J.J. Abrams is the director who earned the most $$$ at the box office, and that Star Wars is the movie with the highest revenue and rating, followed by some other very popular movies!
8. Handling Missing Data
Pandas uses two sentinel values to represent missing data:
* None: A Python object that is often used for missing data in Python. None can only be used in arrays with the data type 'object' (i.e., arrays of Python objects).
* NaN (Not a Number): A special floating-point value that is used to represent missing data. The floating-point type means that, unlike with None's object array, we can perform mathematical operations. However, remember that, regardless of the operation, the result of arithmetic with NaN will be another NaN.
Run the examples in the code widget below to understand the difference between the two. Observe that performing arithmetic operations on the array with the None type throws a runtime error, while the code executes without errors for NaN:
import numpy as np
import pandas as pd

# Example with None
None_example = np.array([0, None, 2, 3])
print("dtype =", None_example.dtype)
print(None_example)

# Example with NaN
NaN_example = np.array([0, np.nan, 2, 3])
print("dtype =", NaN_example.dtype)
print(NaN_example)

# Math operations fail with None but give NaN as output with NaNs
print("Arithmetic Operations")
print("Sum with NaNs:", NaN_example.sum())
print("Sum with None:", None_example.sum())
dtype = object
[0 None 2 3]
dtype = float64
[ 0. nan  2.  3.]
Arithmetic Operations
Sum with NaNs: nan
Pandas is built to handle both NaN and None, and it treats the two as essentially interchangeable for indicating missing or null values. Pandas also provides us with many useful methods for detecting, removing, and replacing null values in Pandas data structures: isnull(), notnull(), dropna(), and fillna(). Let's see all of these in action with some demonstrations.
isnull() and notnull() are two useful methods for detecting null data in Pandas data structures. They return a Boolean mask over the data. For example, let's see if there are any movies for which we have some missing data:

movies_df_title_indexed.isnull()
As we can see from the snippet of the Boolean mask above, isnull() returns a DataFrame where each cell is either True or False depending on that cell's missing-value status. For example, we can see that we do not have the revenue information for the movie "Mindhorn".
We can also count the number of null values in each column using an aggregate function for summing:

movies_df_title_indexed.isnull().sum()
Now we know that we do not know the revenue for 128 movies and the metascore for 64.
Removing null values is very straightforward with the dropna() method. However, it is not always the best approach for dealing with null values. And here comes the dilemma of dropping vs. imputation, i.e., replacing nulls with some reasonable non-null values.
* By default, this method will drop all rows in which any null value is present and return a new DataFrame without altering the original one. If we want to modify our original DataFrame in place instead, we can specify inplace=True.
* Alternatively, we can drop all columns containing any null values by specifying axis=1.
# Drop all the rows containing any missing data
movies_df_title_indexed.dropna()

# Drop all the columns containing any missing data
movies_df_title_indexed.dropna(axis=1)
* Dropping rows would remove the 128 rows where revenue is null and the 64 rows where metascore is null. This is quite some data loss, since there's perfectly good data in the other columns of those dropped rows!
* Dropping columns would remove the revenue and metascore columns entirely; not a smart move either!
To avoid losing all this good data, we can also choose to drop rows or columns based on a threshold, i.e., drop only if the majority of the data is missing. This can be specified using the how or thresh parameters, which allow fine control over the number of nulls to let through into the DataFrame:
# Drop columns only if all of their values are null
df.dropna(axis='columns', how='all')

# thresh specifies a minimum number of non-null values
# for the row/column to be kept
df.dropna(axis='rows', thresh=10)
As we have just seen, dropping rows or columns with missing data can result in losing a significant amount of interesting data. So often, rather than dropping data, we replace the missing values with a valid value. This new value can be a single number, like zero, or it can be some sort of imputation or interpolation from the good values, like the mean or the median value of that column. For this, Pandas provides us with the very handy fillna() method.
For example, let's impute the missing values for the revenue column using the mean revenue:
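The code widget for this step did not survive extraction; a minimal sketch consistent with the outputs below, assuming the mean is computed from the non-null values and filled in place:

# Mean of the non-null revenue values
revenue_mean = movies_df_title_indexed['Revenue (Millions)'].mean()
print(revenue_mean)

# Replace missing revenues with the mean, modifying the DataFrame in place
movies_df_title_indexed['Revenue (Millions)'].fillna(revenue_mean, inplace=True)

# Verify: count the remaining nulls per column
movies_df_title_indexed.isnull().sum()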
82.95637614678897
Genre                  0
Description            0
Director               0
Actors                 0
Year                   0
Runtime (Minutes)      0
Rating                 0
Votes                  0
Revenue (Millions)     0
Metascore             64
dtype: int64
We have now replaced all the missing values for revenue with the mean of the column, and as we can observe from the output, by using inplace=True we have modified the original DataFrame: it has no more nulls for revenue.
Note: While computing the mean, the aggregate operation did not fail even though we had missing values, because the dataset denotes missing revenues with NaN, which the Pandas mean() simply skips.
This was a very simple way of imputing values. Instead of replacing nulls with the mean of the entire column, a smarter approach could have been more fine-grained: we could have replaced the null values with the mean revenue specifically for the genre of that movie, instead of the mean for all the movies.
9. Handling Duplicates
We do not have duplicate rows in our movies dataset, but this is not always the case. If we do have duplicates, we want to make sure that we are not performing computations, like getting the total revenue per director, based on duplicate data.
Pandas allows us to very easily remove duplicates using the drop_duplicates() method. This method returns a copy of the DataFrame with the duplicates removed, unless we choose to specify inplace=True, just like for the previously seen methods.
Note: It's a good practice to use .shape to confirm the change in the number of rows after the drop_duplicates() method has been run.
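A minimal sketch of this pattern (not shown in the original extract); our movies DataFrame has no duplicates, so the shape stays the same here:

print(movies_df_title_indexed.shape)           # shape before

movies_df_title_indexed.drop_duplicates(inplace=True)

print(movies_df_title_indexed.shape)           # shape after: unchanged if there were no duplicates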
Say we want to introduce a new column in our DataFrame that has the revenue per minute for each movie. We can divide the revenue by the runtime and create this new column very easily, like so:
# Let's use 'Revenue (Millions)' and 'Runtime (Minutes)' to calculate the revenue per minute for each movie
movies_df_title_indexed["Revenue per Min"] = movies_df_title_indexed['Revenue (Millions)'] / movies_df_title_indexed['Runtime (Minutes)']
movies_df_title_indexed.head()
From the snippet above, we can see that we have a new column at the end of the DataFrame with the revenue per minute for each movie. This is not necessarily useful information; it was just an example to demonstrate how to create new columns based on existing data.
A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summary of the data. Hard to understand? Let's understand the concept with an example!
Say we want to compare the $$$ earned by the various directors per year. We can create a pivot table using pivot_table; we can set index='Director' (the rows of the pivot table) and get the yearly revenue information by setting columns='Year':
# Let's calculate the revenue per director per year by using a pivot table instead of groupby as seen previously
movies_df_title_indexed.pivot_table('Revenue (Millions)', index='Director',
                                    aggfunc='sum', columns='Year').head()
Year                   2006    2007    2008  2009    2010  2011   2012  2013  2014   2015  2016
Director
Aamir Khan              NaN    1.20     NaN   NaN     NaN   NaN    NaN   NaN   NaN    NaN   NaN
Abdellatif Kechiche     NaN     NaN     NaN   NaN     NaN   NaN    NaN  2.20   NaN    NaN   NaN
Adam Leon               NaN     NaN     NaN   NaN     NaN   NaN    NaN   NaN   NaN    NaN  0.00
Adam McKay           148.21     NaN  100.47   NaN  119.22   NaN    NaN   NaN   NaN  70.24   NaN
Adam Shankman           NaN  118.82     NaN   NaN     NaN   NaN  38.51   NaN   NaN    NaN   NaN
From our pivot table, we can observe that for Aamir Khan we only have revenue for 2007. This can imply two things: either 2007 was the only year in this 10-year period when any of his movies got released, or we simply do not have complete data for this director. We can also see that Adam McKay has the most movies over this ten-year period, with his highest revenue being in 2006 and his lowest in 2015.
With a simple pivot table, we can see the annual trend in revenue per director; pivot tables are indeed a powerful tool for data analysis!
You might be thinking, "Can't we do this by iterating over the DataFrame or Series like with lists?" Yes, you are right, we can. The problem is that it would not be an efficient approach, especially when dealing with large datasets. Pandas utilizes vectorized operations, which means operations are applied to whole arrays instead of individual elements.
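To make that concrete, here is a small sketch contrasting the vectorized column arithmetic we used above with a loop-based version; the DataFrame and column names are the ones assumed from earlier in this lesson, and on a large dataset the first form is typically far faster:

# Vectorized: one expression operates on the whole column at once
movies_df_title_indexed["Revenue per Min"] = (
    movies_df_title_indexed["Revenue (Millions)"] / movies_df_title_indexed["Runtime (Minutes)"]
)

# Loop-based equivalent: much slower, because each row is handled one by one in Python
for title, row in movies_df_title_indexed.iterrows():
    movies_df_title_indexed.loc[title, "Revenue per Min"] = (
        row["Revenue (Millions)"] / row["Runtime (Minutes)"]
    )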
For example, we could use a function to classify movies into four buckets ("great", "good", "average", "bad") based on their numerical ratings. We can do this in two steps:
* First, define a function that, when given a rating, determines the right bucket for that movie.
* Then apply that function to the DataFrame.
# 1. Let's define the function to put movies into buckets based on their rating
def rating_bucket(x):
    if x >= 8.0:
        return "great"
    elif x >= 7.0:
        return "good"
    elif x >= 6.0:
        return "average"
    else:
        return "bad"

# 2. Let's apply the function
movies_df_title_indexed["RatingCategory"] = movies_df_title_indexed["Rating"].apply(rating_bucket)

# 3. Let's see some results
movies_df_title_indexed.head(10)[['Rating', 'RatingCategory']]
                         Rating RatingCategory
Title
Guardians of the Galaxy     8.1          great
Prometheus                  7.0           good
Split                       7.3           good
Sing                        7.2           good
Suicide Squad               6.2        average
The Great Wall              6.1        average
La La Land                  8.3          great
Mindhorn                    6.4        average
The Lost City of Z          7.1           good
Passengers                  7.0           good
According to our rating method, we can see that "Guardians of the Galaxy" and "La La Land" are great movies, while "Suicide Squad" is just an average movie!
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
* Click on the "Click to Launch" button to work and see the code running live in the notebook.
Final Thoughts
What a fun ride it has been! From data exploration and data extraction to data transformation, we have learned so many Pandas magic tricks! And as a bonus, we have also gained many interesting movie insights along the way!
Being well-versed in Pandas operations is one of the essential skills in data science. It's important to have a good grasp of these fundamentals. If you want to go further or learn more Pandas tricks, the extensive official Pandas documentation is the place to go.
The Pandas official documentation is very extensive. To make navigating through it easier, here are some good places for a broader and/or more detailed overview:
(Pandas cheat sheet image: a one-page summary of common operations such as selecting rows and columns, counting unique values with value_counts(), handling missing data with dropna() and fillna(), rolling and expanding windows, and plotting with df.plot.hist() and df.plot.scatter().)
Exercises: Pandas
Q3. Select all the rows where the Genre is 'Pop' and store the result in the variable "pop_artists".
Q4. Select the artists who have more than 2,000,000 listeners and whose Genre is 'Pop' and save the output in the variable called "top_pop".
Q5. Perform a grouping by Genre using sum() as the aggregation function and store the results in the variable called "grouped".
Q1. Create a DataFrame from the given dictionary data and index labels and store it in the variable called "df".
import pandas as pd

# Input
data = {'Artist': ['Ariana Grande', 'Taylor Swift', 'Ed Sheeran', 'Justin Bieber', 'Lady Gaga', 'Bruno Mars'],
        'Genre': ['Jazz', 'Rock', 'Jazz', 'Pop', 'Pop', 'Rock'],
        'Listeners': [1300000, 2700000, 5000000, 2000000, 3000000, 1100000]}
labels = ['AG', 'TS', 'ED', 'JB', 'LG', 'BM']

# Your solution goes here

# Uncomment the print statement once done
# print(df)
Solution
df = pd.DataFrame(data, index=labels)
print(df)
Q2. a) Select the column labelled "Listeners" and store it in the variable called "col". b) Select the first row and store it in the variable called "row".
Solution
col = df['Listeners']
row = df.iloc[0]  # or df.loc['AG']
Q3. Select all the rows where the Genre is 'Pop' and store the result in the variable "pop_artists".
Solution
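One way to do this, assuming the df created in Q1:

pop_artists = df[df['Genre'] == 'Pop']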
Q4. Select the artists who have more than 2,000,000 listeners and whose Genre is 'Pop' and save the output in the variable called "top_pop".
Solution
grouped = df.groupby('Artist').sum()
# Uncomment the print statement once done
# print(grouped)
Solution
grouped = df.groupby('Genre').sum()
Data Visualization - An Introduction
(Image credits: PolicyViz)
Python offers multiple graphing libraries that come packed with lots of different features. Matplotlib is the most popular library for creating visualizations in an easy way, so we are going to use it as a basis for learning the art of data visualization.
In this series of lessons on data visualization, we will start with general Matplotlib usage tips. Then we will go through the details of the main visualization techniques and also learn how to create them with code examples.
Before we continue, let me just remind you of what I said earlier: learning to use Python well means using a lot of libraries and functions. But you don't have to remember everything by heart; Google is your friend!
Data Visualization - Matplotlib Tips
1. Importing Matplotlib
Just as we used the np shorthand for NumPy and pd for Pandas, plt is the standard shorthand for Matplotlib:
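That convention typically looks like this:

import matplotlib.pyplot as plt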
Note that Matplotlib is a huge library, so we are only importing the pyplot part of it. This is useful to save memory and speed up code; otherwise we would be importing a lot of functionality even when we are just interested in performing some trivial tasks.
2. Setting Styles
The plt.style directive can be used to choose different prettifying styles for our figures. There are a number of pre-defined styles provided by Matplotlib. For example, there's a pre-defined style called ggplot, which tries to copy the look and feel of ggplot, a popular plotting package for R.
Below are some examples, both code and visual outputs, of the available styles; you can go through the official reference sheet for a complete overview.
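A minimal sketch of switching styles (the style name is one of the pre-defined names Matplotlib ships with):

import matplotlib.pyplot as plt

print(plt.style.available)   # list all pre-defined style names
plt.style.use('ggplot')      # switch to the ggplot-like style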
3. Displaying Plots
a. If you are using Matplotlib from within a script, the function plt.show() is the way to go. It triggers an event that looks for all currently active figure objects and opens one or more interactive windows to display them.
b. If you are working with a Jupyter notebook, plotting interactively within the notebook can be done with the %matplotlib command:
* %matplotlib notebook will create interactive plots embedded within the notebook.
* %matplotlib inline will create static images of your plots embedded in the notebook.
fig.savefig('figure.png')
These were some useful introductory tips. We are now ready to learn about some fun visualization techniques.
Data Visualization Techniques - Scatter, Line, and Histogram
Visualization Techniques
* 1. Scatter Plots
* 2. Line Plots
* 3. Histograms
Visualization Techniques
1. Scatter Plots
Scatter plots are deceptively simple and commonly used, but simple doesn't mean that they aren't useful!
In a scatter plot, data points are represented individually with a dot, circle, or some other shape. These plots are great for showing the relationship between two variables, as we can directly see the raw distribution of the data.
To create a scatter plot in Matplotlib we can simply use the scatter method. Let's see how by creating a scatter plot with randomly generated data points of many colors and sizes.
First, let's generate some random data points, x and y, and set random values for colors and sizes because we want a pretty plot:
import numpy as np

x = np.random.randn(100)
y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)
Now that we have our two variables (x, y) and the colors and sizes of the points, we can call the scatter method like this:
alpha=0.2: The optional alpha input parameter of the scatter method allows us to adjust the transparency level so that we can view overlapping data points. You will understand this concept better, visually, in the final output.
The color argument, c, is automatically mapped to a color scale, and the size argument, s, is given in points squared. To view the color scale next to the plot on the right-hand side, we can use the colorbar() command:
plt.colorbar()
Now let's put it all together and run the full code to see the output. Note that we will save the output to a file to display the plot with Educative's code widget.
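A minimal sketch of the combined code, under the assumptions above (the output path is just an example; alpha=0.2 matches the note earlier):

import numpy as np
import matplotlib.pyplot as plt

# Random data: positions, colors, and sizes
x = np.random.randn(100)
y = np.random.randn(100)
colors = np.random.rand(100)
sizes = 1000 * np.random.rand(100)

# alpha controls transparency; c maps to a color scale; s sets the marker size
plt.scatter(x, y, c=colors, s=sizes, alpha=0.2)
plt.colorbar()  # show the color scale next to the plot

plt.savefig('output/scatterplot.png')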
We can see that this scatter plot has given us the ability to simultaneously explore four different dimensions of the data. If this were a real dataset, the x, y location of each point could be used to represent two distinct features. The size of the points could be used to represent a third feature, and the color mapping could be used to represent different classes or groups of the data points. Multicolor and multifeature scatter plots like this are very useful both for exploring and presenting data. Here is an interesting example:
(Example: a multi-feature scatter plot with total_bill on the x-axis.)
2. Line Plots
A line plot displays data points connected by straight line segments, instead of showing them individually. This type of plot is useful for finding relationships between datasets. We can easily see if values for one dataset are affected by the variations in the other. This will tell us if they are correlated or not. For example, in the plot below, D and C don't seem to be going their own ways independently:
Say we have a dataset with the population of two cities over time and we want to see if there is some correlation in their population sizes. We can use the plt.plot() function to plot population against time. We would put time on the x-axis and population on the y-axis.
In order to compare the population sizes of the two cities, we want the population data for both to be represented on the same plot. To create a single figure with multiple lines, we can just call the plot function multiple times. And then, to distinguish between the two, we can adjust the colors using the color keyword.
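A minimal sketch of that idea; the population numbers below are made up purely for illustration:

import matplotlib.pyplot as plt

# Hypothetical population data (in thousands) for two cities
years = [2000, 2005, 2010, 2015, 2020]
city_a = [500, 540, 590, 650, 720]
city_b = [300, 320, 345, 370, 400]

# Calling plot() twice draws both lines on the same figure
plt.plot(years, city_a, color='blue', label='City A')
plt.plot(years, city_b, color='green', label='City B')
plt.xlabel('Year')
plt.ylabel('Population (thousands)')
plt.legend()

plt.savefig('output/lineplot.png')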
3. Histograms
Histograms are useful for understanding the distribution of data points. Basically, a histogram is a plot where data is split into bins. A bin is a range of values or an interval. For each bin, we get the count of how many values fall into it. The x-axis represents the bin ranges while the y-axis shows the frequency. The bins are usually specified as consecutive, non-overlapping intervals of a variable. For example, in the plot below, we have information about how many flights were delayed for a range of time intervals:
(Example histogram: the number of delayed flights per bin, with Delay (min) on the x-axis.)
import numpy as np
import matplotlib.pyplot as plt

# Some normally distributed data for the example histogram
data = np.random.randn(1000)

plt.hist(data)
plt.savefig('output/histogram.png')
hist() has many options that can be used to tune both the calculation and the display. Let's see an example of a more customized histogram for the same data:
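A sketch of what such a customization might look like, reusing the data array from the snippet above; every keyword below is a standard hist() option, and the specific values are just examples:

plt.hist(data, bins=30, density=True, histtype='stepfilled', color='steelblue', alpha=0.5)
plt.savefig('output/histogram_custom.png')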
To create multiple histograms in the same plot, we can either call the hist() function multiple times or pass all the input values to it at once. The x parameter of hist() can take either a single array or a sequence of arrays as input.
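For example, a minimal sketch passing a sequence of arrays in a single call (the data here is randomly generated just for the demo):

import numpy as np
import matplotlib.pyplot as plt

data1 = np.random.normal(0, 0.8, 1000)
data2 = np.random.normal(3, 1.5, 1000)

# Passing a sequence of arrays draws both histograms in one call
plt.hist([data1, data2], bins=30, alpha=0.6, label=['data1', 'data2'])
plt.legend()
plt.savefig('output/histograms.png')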
Data Visualization Techniques - Bar and Box Plot
Visualization Techniques
* 4. Bar Plots
* 5. Box Plots
* Final Thoughts
Visualization Techniques
4. Bar Plots
Bar plots are an effective visualization technique when we have categorical data that consists of various categories or groups.
For example, we can use bar plots to view test scores categorized by gender. These plots allow us to easily see the difference between categories (gender) because the size of the bar varies with the magnitude of the represented variable (score), and categories are clearly divided and color-coded.
However, there is a catch. Bar plots work well when we are dealing with a few categories. If there are too many categories, though, the bars can get very cluttered and the plots can quickly become hard to understand.
We can have several variations of bar plots. The image below shows three different types of bar plots: regular, grouped, and stacked:
Notice that bar plots look very similar to histograms. How do we distinguish between the two? The key difference is that histograms bin continuous numerical data into intervals, while bar plots compare discrete categories. Say we have data about the number of users for a few programming languages:
# Input data
labels = ('Python', 'C++', 'Java', 'Perl', 'C#')
num_users = [12, 8, 15, 4, 6]
To make a bar plot from this data, we first need to convert the labels from string data into numerical positions that can be used for plotting purposes. We can use NumPy's arange method to generate an array of sequential numbers of the same length as the labels array, like this:
index = np.arange(len(labels))
# index: [0, 1, 2, 3, 4]
Now we can easily represent the languages on the x-axis and num_users on the y-axis using the plt.bar() method:
import numpy as np
import matplotlib.pyplot as plt

# Input data
labels = ('Python', 'C++', 'Java', 'Perl', 'C#')
num_users = [12, 8, 15, 4, 6]
index = np.arange(len(labels))

# Plotting
plt.bar(index, num_users, align='center', alpha=0.5)
plt.xticks(index, labels)
plt.xlabel('Language')
plt.ylabel('Num of Users')
plt.title('Programming language usage')

plt.savefig('output/barplot.png')
5. Box Plots
We previously looked at histograms, which were great for visualizing the distribution of variables. But what if we need more information than that? Box plots provide us with a statistical summary of the data. They allow us to answer questions like:
Note: If all of these terms sound alien to you at the moment, don't worry! We will cover these terms, and get back to box plots in more detail, in the Statistics lessons.
In a box plot, the bottom and top of the box are always the 1st and 3rd quartiles (the 25th and 75th percentiles of the data), and the band inside the box is always the 2nd quartile, the median. The dashed lines with the horizontal bars on the end, or whiskers, extend from the box to show the range of the data:
plt.boxplot(data_to_plot)
This method also takes various parameters as input to customize the color and appearance of the boxes, whiskers, caps, and median. Let's see an example based on some randomly generated data:
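A minimal sketch, using made-up random data:

import numpy as np
import matplotlib.pyplot as plt

# Three groups of randomly generated data with increasing spread
data_to_plot = [np.random.normal(0, std, 100) for std in (1, 2, 3)]

plt.boxplot(data_to_plot)
plt.savefig('output/boxplot.png')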
Final Thoughts
These were examples of some must-know data visualization techniques using Python's most popular visualization library, Matplotlib. These are simple yet powerful visualization techniques which you can use to extract rich insights from your datasets.
Of course, the visualization story doesn't end here: not only are there other, more elaborate visualizations to extract even deeper information from your data, like heat maps and density plots, but there are many other useful libraries for cool visualizations too. Here is a brief introduction to the most popular ones:
Again, data science is an extensive field, so you cannot, and should not, expect to learn and remember all the possible libraries, methods, and their details!
We will get back to creating more rich visualizations in the Projects section. We will also work with more libraries, like Seaborn, and build upon what we have learned so far. Stay tuned!
Data Visualization Cheat Sheet
In general, line, bar, and column charts are good to represent change over time. Pie charts can show parts of a whole, and scatter plots are nice if you have a lot of data.
To make your life easier, here is a cool cheat sheet on selecting the right visualization methods from the Harvard CS-109 extension program:
Pick the Right Data Visualization (Image Credits: Harvard CS-109 extension program)
Quiz: Data Visualization
A) Scatter Plot
B) Box Plot
C) Histogram
In the data visualization lessons, we saw that we can easily obtain insights about the data using various types of plots. So where does statistics fit in?
* Basic Concepts
* Mean
* Median
* Standard Deviation
* Correlation Coefficient
Basic Concepts
The first step in analyzing data is to get familiar with it. Our good old NumPy provides a lot of methods that can help us do this easily. We are going to look at some of these methods in this lesson. Along the way, we are going to understand the meaning of important statistical terms as we encounter them.
The most basic yet powerful terms that you could come across are the mean, mode, median, standard deviation, and correlation coefficient. Let's understand these with an example dataset and using NumPy.
Run the code in the widget below and try to understand what's happening before reading the description that follows.
import numpy as np

# The dataset
learning_hours = [1, 2, 6, 4, 10]
scores = [3, 4, 6, 5, 6]

# Applying some stats methods to understand the data:
print("Mean learning time: ", np.mean(learning_hours))
print("Mean score: ", np.mean(scores))
print("Median learning time: ", np.median(learning_hours))
print("Standard deviation: ", np.std(learning_hours))
print("Correlation between learning hours and scores:", np.corrcoef(learning_hours, scores))
Mean
The mean value is the average of a dataset: the sum of the elements divided by the number of elements. As the name says, np.mean() returns the arithmetic mean of the dataset.
Median
The median is the middle element of the set of numbers. If the length of the array is odd, np.median() gives us the middle value of a sorted copy of the array. If the length of the array is even, we get the average of the two middle numbers.
Standard Deviation
Standard deviation is a measure of how much the data is spread out, and it is returned by the np.std() method. More specifically, standard deviation shows us how much our data is spread out around the mean. Standard deviation can answer questions like "Are all the scores close to the average?" or "Are lots of scores way above or way below the average score?" Using standard deviation, we have a standard way of knowing what is normal and what is extra high or extra low.
In mathematical terms, standard deviation is the square root of the variance. So now you ask, "What is variance?"
Variance is defined as the average of the squared differences from the mean. Let me break this down for you.
Let's calculate the standard deviation for learning hours manually. First, let's get the mean value:
Mean = (1 + 2 + 6 + 4 + 10) / 5 = 23 / 5 = 4.6
Now, to calculate the variance, get the difference of each element from the mean, square it, and then average the results. Finally, the standard deviation is just the square root of the variance:
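Filling in the arithmetic for our learning-hours data (1, 2, 6, 4, 10):

Variance = ((1 - 4.6)^2 + (2 - 4.6)^2 + (6 - 4.6)^2 + (4 - 4.6)^2 + (10 - 4.6)^2) / 5
         = (12.96 + 6.76 + 1.96 + 0.36 + 29.16) / 5
         = 51.2 / 5
         = 10.24

Standard deviation = sqrt(10.24) = 3.2

This matches what np.std(learning_hours) prints in the widget above.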
Correlation Coefficient
When two sets of data are strongly linked together we say they have a high correlation. np.corrcoef() returns a matrix with the correlation coefficients. This method comes in handy when we want to see if there is a correlation between two or more variables in our dataset.
For example, say an ice cream shop has data for how many sunglasses were sold by a big store in the neighborhood over a period of time, and they decide to compare sunglasses sales to their ice cream sales. From their results, they find a high correlation between the sales of the two. Does this mean that sunglasses make people want to buy ice cream? Hmm, no!
So, in layman's terms, "Correlation Is Not Causation" reminds us that correlation does not prove that one thing/event causes the other.
Image Credits: https://xkcd.com/925/
Getting back to our first example, performing a simple statistical analysis on our dataset gave us a lot of insights. In summary, we can see that:
(Box plot anatomy: MAXIMUM is the greatest value, excluding outliers; the UPPER QUARTILE means 25% of the data is greater than this value; the MEDIAN means 50% of the data is greater than this value, the middle of the dataset; the MINIMUM is the least value, excluding outliers; an OUTLIER lies more than 3/2 times beyond the lower or upper quartile.)
* The ends of the box are the first (lower) and third (upper) quartiles; the box spans the so-called interquartile range. The first quartile basically represents the 25th percentile, meaning that 25% of the data points fall below the first quartile. The third quartile is the 75th percentile, meaning that 75% of the points in the data fall below the third quartile.
* The median, marked by a horizontal line inside the box, is the middle value of the dataset, the 50th percentile. The median is used instead of the mean because it is more robust to outlier values (we will talk about this again later and understand why).
* The whiskers are the two lines outside the box that extend to the highest and lowest (or min/max) observations in our data.
Five-Number Summary
To recap, a five-number summary is made up of these five values: the maximum value, the minimum value, the lower quartile, the upper quartile, and the median.
Isn't this a lot of useful information from a few simple statistical features that are easy to calculate? Remember to make use of them while doing a preliminary investigation of a large dataset, when comparing two or more datasets, and when you need a descriptive analysis of your data, including its skewness or outliers.
Basics of Probability
* What Is Probability?
* Why Is Probability Important?
* How Does Probability Fit in Data Science?
* Calculating Probability of Events
  * a. Independent Events
  * b. Dependent Events
  * c. Mutually Exclusive Events
  * d. Inclusive Events
* Conditional Probability
What Is Probability?
Probability is the numerical chance that something will happen; it tells us how likely it is that some event will occur.
For example, if it is 80% likely that my team will win today, the probability of the outcome "the team won" for today's match is 0.8, while the probability of the opposite outcome, "it lost", is 0.2, i.e., 1 - 0.8. Probability is represented as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)
Example 1
What is the probability you get a 6 when you roll a die?
A die has 6 sides, and 1 side contains the number 6. We have 1 wanted outcome out of the 6 possible outcomes; therefore, the probability of getting a 6 is 1/6.
a. Independent Events
Example 2
What is the probability of getting three 6s if we have 3 dice?
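Independent events are events whose outcomes do not affect each other, so their individual probabilities simply multiply. For three dice:

P(three 6s) = 1/6 * 1/6 * 1/6 = 1/216 ≈ 0.46%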
b. Dependent Events
Example 3
What is the probability of choosing two red cards in a deck of cards?
Two events are dependent when the outcome of the first event affects the outcome of the second event. To determine the probability of two dependent events, we use the following formula:
P(A and B) = P(A) * P(B | A)
Since a deck of cards has 26 black cards and 26 red cards, the probability of randomly choosing a red card is:
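P(first card is red) = 26/52 = 1/2

After one red card has been removed, only 25 of the remaining 51 cards are red, so:

P(two red cards) = 26/52 * 25/51 = 25/102 ≈ 0.245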
c. Mutually Exclusive Events
Mutually exclusive events are events that cannot happen at the same time, so the probability of both occurring together is zero:
P(A and B) = 0
The probability that one of the events occurs is the sum of their individual probabilities.
Example 4
a. What is the probability of getting a King and a Queen from a deck of cards?
A card cannot be a King AND a Queen at the same time! So the probability of a King and a Queen is 0 (impossible).
d. Inclusive Events
Inclusive events are events that can happen at the same time. To get the probability of an inclusive event, we first add the probabilities of the individual events and then subtract the probability of the two events occurring together:
Example 5
If you choose a card from a deck, what is the probability of getting a Queen or a Heart?
It is possible to get a Queen and a Heart at the same time: the Queen of Hearts, which is the intersection P(X and Y). So:
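P(Queen or Heart) = P(Queen) + P(Heart) - P(Queen of Hearts)
                  = 4/52 + 13/52 - 1/52
                  = 16/52 = 4/13 ≈ 0.31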
Conditional Probability
Conditional probability is a measure of the probability of an event given that another event has occurred. In other words, it is the probability of one event occurring with some relationship to one or more other events.
Say event X is that it is raining outside, and there is a 0.3 (30%) chance of rain today. Event Y might be that you will need to go outside, with a probability of 0.5 (50%).
A conditional probability would look at these two events, X and Y, in relationship with one another. In the previous example, this would be the probability that it is both raining and you need to go outside.
Example 6
What is the probability of drawing 2 Kings from a deck of cards?
* For the first card, the chance of drawing a King is 4 out of 52, since there are 4 Kings in a deck of 52 cards: P(X) = 4/52
* After removing a King from the deck, only 3 of the 51 cards left are Kings, meaning the probability of the 2nd card drawn being a King is less likely: P(Y | X) = 3/51
* So, the chance of getting 2 Kings is P(X) * P(Y | X) = 4/52 * 3/51 = 12/2652 ≈ 0.45%
FREQUENTIST STATISTCIAN:
THE PROGAUITYOF TH RESULT
BAYESIAN STFTTOAN:
DAD SY GAME 007 BerYOu $5
<0.05, T. CONKILDE HASNT
Let's look at an example. Ever wonder how a spam filter could be designed?
Say an email containing "You won the lottery" gets marked as spam. The question is, how can a computer understand that emails containing certain words are likely to be spam? Bayesian statistics does the magic here!
Spam filtering based on a blacklist would be too restrictive and it would have a high false-negative rate: spam that goes undetected. Bayesian filtering can help by allowing the spam filter to learn from previous instances of spam. As we analyze the words in a message, we can compute its probability of being spam using Bayes' Theorem. And as the filter gets trained with more and more messages, it updates the probabilities that certain words lead to spam messages. Bayesian statistics takes into account previous evidence.
P(A|B) = P(A) * P(B|A) / P(B)
This tells us how often A happens given that B happens, written P(A|B), when we have the following information: how often B happens given that A happens, P(B|A); how likely A is on its own, P(A); and how likely B is on its own, P(B).
For the spam filter, this becomes:
P(spam|words) = P(spam) * P(words|spam) / P(words)
Say we have a clinic that tests for allergies and we want to find out a patient's probability of having an allergy. We know that our test is not always right:
* For people that really do have the allergy, the test says "Positive" 80% of the time: P(Positive | Allergy) = 0.8
* The probability of the test saying "Positive" to anyone is 10%: P(Positive) = 0.1
If 1% of the population actually has the allergy, P(Allergy) = 0.01, and a patient's test says "Positive", what is the chance that the patient really does have the allergy, i.e., P(Allergy | Positive)?
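Plugging these numbers into Bayes' Theorem:

P(Allergy | Positive) = P(Positive | Allergy) * P(Allergy) / P(Positive)
                      = 0.8 * 0.01 / 0.1
                      = 0.08

So even with a positive test, the chance that the patient really has the allergy is only about 8%.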
Introduction
* Random Variables
* Probability Functions
Introduction
We have learned that probability gives us the percent chance of an event occurring. Now, what if we want an understanding of the probabilities of all the possible values in our experiment? This is where probability distributions come into play.
Random Variables
For the next couple of lessons, we are going to look at some of the most important probability distributions. But before we dive into probability distributions, we need to understand the different types of data we can encounter.
* Discrete data (a.k.a. discrete variables) can only take specified values. For example, when we roll a die, the possible outcomes are 1, 2, 3, 4, 5, or 6 and not 1.5 or 2.45.
* Continuous data (a.k.a. continuous variables) can take any value within a range. This range can be finite or infinite. Continuous variables are measurements like height, weight, and temperature.
Probability Functions
There is just one more concept we need to understand before jumping into the different distributions.
The probability function for a discrete random variable is often called the Probability Mass Function, while for continuous variables we have the so-called Probability Density Function (a.k.a. Probability Distribution Function).
(Figure: the probability mass function (pmf) p(S), or discrete probability distribution function for the discrete variable S, specifies the probability distribution for the sum of counts from two dice. For example, the figure shows that p(11) = 2/36 = 1/18. The pmf allows the computation of probabilities of events such as P(S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.)
For example, we could have a continuous random variable that represents the possible weights of people in a group:
The probability density function shows all possible values for Y. For example, the random variable Y could be 100 lbs, 153.2 lbs, or 201.9999 lbs.
The probability density function can help us answer things like: What is the probability that a person will weigh between 170 lbs and 200 lbs?
Now that we have done the groundwork, in the next lessons we are going to cover the most important distributions for both discrete and continuous data types.
Types of Distributions - Uniform, Bernoulli, and Binomial
* Types of Distributions
* 1. Uniform Distribution
* 2. Bernoulli Distribution
* 3. Binomial Distribution
Types of Distributions
A few words before we start:
Pay special attention to the Normal Distribution and its properties; you should know that one really well, as you are likely to encounter it most frequently.
1. Uniform Distribution
This is a basic probability distribution where all the values have the same probability of occurrence within a specified range; all the values outside that range have a probability of 0. For example, when we roll a fair die, the outcomes can only be from 1 to 6 and they all have the same probability, 1/6. The probability of getting anything outside this range is 0; you can't get a 7.
(Plot of two uniform distributions, Uniform(1,6) and Uniform(4,12).)
We can see that the shape of the uniform distribution curve is rectangular. This is the reason why this is often called the rectangular distribution.
The probability density function of a uniform random variable X is:
f(x) = 1 / (b - a)
where a and b are the minimum and maximum values of the possible range for X.
The mean and variance of the variable X can then be calculated like so:
Mean = E(X) = (a + b) / 2
Variance = V(X) = (b - a)^2 / 12
2. Bernoulli Distribution
Although the name sounds complicated, this is an easy one to grasp. A Bernoulli distribution describes a single trial with exactly two possible outcomes: "success" and "failure".
For example, the probability, p, of getting Heads ("success") while flipping a coin is 0.5. The probability of "failure" is 1 - p, i.e., 1 minus the probability of success. There's no midway between the two possible outcomes.
A random variable, X, with a Bernoulli distribution can take the value 1 with the probability of success, p, and the value 0 with the probability of failure, 1 - p.
The probabilities of success and failure do not need to be equally likely; think about the results of a football match. If we are considering a strong team, the chances of winning would be much higher compared to those of a mediocre one. The probability of success, p, would be much higher than the probability of failure; the two probabilities wouldn't be the same.
There are many examples of the Bernoulli distribution, such as whether it's going to rain tomorrow or not (rain in this case would mean success and no rain failure), or passing (success) and not passing (failure) an exam.
Say p = 0.3; we can graphically represent the Bernoulli distribution like so:
(Bar chart of a Bernoulli distribution with p = 0.3.)
The probability mass function is: P(X = 1) = p and P(X = 0) = 1 - p.
The expected value, E(X), of a random variable, X, having a Bernoulli distribution can be found as follows:
E(X) = 1 * p + 0 * (1 - p) = p
3. Binomial Distribution
The Bernoulli distribution allowed us to represent experiments that have two outcomes but only a single trial. What if we have multiple trials? Say we toss a coin not once but many times. This is where an extension of the Bernoulli distribution comes into play: the Binomial Distribution.
Again, just like in the case of Bernoulli, the outcomes don't need to be equally likely. Also, each trial is independent: the outcome of a previous toss doesn't affect the outcome of the current toss.
(Binomial distribution PDFs for n = 20 and p = 0.1, 0.5, 0.9, with the random variable on the x-axis and probability on the y-axis. Image Credits: https://www.boost.org)
Mean = E(X) = n * p
Variance = V(X) = n * p * (1 - p)
Do you notice something from these formulas? We can observe that the Bernoulli distribution that we saw earlier was just a special case of the Binomial, obtained by setting n = 1.
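As a quick, hedged sketch, we can check these formulas by simulating binomial draws with NumPy (the parameters below are arbitrary):

import numpy as np

n, p = 10, 0.5                                    # 10 coin flips per trial
samples = np.random.binomial(n, p, size=100000)

print(samples.mean())   # close to n * p = 5.0
print(samples.var())    # close to n * p * (1 - p) = 2.5

# With n = 1 the binomial reduces to a Bernoulli trial
bernoulli_samples = np.random.binomial(1, p, size=100000)
print(bernoulli_samples.mean())   # close to p = 0.5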
(Example of a binomial distribution chart, with the number of successes on the x-axis. Image Credits: https://www.spss-tutorials.com/binomial-test/)
We are not done yet! We have 3 more distributions to cover, and we still have to learn about the most important continuous distribution. On to the next lesson!
Types of Distributions - Normal
The bell curve is symmetrical: half of the data will fall to the left of the mean value and half will fall to the right of it.
Image Credits: https://www.mathsisfun.com
* Heights of people
* Blood pressure
* IQ scores
* Salaries
* Size of objects produced by machines
The number of standard deviations from the mean is also called the standard score or z-score. Z-scores are a way to compare results from a test to a "normal" population.
Mean = E(X) = μ
Variance = V(X) = σ^2
where μ is the mean value and σ is the standard deviation.
(Standardizing a normal distribution. Image Credits: https://www.mathsisfun.com/data/standard-normal-distribution.html)
Note: This is the most important continuous distribution. So, make sure you understand it well before moving on to the next lesson.
Types of Distributions - Poisson and Exponential
5. Poisson Distribution
6. Exponential Distribution
5. Poisson Distribution
This distribution gives us the probability of a given number of events happening in a fixed interval of time.
Say we track the number of breads sold by a bakery every day. If the average number for seven days is 500, we can predict the probability of a certain day having more sales, e.g., more on Sundays. Another example could be the number of phone calls received by a call center per hour. Poisson distributions can be used to make forecasts about the number of customers or sales on certain days or seasons of the year.
Think about it: if more items than necessary are kept in stock, it means a loss for the business. On the other hand, under-stocking would also result in a loss, because customers need to be turned away due to not having enough stock. Poisson can help businesses estimate when demand is unusually high, so that they can plan for the increase in demand in advance while keeping waste of resources to a minimum. However, its applications are not only for sales or specifically business related; some different kinds of examples could be forecasting the number of earthquakes happening next month, or traffic flow and ideal gap distances.
P(X = x) = (e^(-μ) * μ^x) / x!
where,
x = 0, 1, 2, 3, ...,
e = the natural number e,
μ = mean number of successes in the given time interval,
X, the Poisson random variable, is the number of events in a time interval and P(X) is its probability distribution (probability mass function).
μ = λ * t
where,
λ is the rate at which an event occurs,
t is the length of a time interval
Let's look at a graphical representation of a Poisson distribution, and how it varies with the change in the expected number of occurrences:
The horizontal axis is the index k, the number of occurrences. λ is the expected number of occurrences, which need not be an integer. The vertical axis is the probability of k occurrences given λ. The function is defined only at integer values of k.
6. Exponential Distribution
The exponential distribution allows us to go a step further than the Poisson distribution. Say we are using Poisson to model the number of accidents in a given time period. What if we wanted to understand the time interval between the accidents? This is where the exponential distribution comes into play; it allows us to model the time in between each accident.
f(x) = λ * e^(-λx), for x ≥ 0
where,
e = the natural number e,
λ = the rate of events (so the mean time between events is 1/λ),
x = a random variable
A graphical representation of the density function for varying values of the mean time between events looks like this:
We can observe that the greater the rate of events, the faster the curve drops; and the lower the rate, the flatter the curve.
The first step is to identify whether we are dealing with a continuous or discrete random variable. Once that's done, here are some helpful hints to proceed from there:
* If we see a Normal (Gaussian) distribution, we should go for it, because there are many algorithms that, by default, will perform well specifically with this distribution; it is the most widely applicable distribution. It's called "Normal" for a reason!
(Cartoon: "I always feel so normal, so bored, you know. Sometimes I would like to do something... you know... something... mmm... Poissonian.")
# Import libraries
import numpy as np
import seaborn as sns

# Create some random fake data
x = np.random.random(size=100)

# Plot the distribution
sns.distplot(x);

# Now some normally distributed data
x = np.random.normal(size=100)

# Plot the distribution
sns.distplot(x);
Distribution plots: Randomly distributed data (left) and Normally distributed data (right)
Final Thoughts
Probability distributions are a tool that you must have in your data scientist's toolbox; you will need them at one point or another! In these lessons we have done a deep dive into six major distributions and learned about their applications. Now, you do not need to memorize their functions and all the nitty-gritty details. However, it is important to be able to identify, relate, and differentiate among these distributions.
As with most skills and concepts in life, breaking things down into sub-skills or basic components is a great way to approach learning. In this lesson we are going to chunk statistical significance into its base components and then put all the pieces together to understand this concept in an intuitive, bottom-up approach.
* Hypothesis Testing
* Normal Distribution
* P-values
We will first understand these three components theoretically and then we will put it all together with the help of a practical example.
1. Hypothesis Testing
Hypothesis testing is a technique for evaluating a theory using data. The hypothesis is the researcher's initial belief about the situation before the study. The commonly accepted fact is known as the null hypothesis, while the opposite is the alternate hypothesis. The researcher's task is to reject, nullify, or disprove the null hypothesis. In fact, the word "null" is meant to imply that it's a commonly accepted fact that researchers work to nullify (zero effect).
For example, if we consider a study about cell phones and cancer risk, we might have the following hypotheses:
(Null hypothesis: cancer risk is not affected by cell phone use. Alternate hypothesis: cancer risk is affected by cell phone use.)
There are many hypothesis tests that work by making comparisons, either between two groups or between one group and the entire population. We are going to look at the most commonly used one, the z-test.
Does z-test ring any bells? In the previous lessons, we came across the concept of z-scores while learning about normal distributions. Remember? The second building block of statistical significance is built upon normal distributions and z-scores. If you need a refresher, revisit the section on normal distributions before continuing further.
2. Normal Distribution
As we learned earlier, the normal distribution is used to represent the distribution of our data and it is defined by the mean, μ (center of the data), and the standard deviation, σ (spread in the data). These are two important measures, because any point in the data can then be represented in terms of its standard deviations from the mean:
(For the normal distribution, the values less than one standard deviation away from the mean account for 68% of the set, two standard deviations from the mean account for 95%, and three standard deviations account for 99.7%. Image Credits: Wikipedia)
Standardizing the results by using z-scores, where you subtract the mean from the data point and divide by the standard deviation, gives us the standard normal distribution.
From z-score to z-test: a z-test is a statistical technique to test the Null Hypothesis against the Alternate Hypothesis. This technique is used when the sample data is normally distributed and the sample size is greater than 30. Why 30?
According to the Central Limit Theorem, as the sample size grows and the number of data points exceeds 30, the distribution of the sample mean is considered to be approximately normal. So whenever the sample size exceeds 30, we assume the data is normally distributed and we can use the z-test.
As the name implies, z-tests are based on z-scores, which tell us where the sample mean lies compared to the population mean:
z = (x̄ - μ) / (σ / √n)
where,
x̄: mean of the sample,
μ: mean of the population,
σ: standard deviation of the population,
n: number of observations
But what determines how high the high should be, and how low the low should be, in order for us to accept the results as meaningful?
3. P-value
The p-value quantifies the rareness of our results. It tells us how often we'd see the numerical results of an experiment (our z-scores) if the null hypothesis is true and there are no differences between the groups. This means that we can use p-values to reach conclusions in significance testing.
Although the choice of α depends on the situation, 0.05 is the most widely used value across all scientific disciplines. This means that p < .05 is the threshold beyond which study results can be declared statistically significant, i.e., it's unlikely the results were a result of random chance. If we ran the experiment 100 times, we'd expect to see these same numbers, or more extreme results, about 5 times, assuming the null hypothesis is true.
Again, a p-value of less than .05 means that there is less than a 5% chance of seeing our results, or more extreme results, in a world where the null hypothesis is true.
Note that p < .05 does not mean there's less than a 5% chance that our experimental results are due to random chance. The false-positive rate for experiments can be much higher than 5%!
(xkcd chart: tongue-in-cheek "interpretations" of p-values, ranging from "highly significant" near 0.001, through "on the edge of significance" around 0.05, to "hey, look at this interesting subgroup analysis" above 0.1.)
Image Credits: https://xkcd.com/
Note: Since this is a tricky concept that most people get wrong, it is important to understand it well, so once again: the p-value doesn't necessarily tell us whether our experiment was a success or not, and it doesn't prove anything! It just gives us the probability that a result at least as extreme as the one observed would have occurred if the null hypothesis is true. The lower the p-value, the more significant the result, because it is less likely to be caused by noise.
Now, putting it all together: if the observed p-value is lower than the chosen threshold α, then we conclude that the result is statistically significant.
As a final note, an important takeaway is that, at the end of the day, calculating p-values is not the hardest part here! The real deal is to interpret the p-values so that we can reach sensible conclusions. Does 0.05 work as the threshold for your study, or should you use 0.01 to reach any conclusions instead? And what is our p-value really telling us?
Example
Let's put all the pieces together by looking at an example from start to finish.
A company claims that it has a high hiring bar, which is reflected in its employees having an IQ above the average. Say a random sample of 40 of their employees has a mean IQ score of 115. Is this sufficient evidence to support the company's claim, given that the mean population IQ is 100 with a standard deviation of 15?
1. State the Null Hypothesis: the accepted fact is that the population mean is 100, i.e., H0: μ = 100.
2. State the Alternate Hypothesis: the claim is that the employees have above-average IQ scores, i.e., Ha: μ > 100.
3. State the threshold for the p-value, the α level: we will stick with the most widely used value of 0.05.
4. Compute the z-score:
z = (115 - 100) / (15 / √40) = 6.32
The company mean score is 115, which is 6.32 standard error units from the population mean of 100.
5. Get the p-value from the z-score: using an online calculator for converting z-scores to p-values, we see that the probability of observing a standard normal value above 6.32 is < .00001.
6. Interpret the p-value: our result is significant at p < 0.05, so we can reject the null hypothesis; the 40 employees of interest have an unusually high mean IQ score compared to random samples of similar size from the entire population.
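As a rough sketch, steps 4 and 5 can also be reproduced in Python; this assumes SciPy is installed, and stats.norm.sf gives the one-sided tail probability:

import math
from scipy import stats

x_bar, mu, sigma, n = 115, 100, 15, 40

z = (x_bar - mu) / (sigma / math.sqrt(n))
p_value = stats.norm.sf(z)   # P(Z > z), the one-sided p-value

print(round(z, 2))   # 6.32
print(p_value)       # far below 0.05, so we reject the null hypothesis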
Final Thoughts
There were quite a few concepts in this lesson. To make sure that our understanding is crystal clear, we will ingrain these concepts with some exercises.
Also, this was the last lesson on Statistics, so well done for having come this far! Let's keep going; there are fun Machine Learning lessons awaiting us ahead!
Quiz: Statistics
* 1. Basics
* 2. Statistical Significance
1. Basics
For the given list of numbers, stored in a variable data, compute its basic statistical features and store the results in the given variables. You can use NumPy to calculate these values.
import numpy as np

# Input list
data = [23, 57, 10, 10, 12, 35, 2, 74, 302, 10]

# Replace the "None" values with your solutions
# Use NumPy to calculate the values for each variable
mean = None
median = None
standard_deviation = None

print("Mean is ", mean)
print("Median is ", median)
print("SD is ", standard_deviation)
Solution
mean = np.mean(data)
median = np.median(data)
standard_deviation = np.std(data)
2. Statistical Significance
Interpreting P-Values
Say we are working on a study that tests the impact of smoking on the duration of pregnancy. Do women who smoke run the risk of a shorter pregnancy and premature birth? Our data tells us that the mean pregnancy length is 266 days, and we have the following hypotheses:
Null hypothesis, H0: μ = 266
Alternate hypothesis, Ha: μ < 266
We also have data from a random sample of 40 women who smoked during their pregnancy. The mean pregnancy length of this sample is 260 days with a standard deviation of 21 days. The z-score tells us that the p-value in this case is 0.03.
What probability does the p = 0.03 describe? Based on the interpretation of the p-value, select whether the given statements are Valid or Invalid.
"There is a 3% chance that women who smoke will have a mean pregnancy duration of 266 days."
A) Valid
B) Invalid
Question 1 of 2
Introduction
Machine learning (ML) is a term that is often thrown around as if it is some kind of magic that, once applied to your data, will create wonders! If we look at all the articles about machine learning on planet Internet, we will stumble upon articles of two types: heavy academic descriptions filled with complicated jargon, or fluff talk about machine learning being a magic pill.
Image Credits: https://xkcd.com
In this series of lessons, we are going to have a simple introduction to the subject so that we can grasp the fundamentals well. We will dive into the practical aspects of machine learning using Python's Scikit-Learn package via an end-to-end project.
But before we continue, you might be asking yourself, "What's really the difference between Data Science and Machine Learning?!"
The two fields do have a big overlap, and they often sound interchangeable. However, if we were to consider an oversimplified definition of the two, we could say that:
Let's say we want to recognize objects in a picture. In the old days, programmers would have had to write code for every object they wanted to recognize, e.g., person, cat, vehicles. This is not a scalable approach. Today, thanks to machine learning algorithms, one system can learn to recognize all of them just by being shown many examples of each. For instance, the algorithm is able to understand that a cat is a cat by looking at examples of pictures labelled as "this is a cat" or "this is not a cat", and by being corrected every time it makes a wrong guess about the object in the picture. Then, if shown a series of new pictures, it begins to identify cat photos in the new set, just like a child learns to call a cat a cat and a dog a dog.
This magic is possible because the system learns based on the properties of the object in question, a.k.a. features.
* Data: this is why data is being called the new oil! Data can be collected both manually and automatically. For example, users' personal details like age and gender, all their clicks, and purchase history are valuable data for an online store. Do you recall "ReCaptcha", which forces you to "Select all the street signs"? That's an example of some free manual labor! Data is not always images; it could be tables of data with many variables (features), text, sensor recordings, sound samples, etc., depending on the problem at hand.
* Features: features are often also called variables or parameters. These are essentially the factors for a machine to look at, the properties of the "object" in question, e.g., users' age, stock price, area of the rental properties, number of words in a sentence, petal length, size of the cells.
Choosing meaningful features is very important. Continuing with our example of distinguishing apples from oranges, say we pick bad features like ripeness and seed count. Since these are not really distinct properties of the fruits, our machine learning system won't be able to do a good job of distinguishing between apples and oranges based on these features.
Remember that it takes practice and thought to figure out what features to use, as they are not always as clear as in this trivial example.
(Figure: example machine learning applications, such as predictive policing, surveillance, political campaigns, optical character recognition, and recommendation engines. Image Credits: Introduction to Machine Learning, Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Machine-Learning-Application_fig1_323108787)
Types of Machine Learning Algorithms
* 1. Supervised Learning
* 2. Unsupervised Learning
* 3. Semi-supervised Learning
* 4. Reinforcement Learning
* Final Thoughts
There are four broad categories of machine learning algorithms:
* Supervised Learning
* Unsupervised Learning
* Semi-supervised Learning
* Reinforcement Learning
1. Supervised Learning
In Supervised Learning, the training data provided as input to the algorithm includes the final solutions, called labels or classes, because the algorithm learns by "looking" at examples with the correct answers. In other words, the algorithm has a supervisor or a teacher who provides it with all the answers first, like whether it's a cat in the picture or not, and the machine uses these examples to learn one by one. The spam filter is another good example of this.
Another typical task, of a different type, would be to predict a target numeric value like housing prices from a set of features like size, location, and number of bedrooms. To train the system, we again need to provide many correct examples of known housing prices, including both their features and their labels.
Some of the most important supervised learning algorithms are:
* Linear Regression
* Logistic Regression
* Support Vector Machines
* Decision Trees and Random Forests
* k-Nearest Neighbors
* Neural Networks
While the focus of this lesson is to learn about the broad categories, we will be diving deeper into each of these algorithms individually in the "Machine Learning Algorithms" lesson.
2. Unsupervised Learning
In Unsupervised Learning the data has no labels; the goal of the algorithm is to find relationships in the data. This system needs to learn without a teacher. For instance, say we have data about a website's visitors and we want to use it to find groupings of similar visitors. We don't know, and can't tell the algorithm, which group a visitor belongs to; it finds those connections without help, based on some hidden patterns in the data. This customer segmentation is an example of what is known as clustering: classification with no predefined classes and based on some unknown features.
Some important unsupervised learning techniques are:
* Clustering: k-Means
* Visualization and dimensionality reduction
3. Semi-supervised Learning
Semi-supervised learning deals with partially labeled training data, usually a lot of unlabeled data with some labeled data. Most semi-supervised learning algorithms are a combination of unsupervised and supervised algorithms.
Google Photos is a good example of this. In a set of family photos, the unsupervised part of the algorithm automatically recognizes the photos in which each of the family members appears. For example, it can tell that person A appears in pictures 1 and 3, while person B appears in pictures 1 and 2. After this step, all the system needs from us is one label for each person, and then the supervised part of the algorithm can name everyone in every photo. Bingo!
4. Reinforcement Learning
Reinforcement Learning is a special and more advanced category where the learning system, or agent, needs to learn to make specific decisions. The agent observes the environment to which it is exposed, it selects and performs actions, and it gets rewards or penalties in return. Its goal is to choose actions which maximize the reward over time. So, by trial and error, and based on past experience, the system learns the best strategy, called a policy, on its own.
A good example of Reinforcement Learning is DeepMind's AlphaGo. The system learned the winning policy for the game of Go by analyzing millions of games and then playing against itself. At the championship of Go in 2017, AlphaGo was able to beat the human world champion just by applying the policy it had learned earlier by itself.
(Reinforcement learning loop: the agent observes the environment, selects an action using its policy, performs the action, gets a reward or penalty, updates its policy (the learning step), and iterates until an optimal policy is found.)
(Figure: common machine learning applications grouped by learning type, e.g., recommender systems, advertising popularity prediction, targeted marketing, market forecasting, real-time decisions, estimating life expectancy, and robot skill acquisition.)
Final Thoughts
This was a gentle introduction to Machine Learning. Hopefully, you are excited to learn more about this cool subject! Now that we are familiar with the broad types of machine learning algorithms, in the next lesson we are going to dive into the specifics of individual machine learning algorithms.
Machine Learning Algorithms I
* Introduction
* 1. Linear Regression
* 2. Logistic Regression
* 3. Decision Trees
* 4. Naive Bayes
* 5. Support Vector Machine (SVM)
Introduction
In this lesson, we are going to learn about the most popular machine learning algorithms. Note that we are not going to do a technical deep-dive, as it would be out of our scope. The goal is to cover details sufficiently so that you can navigate through them when needed. The key is to know about the different possibilities so that you can then go deeper on a needs basis.
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Naive Bayes
7. K-Means
8. Random Forest
9. Dimensionality Reduction
10. Artificial Neural Networks (ANN)
1. Linear Regression
Linear Regression is probably the most popular machine learning algorithm.
Remember in high school when you had to plot data points on a graph with an X-axis and a Y-axis and then find the line of best fit? That was a very simple machine learning algorithm: linear regression. In more technical terms, linear regression attempts to represent the relationship between one or more independent variables (points on the X-axis) and a numeric outcome or dependent variable (value on the Y-axis) by fitting the equation of a line to the data:
Y = a*X + b
Example of simple linear regression, which has oneindependent variable (x-axis) and a dependent variable (y-axis)
For example, you might want to relate the weights (Y) of individuals to their heights (X) using linear regression. This algorithm assumes a strong linear relationship between the input and output variables, as we would assume that if height increases then weight also increases proportionally, in a linear way.
The goal here is to derive optimal values for a and b in the equation above, so that our estimated values of Y can be as close as possible to the correct values. Note that we know the actual values for Y during the training phase, because we are trying to learn our equation from the labelled examples given in the training dataset.
Once our machine learning model has learned the line of best fit via linear regression, this line can then be used to predict values for new or unseen data points.
Different techniques can be used to learn the linear regression model. The most popular method is that of least squares:
Ordinary least squares: The method of least squares calculates the best-fitting line such that the vertical distances from each data point to the line are minimized. The distances in green (figure below) should be kept to a minimum so that the data points in red can be as close as possible to the blue line (the line of best fit). If a point lies on the fitted line exactly, then its vertical distance from the line is 0.
To be more specific, in ordinary least squares, the overall distance is the sum of the squares of the vertical distances (green lines) for all the data points. The idea is to fit a model by minimizing this squared error, or distance.
In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x). While finding the line of best fit, the goal is to minimize the distance shown in green, keeping the red points as close as possible to the blue line.
Note: While using libraries like Scikit-Learn, you won't have to implement any of these functions yourself. Scikit-Learn provides out-of-the-box model fitting! We will see this in action once we reach our Projects section.
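As a quick, hedged illustration (not from the original lesson), here is what fitting a linear regression with Scikit-Learn might look like on a tiny, made-up height/weight dataset:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: heights in cm (X) and weights in kg (y)
X = np.array([[150], [160], [170], [180]])
y = np.array([50, 56, 64, 72])

model = LinearRegression().fit(X, y)   # least-squares fit under the hood
print(model.coef_, model.intercept_)   # the learned a and b
print(model.predict([[175]]))          # predicted weight for an unseen height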
2. Logistic Regression
Logistic regression has the same main idea as linear regression. The difference is that this technique is used when the output or dependent variable is binary, meaning the outcome can have only two possible values. For example, let's say that we want to predict if age influences the probability of having a heart attack. In this case, our prediction is only a "yes" or "no": only two possible values.
In logistic regression, the line of best fit is not a straight line anymore. The prediction for the final output is transformed using a non-linear S-shaped function called the logistic function, g(). This logistic function maps the intermediate outcome values into an outcome variable Y with values ranging from 0 to 1. These 0 to 1 values can then be interpreted as the probability of occurrence of Y.
Graph of a logistic regression curve showing probability of passing an exam versus hours studying
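A minimal, hypothetical sketch (not part of the original lesson) of fitting a logistic regression for the heart-attack example with Scikit-Learn:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: age (X) and whether a heart attack occurred (y: 0 = no, 1 = yes)
X = np.array([[25], [35], [45], [55], [65], [75]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[50]]))        # predicted class: 1 ("yes") or 0 ("no")
print(clf.predict_proba([[50]]))  # probabilities produced by the logistic function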
3. Decision Trees
Decision Trees also belong to the category of supervised learning algorithms, but they can be used for solving both regression and classification tasks.
Say we want to predict whether or not we should wait for a table at a restaurant. Below is an example decision tree that decides whether or not to wait in a given situation, based on different attributes:
Figure: An example decision tree. The root node tests Patrons (None/Some/Full); one branch leads directly to Yes/No decisions, while another tests attributes such as WaitEstimate (e.g., <10 or 30-60 minutes) before deciding. Image Credits: http://www.cs.bham.ac.uk/~mmk/Teaching/AI/
Our data instances are then classified into "wait" or "leave" based on the attributes listed above. From the visual representation of the decision tree, we can see that "wait-estimate" is more important than "raining" because it is present at a relatively higher node in the tree.
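As a hedged sketch (with a made-up encoding of the restaurant attributes), a decision tree classifier in Scikit-Learn could look like this:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: [patrons (0=None, 1=Some, 2=Full), wait_estimate_minutes, raining (0/1)]
X = [[1, 0, 0], [2, 40, 1], [0, 0, 0], [2, 15, 0]]
y = ['wait', 'leave', 'leave', 'wait']

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[2, 20, 1]]))  # decide for a new situation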
4. Naive Bayes
Naive Bayes is a simple yet widely used machine learning algorithm based on the Bayes Theorem. Remember we talked about it in the Statistics section? It is called naive because the classifier assumes that the input variables are independent of each other, quite a strong and unrealistic assumption for real data! The Bayes theorem is given by the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
where,
P(c|x) = probability of the event of class c, given the predictor variable x,
P(x|c) = probability of x given c,
P(c) = probability of the class,
P(x) = probability of the predictor.
Say we have a training data set with weather conditions, x, and the corresponding target variable "Played", c. We can use this to obtain the probability of "Players will play if it is rainy", P(c|x). Note that even if the answer is a numerical value ranging from 0 to 1, this is an example of a classification problem: we can use the probabilities to reach a "yes/no" outcome.
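A hedged, hypothetical sketch of the weather/"Played" example using Scikit-Learn's CategoricalNB (the encoding and data below are made up):

from sklearn.naive_bayes import CategoricalNB

# Hypothetical encoding of weather: 0 = sunny, 1 = overcast, 2 = rainy
X = [[0], [0], [1], [2], [2], [1], [2], [0]]
y = [1, 0, 1, 1, 0, 1, 0, 1]   # Played: 1 = yes, 0 = no

nb = CategoricalNB().fit(X, y)
print(nb.predict_proba([[2]]))  # P(not play | rainy), P(play | rainy)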
5. Support Vector Machine (SVM)
The distance between the hyperplane and the closest class point is called the margin. The optimal hyperplane is the one with the largest margin; it classifies points in such a way that the distance to the closest data point from both classes is maximum.
In simple words, SVM tries to draw two lines between the data points with the largest margin between them. Say we are given a plot of two classes, black and white dots, on a graph as shown in the figure below. The job of the SVM classifier would then be to decide the best line that can separate the black dots from the white dots, as shown in the figure below:
H1 does not separate the two classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.
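A minimal, hypothetical sketch of a linear SVM classifier with Scikit-Learn (the 2-D points below are made up):

from sklearn.svm import SVC

# Hypothetical 2-D points belonging to two classes
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel='linear', C=1.0).fit(X, y)
print(svm.support_vectors_)   # the points that define the maximal margin
print(svm.predict([[4, 4]]))  # classify a new point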
Machine Learning Algorithms II
8. Random Forest
9. Dimensionality Reduction
Final Thoughts
6. K-Nearest Neighbors (KNN)
The selection of k is critical here; a small value can result in a lot of noise and inaccurate results, while a large value is not feasible and defeats the purpose of the algorithm.
Although mostly used for classification, this technique can also be used for regression problems. For example, when dealing with a regression task, the output variable can be the mean of the k instances, while for classification problems it is often the mode class value.
The distance functions for assessing similarity between instances can be the Euclidean, Manhattan, or Minkowski distance. Euclidean distance, the most commonly used one, is simply the ordinary straight-line distance between two points. To be specific, it is the square root of the sum of the squares of the differences between the coordinates of the points.
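A hedged, hypothetical KNN sketch with Scikit-Learn, using Euclidean distance and k = 3:

from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional feature values and their classes
X = [[1.0], [1.2], [3.1], [3.3], [5.0]]
y = ['a', 'a', 'b', 'b', 'b']

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y)
print(knn.predict([[2.8]]))  # majority class among the 3 nearest neighbours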
7. K-Means
K-means is a type of unsupervised algorithm for data clustering. It follows a simple procedure to classify a given data set. It tries to find K clusters or groups in the data set. Since we are dealing with unsupervised learning, all we have is our training data X and the number of clusters, K, that we want to identify, but no labelled training instances (i.e., no data with a known final output category that we could use to train our model). For example, K-Means could be used to segment users into K groups based on their purchase history.
The algorithm iteratively assigns each data point to one of the K groups based on their features. Initially, it picks K points, one for each cluster, known as centroids. A new data point is put into the cluster having the closest centroid based on feature similarity. As new elements are added to a cluster, the cluster centroid is re-computed and keeps changing. The new centroid becomes the average location of all the data points currently in the cluster. This process is continued iteratively until the centroids stop changing. At the end, each centroid is a collection of feature values that define the resulting group.
1. K initial "means" (in this case k=3) are randomly generated within the data domain (shown in color). 2. K clusters are created by associating every observation with the nearest mean; the partitions here represent the Voronoi diagram generated by the means. 3. The centroid of each of the K clusters becomes the new mean. 4. Steps 2 and 3 are repeated until convergence has been reached.
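A minimal, hypothetical K-Means sketch with Scikit-Learn for the user-segmentation example (the purchase-history features are made up):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchase-history features: [orders per year, average basket size]
X = np.array([[2, 15], [3, 20], [40, 5], [45, 7], [12, 60], [14, 55]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each user
print(kmeans.cluster_centers_)  # final centroids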
8. Random Forest
Random Forest is one of the most popular and powerful machine learning algorithms. It is a type of ensemble algorithm. The underlying idea of ensemble learning is the wisdom of crowds: the collective opinion of many is more likely to be accurate than that of one. The outcome of each of the models is combined and a prediction is made.
(a) In the training process, each decision tree is built based on a bootstrap sample of the training set, which contains two kinds of examples (green labels and red labels). (b) In the classification process, the decision for the input instance is based on the majority voting results among all individual trees. Image Source: Scientific Figure on ResearchGate, https://www.researchgate.net/figure/Illustration-of-random-forest-a-In-the-training-process-each-decision-tree-is-built_fig3_317274960
9. Dimensionality Reduction
In recent years, there has been an exponential increase in the amount of data captured. This means that many machine learning problems involve thousands or even millions of features for each training instance! This not only makes training extremely slow but also makes finding a good solution much harder. This problem is often referred to as the curse of dimensionality. In real-world problems, it is often possible to reduce the number of features considerably, making problems tractable.
For example, in an image classification problem, if the pixels on the image borders are almost always white, these pixels can be dropped from the training set completely without losing much information.
In simple terms, dimensionality reduction is about assembling specific features into more high-level ones without losing the most important information. Principal Component Analysis (PCA) is the most popular dimensionality reduction technique. Geometrically speaking, PCA reduces the dimension of a dataset by squashing it onto a lower-dimensional line, or more generally a hyperplane/subspace, which retains as much of the original data's salient characteristics as possible.
Figure: A scatter of 2-D points, with axes Feature 1 and Feature 2.
Say we have a set of 2-D points as shown in the figure above. Each dimension corresponds to a feature we are interested in. Although the points seem to be scattered quite randomly, if we pay close attention we can see that we have a linear pattern (blue line). As we said, the key point in PCA is dimensionality reduction, the process of reducing the number of dimensions of the given dataset; it does this by finding the direction along which our data varies the most.
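A hedged PCA sketch with Scikit-Learn on made-up correlated 2-D data, keeping a single component (the direction of largest variance):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 2-D data
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)        # project onto the first principal component
print(pca.explained_variance_ratio_)    # share of the variance that is retained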
10. Artificial Neural Networks (ANN)
Last but definitely not the least, let's look into Artificial Neural Networks, which are at the very core of Deep Learning.
The key idea behind ANNs is to use the brain's architecture as inspiration for how to build intelligent machines.
Neuron with signal flow from inputs at dendrites to outputs at axon terminals
To train a neural network, a set of neurons is mapped out and assigned random weights which determine how the neurons process new data (images, text, sounds, etc.). The correct relationship between inputs and outputs is learned from training the neural network on input data. Since during the training phase the system gets to see the correct answers, if the network doesn't accurately identify the input (doesn't see a face in an image, for example), then the system adjusts the weights. Eventually, after sufficient training, the neural network will consistently recognize the correct patterns in speech, text, or images.
An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.
A neural network is essentially a set of interconnected layers with weighted edges and nodes called neurons. Between the input and output layers we can insert multiple hidden layers. A basic ANN makes use of only one or two hidden layers; if we increase the depth of these layers, then we are dealing with the famous Deep Learning.
Figure: A deep neural network taking a feature/image X through an input layer and several hidden layers to an output layer that predicts the label (e.g., "Cat").
Image Credits: https://www.ibm.com/blogs/research/2019/06/deep-neural-networks
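As a hedged sketch (not the lesson's own code), a tiny feed-forward network with two hidden layers can be trained with Scikit-Learn's MLPClassifier on made-up data:

from sklearn.neural_network import MLPClassifier

# Tiny hypothetical dataset
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

mlp = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
mlp.fit(X, y)                  # weights are adjusted iteratively during training
print(mlp.predict([[1, 0]]))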
Final Thoughts
Now we have a very good overview of the most commonly used machine learning algorithms. Hope you enjoyed this walk-through!
Q: Suppose you have the record of the number of rainy days in October for the last 20 years. What is the best model to estimate the number of rainy days for the current October?
• A) Linear Regression
• B) Logistic Regression
Evaluating a Model
Image Credits: http://www.info.univ-angers.fr
A confusion matrix has two rows and two columns that report the number of false positives, false negatives, true positives, and true negatives. Basically, it is a summary table showing how good our model is at predicting examples of various classes.
For example, if we have a classification model that has been trained to distinguish between cats and dogs, a confusion matrix will summarize the results of testing the algorithm on new data. Assuming a sample of 13 animals (8 cats and 5 dogs), our confusion matrix would look like this:
                 Actual: Cat   Actual: Dog
Predicted: Cat        5             2
Predicted: Dog        3             3
• Precision: Ratio of correct positive predictions to the total predicted positives.
Precision = TP / (TP + FP)
• Recall: Ratio of correct positive predictions to the total actual positive examples in the dataset; also known as sensitivity.
Recall = TP / (TP + FN)
Figure: Relevant elements vs. selected elements. Precision answers "How many selected items are relevant?", while recall answers "How many relevant items are selected?"
Putting this all together, which would be the correct measure to answer the following questions?
• Accuracy
• Recall
• Precision
In our case of predicting if a person has a chronic illness, it would be better to have a high recall because we do not want to leave untreated any patients who have the disease. It's better to have false alarms rather than missed positive cases, so we might be okay with the low-precision but high-recall trade-off.
Note: In case our dataset is not skewed, but rather a balanced representation of the two classes, then it is totally okay to use Accuracy as an evaluation measure:
Accuracy = (TP + TN) / Total = (TP + TN) / (TP + TN + FP + FN)
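A hedged sketch of computing these metrics with Scikit-Learn, re-using the cat/dog example above and treating "cat" as the positive class:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# 8 actual cats (5 predicted cat, 3 predicted dog) and 5 actual dogs (2 predicted cat, 3 predicted dog)
y_true = ['cat'] * 8 + ['dog'] * 5
y_pred = ['cat'] * 5 + ['dog'] * 3 + ['cat'] * 2 + ['dog'] * 3

print(confusion_matrix(y_true, y_pred, labels=['cat', 'dog']))
print(accuracy_score(y_true, y_pred))                    # (TP + TN) / total
print(precision_score(y_true, y_pred, pos_label='cat'))  # TP / (TP + FP)
print(recall_score(y_true, y_pred, pos_label='cat'))     # TP / (TP + FN)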
Worked Example
Suppose the fecal occult blood (FOB) screen test is used in 2030 people to look for bowel cancer. The screen test gives: true positives (TP) = 20, false negatives (FN) = 10, and true negatives (TN) = 1820 (the remaining 180 results are false positives).

Sensitivity = TP / (TP + FN) = 20 / (20 + 10) = 67%
AUC-ROC Curve
AUC (Area Under the Curve) - ROC (Receiver Operating Characteristic) curve is a performance measurement for a classification model at various classification threshold settings. Basically, it is a probability curve that tells us how well the model is capable of distinguishing between classes. The higher the AUC value of our probability curve, the better the model is at predicting 0s as 0s and 1s as 1s.
The ROC curve is plotted with the True Positive Rate (Recall/Sensitivity) against the False Positive Rate (FPR, 1 - Specificity), where TPR is on the y-axis and FPR is on the x-axis:
Sensitivity = TPR = TP / (TP + FN)
A great model has an AUC near 1, indicating it has an excellent measure of separability. On the other hand, a poor model has an AUC near 0, meaning it is predicting 0s as 1s and 1s as 0s. And when the AUC is 0.5, the model has no class separation capacity whatsoever and is essentially making random predictions.
Let's understand this better via an example analysis taken from a medical research journal:
• If the threshold is too low: a lot of healthy patients will be wrongly diagnosed (false positives).
• If the threshold is too high: a lot of patients who actually have the disease will be missed (false negatives).
A ROC curve can help us in identifying the sweet spot: a balance between TPR and FPR.
A ROC curve is generated across all the threshold settings and the AUC (area under the curve) value is determined (Figure 3 in the image below).
Let's say the black dashed line is the ROC curve for our data in this example. We could choose X = 0.1 and Y = 0.8, so that our model based on the given biomarker, Protein A, would have a specificity of 90% and a sensitivity of 80% in identifying Alzheimer's patients.
Figure: (1) Distributions of Protein A concentration for Alzheimer's disease vs. healthy subjects and the chosen threshold, (2) the resulting true/false positives and negatives, and (3) ROC curves (sensitivity vs. 1 - specificity) comparing biomarkers.
Key Points to Remember
Machine learning algorithms come with the promise of being able to figure out how to perform important tasks by learning from data, i.e., generalizing from examples without being explicitly told what to do. This means that the larger the amount of data, the more ambitious the problems that can be tackled by these algorithms. However, developing successful machine learning applications requires quite some "black art" that is hard to find.
Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into subsets, holding out each one while training on the rest, testing each learned classifier on the unseen examples, and averaging the results to see how well the particular parameter setting does.
Domain knowledge and an understanding of our data are crucial in making the right assumptions. The need for knowledge in learning should not be surprising. Machine learning is not magic; it can't get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.
Image Credits: Machine Learning for Biomedical Applications: From Crowdsourcing to Deep Learning; http://mediatum.ub.tum.de/doc/1368117/47614.pdf
Very often the raw data does not even come in a form ready for learning. But we can construct features from it that can be used for learning. In fact, this is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity, and "black art" are as important as the technical stuff.
Figure: Raw data (e.g., a house_info record with num_bedrooms: 3, street_name: "Shorebird Way", ...) doesn't come to us as feature vectors; the process of creating features from it is feature engineering.
In the early days of machine learning, people tried many variations of different learners but still only used the best one. But then researchers noticed that, if instead of selecting the best variation found we combine many variations, the results are better (often much better) and with only a little extra effort for the user. Creating such model ensembles is now very common. In the Netflix prize, teams from all over the world competed to build the best video recommender system. As the competition progressed, teams found that they obtained the best results by combining their learners with other teams', and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Together is better!
xkcd comic on correlation vs. causation ("... going to assume cancer in the US causes cell phones"). Image Credits: https://xkcd.com
We have all heard that correlation does not imply causation, but still people frequently think it does.
Often the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. But unless we do an actual experiment, it's difficult to tell if this is true. Correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation, not as our final conclusion.
Note: This lesson is an excerpt from my blog post on this topic. You can read the full article here.
Machine Learning Project Checklist
Checklist
You have been hired as a new Data Scientist and you have an exciting project to work on! How should you go about it?
In this lesson, we are going to go through a checklist and talk about some best practices that you should consider adopting when working on an end-to-end ML project.
1. Get the data: Do NOT forget about data privacy and compliance here; they are of critical importance.
2. Summarize the data: find the type of variables or map out the underlying data structure, find correlations among variables, identify the most important variables, check for missing values and mistakes in the data, etc.
3. Visualize the data to take a broad look at patterns, trends, anomalies, and outliers. Use data summarization and data visualization techniques to understand the story the data is telling you.
4. Start simple: Begin with a very simplistic model, like linear or logistic regression, with minimal and prominent features (directly observed and reported features). This will allow you to gain a good familiarity with the problem at hand and also set the right direction for the next steps.
5. Fine-tune the parameters of your models and consider combining them for the best results.
6. Remember to tailor your presentation based on the technical level of your target audience. For example, when presenting to non-technical stakeholders, remember to convey key insights without using heavy technical jargon. They are likely not going to be interested in hearing about all the cool ML techniques you adopted, but rather in end results and key insights.
7. If the scope of your project is more than just extracting and presenting insights from data, plan for launching, monitoring, and maintaining your model as well.
Of course, this checklist is just a reference for getting started. Once you start working on real projects, adapt, improvise, and at the end of each project reflect on the takeaways; learning from mistakes is essential!
Note: We are going to go into the technical details of all these steps, and their sub-steps, in the "Project Lessons"; the purpose of this checklist is to serve as a very high-level guideline or best-practices reminder.
In this section, we are going to deconstruct the main steps needed to work on an ML project via a real end-to-end example with code. We are going to work with a challenge based on a Kaggle Competition.
Get Data → Train Model → Improve
Our project is based on the Kaggle Housing Prices Competition. In this challenge we are given a dataset with different attributes for houses and their prices. Our goal is to develop a model that can predict the prices of houses based on this data.
What does this tell us in terms of framing our problem? The fact that we are given labelled examples (houses with known prices) tells us that this is clearly a typical supervised learning task, while the fact that the value to predict is numeric tells us that this is a typical regression task.
If we look at the file with the data description, data_description.txt, we can see the kind of attributes we are expected to have for the houses we are working with. Here is a sneak peek into some of the interesting attributes and their descriptions from that file:
• SalePrice: the property's sale price in dollars. This is the target variable that we are trying to predict.
• MSSubClass: The building class.
• LotFrontage: Linear feet of street connected to property.
• LotArea: Lot size in square feet.
• Street: Type of road access.
• Alley: Type of alley access.
• LotShape: General shape of property.
• LandContour: Flatness of the property.
• LotConfig: Lot configuration.
• LandSlope: Slope of property.
• Neighborhood: Physical locations within Ames city limits.
• Condition1: Proximity to main road or railroad.
• HouseStyle: Style of dwelling.
• OverallQual: Overall material and finish quality.
• OverallCond: Overall condition rating.
• YearBuilt: Original construction date.
Note: Before moving further, download the dataset, train.csv, from here. Launch your Jupyter notebook and then follow along! It is important to get your hands dirty; don't just read through these lessons!
You can also find the Jupyter notebook with all the code for this project on my Git profile, here.
You can see the live execution of code in the Jupyter Notebook at the end of the lesson and can also play with it.
Let's start by importing the modules and getting the data. In the code snippet below, it is assumed that you have downloaded the csv file and saved it in the working directory as 'data/train.csv'.
# Core modules
import pandas as pd
import numpy as np

# Basic modules for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load the data (path as described above)
housing = pd.read_csv('data/train.csv')
In [3]: housing.columns
Out[3]: Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
How many attributes do we have in total? Of course, we are not going to count them ourselves from the results above! Let's get the number of rows and columns in our DataFrame by calling shape.
There are only 1460 training examples in the dataset, which means that it is small by machine learning standards. The shape of the dataset also tells us that we have 81 attributes. Of the 81 attributes, one is the Id for the houses (not useful as a feature) and one is the target variable, SalePrice, that the model should predict. This means that we have 79 attributes that have the potential to be used to train our predictive model.
Now let's take a look at the top five rows using the DataFrame's head() method.
In [5]: housing.head()
Out[5]: (the first five rows of the DataFrame: 5 rows × 81 columns, with columns such as Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, ..., PoolArea, PoolQC, Fence, MiscFeature, MiscVal)
Each row represents one house. We can see that we have both numerical (e.g., LotFrontage) and categorical attributes (e.g., LotShape). We also notice that we have many missing values (NaN), as not all the houses have values set for all the attributes.
We have a column called Id which is not useful as an attribute. We can either omit it or use it as an index for our DataFrame. We are going to drop that column because the indexes of houses are not relevant for this problem anyway.
The info() method is useful to get a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values. So let's drop the Id column and call the info() method:
The info() method tells us that we have 37 numerical attributes (3 float64 and 34 int64) and 43 categorical columns. Notice that we have many attributes that are not set for most of the houses. For example, the Alley attribute has only 91 non-null values, meaning that all other houses are missing this feature. We will need to take care of this later.
Here we have a mix of numerical and categorical attributes. Let's look at these separately and also use the describe() method to get their statistical summary.
Numerical Attributes
# Get the data summary with up to 2 decimals and call transpose() for a better view of the results
housing.select_dtypes(exclude=['object']).describe().round(decimals=2).transpose()
The count, mean, min, and max columns are self-explanatory. Note that the null values are ignored; for example, the count of LotFrontage is 1201, not 1460. The std column shows the standard deviation, which measures how dispersed the values are.
The 25%, 50%, and 75% columns show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, 25% of the houses have a YearBuilt lower than 1954, while 50% are lower than 1973 and 75% are lower than 2000.
Recall from the statistics lessons that the 25th percentile is also known as the 1st quartile, the 50th percentile is the median, and the 75th percentile is also known as the 3rd quartile.
Categorical Attributes
Note that for categorical attributes we do not get a statistical summary. But we can get some important information, like the number of unique values and the top value for each attribute. For example, we can see that we have 8 types of HouseStyle, with 1Story houses being the most frequent type.
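A hedged guess at the one-liner that produces this categorical summary (not necessarily the author's exact code):

# Summary of the categorical attributes: count, number of unique values, top value, and its frequency
housing.select_dtypes(include=['object']).describe().transpose()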
Distribution plot of SalePrice (values ranging roughly from 0 to 800,000).
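A hedged sketch of the snippet that might have produced the plot above:

sns.distplot(housing['SalePrice'])
plt.title('Distribution of SalePrice')
plt.show()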
The distribution plot tells us that we have a skewed variable. In fact, from the statistical summary we already saw that the mean price is about 181K while 50% of the houses were sold for less than 163K.
When dealing with skewed variables, it is a good practice to reduce the skew of the dataset because it can impact the accuracy of the model. This is an important step if we are going to use linear regression modeling; other algorithms, like tree-based Random Forests, can handle skewed data. We will understand this in detail later under "Feature Scaling". For now, let's look at the updated distribution of our target variable once we apply a log transformation to it.
Applying a log transformation means simply taking the log of the skewed variable to improve the fit by altering the scale and making the variable more normally distributed.
sns.distplot(np.log(housing['SalePrice']))
plt.title('Distribution of Log-transformed SalePrice')
plt.xlabel('log(SalePrice)')
plt.show()
We can clearly see that the log-transformed variable is more normally distributed and we have managed to reduce the skew.
What about all the other numerical variables? What do their distributions look like? We can plot the distributions of all the numerical variables by calling the distplot() method in a for loop, like so:
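A hedged sketch of such a loop (the grid size is computed from the number of columns rather than hard-coded):

num_attributes = housing.select_dtypes(exclude=['object'])

n_cols = 4
n_rows = int(np.ceil(len(num_attributes.columns) / n_cols))
fig = plt.figure(figsize=(12, 20))
for i, col in enumerate(num_attributes.columns):
    fig.add_subplot(n_rows, n_cols, i + 1)
    sns.distplot(num_attributes[col].dropna())
plt.tight_layout()
plt.show()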
Notice how varied the distributions and scales of the different variables are; this is the reason we need to do feature scaling before we can use these features for modeling. For example, we can clearly see how skewed LotArea is. It is in dire need of some polishing before it can be used for learning.
Note: We will get back to all the needed transformations and "applying the fixes" later. In this exploratory analysis step, we are just taking notes on what we need to take care of in order to create a good predictive model.
In the statistics lesson, we learned that boxplots give us a good overview of our data. From the distribution of observations in relation to the upper and lower quartiles, we can spot outliers. Let's see this in action with the boxplot() method and a for loop to plot all the attributes in one go:
# Boxplots for all numerical attributes
num_attributes = housing.select_dtypes(exclude=['object'])
n_rows = int(np.ceil(len(num_attributes.columns) / 4))   # enough rows for a 4-column grid
fig = plt.figure(figsize=(12, 20))
for i in range(len(num_attributes.columns)):
    fig.add_subplot(n_rows, 4, i + 1)
    sns.boxplot(y=num_attributes.iloc[:, i])
plt.tight_layout()
plt.show()
Boxplots of the numerical attributes (e.g., OverallQual, LotArea).
From the boxplots we can see that, for instance, LotFrontage values above 200 and LotArea values above 150000 can be marked as outliers. However, instead of relying on our own "visual sense" to spot patterns and define the range for outliers, when doing data cleaning we will use the knowledge of percentiles to be more accurate. For now, our takeaway from this analysis is that we need to take care of outliers in the data cleaning phase.
Just-for-fun plot
Our brains are very good at spotting patterns in pictures, but sometimes we need to play around with visualization parameters and try out different kinds of plots to make those patterns stand out. Let's create a fun example plot for learning to play with visualizations, especially when we want to analyze relations among multiple variables at once.
We are going to look at the prices. The radius of each circle represents GrLivArea (option s), and the color represents the price (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).
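A hedged guess at what such a plot could look like in code (the exact columns and scaling factor are assumptions):

housing.plot(kind='scatter', x='OverallQual', y='YearBuilt',
             s=housing['GrLivArea'] / 20,               # circle radius ~ GrLivArea
             c='SalePrice', cmap=plt.get_cmap('jet'),   # color ~ price, blue (low) to red (high)
             colorbar=True, alpha=0.5, figsize=(10, 7))
plt.show()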
The plot above tells us that the housing prices are very much related to YearBuilt (y-axis) and OverallQual (x-axis). Newer and higher-quality houses mean more expensive prices. This is shown by the increasing red going towards the upper-right end of the plot, and vice versa. Prices are also related to GrLivArea, the radius of the circle.
We can use the corr() method to easily get the correlations and then visualize them using the heatmap() method. Python does feel like magic quite often, doesn't it?!
The corr() method returns pairs of all attributes and their correlation coefficients in the range [-1, 1], where 1 indicates positive correlation, -1 negative correlation, and 0 means no relationship between the variables at all.
corr = housing.corr()

# Using a mask to get a triangular correlation matrix
f, ax = plt.subplots(figsize=(12, 10))   # figure size is an assumption; the original value was cut off
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.show()
From the heatmap, we can easily see that we have some variables that are highly correlated with price (darker red) and that there are variables highly correlated among themselves as well. The heatmap is useful for a first high-level overview. Let's get a sorted list of correlations between all the attributes and the target variable, SalePrice, for a deeper understanding of what's going on.
From these values, we can see that OverallQual and GrLivArea have the most impact on price, while attributes like PoolArea and MoSold are not related to it.
Pair-wise scatter matrix
We have a lot of unique pairs of variables, i.e., N(N - 1)/2. Joint distributions can be used to look for a relationship between all of the possible pairs, two at a time.
For the sake of completeness, we might want to display a rough joint distribution plot for each pair of variables. This can be done by using pairplot() from sns. Since we have a fairly big N, we are going to create scatter plots for only some of the interesting attributes to get a visual feel for these correlations.
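A hedged sketch of such a pairplot for a few interesting attributes (the exact subset is an assumption):

cols = ['SalePrice', 'OverallQual', 'YearBuilt', 'GrLivArea']
sns.pairplot(housing[cols].dropna())
plt.show()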
Pair-wise scatter plots for SalePrice, OverallQual, and YearBuilt.
From the pairplots, we can clearly see how the price increases as GrLivArea increases. Play around with the other attributes as well.
In order to train our "creating plots muscle", let's look at other types of plots that can make the relationship of the highest correlated variable, OverallQual, with the target variable, SalePrice, really stand out.
sns.barplot(x=housing.OverallQual, y=housing.SalePrice)

# Boxplot
plt.figure(figsize=(18, 8))
sns.boxplot(x=housing.OverallQual, y=housing.SalePrice)
We can see that we have many highly correlated attributes, and these results confirm our common-sense analysis.
Note: We can take some notes here for the feature selection phase, where we are going to drop the highly correlated variables. For example, GarageCars and GarageArea are highly correlated, but since GarageCars has a higher correlation with the target variable, SalePrice, we are going to keep GarageCars and drop GarageArea. We will also drop the attributes that have almost no correlation with price, like MoSold, 3SsnPorch, and BsmtFinSF2.
# Boxplot of SalePrice by kitchen quality (var is assumed to be the KitchenQual column, matching the plot below)
var = housing.KitchenQual
f, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(y=housing.SalePrice, x=var)
plt.show()
Boxplot of SalePrice grouped by KitchenQual.
We can now see that Ex seems to be the most expensive option while Fa brings the prices down.
What about the style of the houses? Which styles do we have and how do they impact prices?
f, ax = plt.subplots(figsize=(12, 8))
sns.boxplot(y=housing.SalePrice, x=housing.HouseStyle)
plt.xticks(rotation=40)
plt.show()
We can see that 2Story houses have the highest variability in prices and they also tend to be more expensive, while 1.5Unf is the cheapest option.
Say we want to get the frequency of each of these types; we can use the countplot() method from sns like so:
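A hedged sketch of that countplot call:

plt.figure(figsize=(10, 5))
sns.countplot(x='HouseStyle', data=housing)
plt.show()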
Now we know that most of the houses are 1Story type houses. Say we do not want a frequency distribution plot but only the exact count for each category; we can get that easily from the DataFrame directly:
housing['HouseStyle'].value_counts()
We are also curious to see if the style of the houses has changed over the years, so let's plot the two variables against each other.
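One possible (assumed) way to plot the two variables against each other is a strip plot of YearBuilt by HouseStyle:

f, ax = plt.subplots(figsize=(12, 6))
sns.stripplot(x=housing.HouseStyle, y=housing.YearBuilt, jitter=True, alpha=0.5)
plt.show()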
Now we know that 2Story and 1Story have been there for ages and they continue to be built, while SFoyer and SLvl are relatively newer styles. We can also notice that 2.5Fin, 2.5Unf, and 1.5Unf are deprecated styles.
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
Click on the "Click to Launch" button to work with and see the code running live in the notebook.
You can also click to open the Jupyter Notebook in a new tab.
Go to File, click Download as, and then choose the format of the file to download. You can choose Notebook (.ipynb) to download the file and work locally or on your personal Jupyter Notebook.
Aaand we are done with the initial Exploratory Analysis! We will move on to Data Preprocessing in the next lesson.
Kaggle Challenge - Data Preprocessing
We'll cover the following:
• Feature Scaling
• Jupyter Notebook
housing.isnull().sum().sort_values(ascending=False)
From the results above, we can assume that the PoolQC to Bsmt attributes are missing for the houses that do not have these facilities (houses without pools, basements, garages, etc.). Therefore, the missing values could be filled in with "None". MasVnrType and MasVnrArea both have 8 missing values, likely houses without masonry veneer.
Most machine learning algorithms cannot work with missing features, so we need to take care of them. Essentially, we have three options:
• Get rid of the rows with missing values.
• Get rid of the whole attribute.
• Fill in the missing values with some value (zero, the mean, the median, "None", etc.), i.e., imputation.
We can accomplish these easily using the DataFrame's dropna(), drop(), and fillna() methods.
Note: Whenever you choose the third option, say imputing values using the median, you should compute the median value on the training set and use it to fill the missing values in the training set. But you should also remember to later replace missing values in the test set using the same median value when you want to evaluate your system, and also, once the model gets deployed, to replace missing values in new unseen data.
We are going to apply different approaches to fix our missing values, so that we can see the various approaches in action:
Right now, we are going to look at how to do these fixes by explicitly writing the name of the column in the code. Later, in the upcoming section on transformation pipelines, we will learn how to handle them in an automated manner as well.
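A hedged example of such explicit, column-by-column fixes (the exact columns and fill values are assumptions, not necessarily the author's choices):

housing_processed = housing.copy()

# Facility-type attributes: a missing value most likely means "no such facility"
for col in ['PoolQC', 'Alley', 'Fence', 'FireplaceQu', 'GarageType', 'BsmtQual']:
    housing_processed[col] = housing_processed[col].fillna('None')

# Numerical attribute: impute with the median
housing_processed['LotFrontage'] = housing_processed['LotFrontage'].fillna(
    housing_processed['LotFrontage'].median())

# Masonry veneer: assume the house has none
housing_processed['MasVnrType'] = housing_processed['MasVnrType'].fillna('None')
housing_processed['MasVnrArea'] = housing_processed['MasVnrArea'].fillna(0)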
# Remove rows whose values lie above the 99.9th percentile of any numerical attribute
high_quant = housing_processed.quantile(.999)

for i in num_attributes.columns:
    housing_processed = housing_processed.drop(
        housing_processed[i][housing_processed[i] > high_quant[i]].index)

housing_processed.info()
Invoking the info() method on the updated DataFrame tells us that we are left with 1422 rows now.
# Remove attributes that were identified for exclusion when viewing scatter plots & corr values
attributes_drop = ['MiscVal', 'MoSold', 'YrSold', 'BsmtFinSF2', 'BsmtHalfBath', 'MSSubClass',
                   'GarageArea', 'GarageYrBlt', '3SsnPorch']
housing_processed = housing_processed.drop(attributes_drop, axis=1)
A common approach to deal with textual data is to create one binary attribute for each category of the feature: for example, for the type of houses, we would have one attribute equal to 1 when the category is 1Story (and 0 otherwise), another attribute equal to 1 when the category is 2Story (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are also known as dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:
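A hedged sketch of applying OneHotEncoder to the categorical columns (the variable names are assumptions):

from sklearn.preprocessing import OneHotEncoder

cat_attributes = housing_processed.select_dtypes(include=['object'])
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(cat_attributes)  # a sparse matrix of 0s and 1s
print(housing_cat_1hot.shape)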
Notice that as a result of creating the new one-hot attributes, our total number of attributes has jumped to 7333! We have a 1422 x 7333 matrix, which is mostly sparse (zeros).
Feature Scaling
Feature Scaling is one of the most important transformations we need to apply to our data. As we said earlier, machine learning algorithms mostly do not perform well if they are fed numerical attributes with very different scales as input. This is the case for the housing data. If you go back and look at the distribution plots that we created in the very beginning, we notice that LotArea ranges from 0 to 200000, while GarageCars ranges only from 0 to 4.
There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.
• Min-max scaling (also known as normalization): this is a simple technique. Values are shifted and rescaled so that they end up ranging from 0 to 1. This can be done by subtracting the min value and dividing by the max minus the min, but fortunately Scikit-Learn provides a transformer (we will talk about transformers in a bit) called MinMaxScaler to do this in a hassle-free manner. This transformer also provides the feature_range hyperparameter so that we can change the range if for some reason we don't want the 0 to 1 scale.
X_scaled = (X - X_min) / (X_max - X_min)
• Standardization: this is a more sophisticated approach. Remember the lessons from statistics? Standardization is done by first subtracting the mean value (so standardized values always have a 0 mean) and then dividing by the standard deviation, so that the resulting distribution has unit variance. Since it only cares about "fixing" the mean and variance, standardization does not limit values to a specific range, which may be problematic for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. Say Bill Gates walks into a bar: suddenly the average income of the people in the bar would shoot up to the moon, so min-max scaling would be a poor choice for scaling here. On the other hand, standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.
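A hedged, made-up illustration of the two scalers and the Bill Gates effect:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

incomes = np.array([[30000], [45000], [52000], [38000], [1000000000]])  # Bill Gates walks in

print(MinMaxScaler().fit_transform(incomes).ravel())    # everyone else gets squashed towards 0
print(StandardScaler().fit_transform(incomes).ravel())  # zero mean, unit variance; less distorted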
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
Click on the "Click to Launch" button to work with and see the code running live in the notebook.
3. Transformation Pipelines
* Jupyter Notebook
3. Transformation Pipelines
As you can see, from imputing missing values to feature scaling to handling categorical attributes, we have many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn is here to make our life easier: Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
Some Scikit-Learn terminology:
• Estimators: An object that can estimate some parameters based on a dataset (e.g., an imputer is an estimator). The estimation itself is performed by simply calling the fit() method.
Based on some of the data preparation steps we have identified so far, we are going to create a transformation pipeline based on the SimpleImputer (*) and StandardScaler classes for the numerical attributes and OneHotEncoder for dealing with the categorical attributes.
(*) Scikit-Learn provides a very handy class, SimpleImputer, to take care of missing values. You just tell it the type of imputation, e.g., by median, and voila, the job is done. We have already talked about the other two classes.
First, we will look at a simple example pipeline to impute and scale the numerical attributes. Then we will create a full pipeline to handle both numerical and categorical attributes in one go.
The numericalpipeline:
# Import modules
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Separate features and target variable
housing_X = housing_processed.drop('SalePrice', axis=1)
housing_y = housing_processed['SalePrice'].copy()

# Get the list of names for numerical and categorical attributes separately
num_attributes = housing_X.select_dtypes(exclude='object')
cat_attributes = housing_X.select_dtypes(include='object')

num_attribs = list(num_attributes)
cat_attribs = list(cat_attributes)

# Numerical pipeline to impute any missing values with the median and scale the attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
Note that we have separated the SalePrice attribute into a separate variable, because for creating the machine learning model we need to separate all the features, housing_X, from the target variable, housing_y.
The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. The names can be whatever we want as long as they are unique and without double underscores. The pipeline is run sequentially, one transformer at a time, passing the output of each call as the input to the next call. In this example, the last estimator is a StandardScaler (a transformer), and the pipeline applies all the transforms to the data in sequence.
So far, we have handled categorical and numerical attributes separately. It is more convenient and clean to have a single transformer handle all columns, applying the appropriate transformations to each column. Scikit-Learn comes to the rescue again by providing the ColumnTransformer for this very purpose. Let's use it to apply all the transformations to our data and create a complete pipeline.
from sklearn.preprocessing import OneHotEncoder

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])
In this example, we specify that the numerical columns should be transformed using the num_pipeline that we defined earlier, and the categorical columns should be transformed using a OneHotEncoder. Finally, we apply this ColumnTransformer to the housing data using fit_transform().
And that's it! We have a preprocessing pipeline that takes the housing data and applies the appropriate transformations to each column.
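A quick sketch of that last step, assuming the names defined above:

housing_prepared = full_pipeline.fit_transform(housing_X)
print(housing_prepared.shape)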
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
Click on the "Click to Launch" button to work with and see the code running live in the notebook.
The great news is that, thanks to all the previous steps, things are going to be way simpler than you might think! Scikit-learn makes it all very easy!
As a first step we are going to split our data into two sets: a training set and a test set. We are going to train our model only on part of the data because we need to keep some of it aside in order to evaluate the quality of our model.
Creating a test set is quite simple: the most common approach is to pick some instances randomly, typically 20% of the dataset, and set them aside. The simplest function for doing this is Scikit-learn's train_test_split().
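A hedged sketch of that split (variable names follow the earlier pipeline step and are assumptions):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    housing_prepared, housing_y, test_size=0.2, random_state=42)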
With the training and test data in hand, creating a model is really easy. Say we want to create a Linear Regression model. In general, this is what it looks like:
# Import modules
from sklearn.linear_model import LinearRegression

# Create and fit a Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
And that's it! There you have a linear regression model in three lines of code!
Now we want to create and compare multiple models, so we are going to store the results from the evaluation of each model in a variable. Since we are dealing with a regression problem, we are also going to use RMSE as the main performance measure to assess the quality of our models.
The equation for RMSE is simple: we sum the squares of all the errors between predicted values and actual values, divide by the total number of test examples, and then take the square root of the result:
RMSE = sqrt( (1/m) * Σ (ŷ_i - y_i)² )
Again, don't worry about implementing formulas, because we are going to measure the RMSE of our regression models using Scikit-learn's mean_squared_error function.
# Helper to invert the log transformation applied to SalePrice
def inv_y(y):
    return np.exp(y)
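A hedged sketch of how this measurement could look for one model (assuming the fitted lin_reg and the inv_y helper above):

from sklearn.metrics import mean_squared_error

predictions = lin_reg.predict(X_test)
lin_rmse = np.sqrt(mean_squared_error(inv_y(y_test), inv_y(predictions)))
print(lin_rmse)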
We have trained four different models. As you can see, training one model instead of another just means that you select a different one from Scikit-Learn's library and change a single line of code!
Now let's get the performance measures for our models in sorted order, from best to worst:
The simplest model, Linear Regression, seems to be performing the best, with predicted prices that are off by about 24K. This might or might not be an acceptable amount of deviation depending on the desired level of accuracy or the metric we are trying to optimize based on our business objective.
General Notes
A large prediction error usually means an example of a model underfitting the training data. When this happens, it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model.
In this case, we have trained more powerful models, capable of finding complex nonlinear relationships in the data, like a DecisionTreeRegressor, as well. However, the more powerful model seems to be performing worse! The Decision Tree model is overfitting badly enough to perform even worse than the simpler Linear Regression model.
Possible solutions to deal with overfitting are to simplify the model, constrain it, or get more training data.
Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is used to improve the performance of the algorithms. In fact, we can see that Random Forests are performing much better than Decision Trees.
One way to evaluate models is to split the training set into a smaller training set and a validation set, then train the models against the smaller training set and evaluate them against the validation set. Doing this repeatedly on different splits is called cross-validation. We can use Scikit-Learn's cross-validation feature, cross_val_score, for this.
Let's perform a K-fold cross-validation on our best model: the cross-validation function randomly splits the training set into K distinct subsets or folds, then it trains and evaluates the model K times, picking a different fold for evaluation every time and training on the remaining K-1 folds. The result is an array containing the K evaluation scores:
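A hedged sketch of such a call (forest_model stands for whichever fitted model performed best; 10 folds are an assumption):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(forest_model, X_train, y_train,
                         scoring='neg_mean_squared_error', cv=10)
rmse_scores = np.sqrt(-scores)   # scores are negative MSE, so flip the sign before the square root
print(rmse_scores.mean(), rmse_scores.std())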
Note: In general, save your models so that you can come back to any model you want. Make sure to save the hyperparameters, the trained parameters, and also the evaluation scores. Why? Because this will allow you to easily compare scores across model types and compare the types of errors they make. This will be especially useful when the problem is complex, your notebook is huge, and/or model training time is very long.
Scikit-learn models can be saved easily using the pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
# Save the model to disk (it can later be restored with joblib.load)
joblib.dump(my_model, "my_model.pkl")
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
Click on the "Click to Launch" button to work with and see the code running live in the notebook.
Grid Search
Should we fiddle with all the possible values manually and then compare results to find the best combination of hyperparameters? This would be really tedious work, and we would end up exploring only a few possible combinations.
Luckily, we can use Scikit-learn's GridSearchCV to do this tedious search work for us. All we need to do is tell it which hyperparameters we would like to explore and which values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
For example, let's see how to search for the best combination of hyperparameter values for the RandomForestRegressor:
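A hedged sketch of such a search (the parameter grid below is hypothetical, chosen to be consistent with the best values shown further down):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_features': [20, 50, 'sqrt'],
    'bootstrap': [True, False],
}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)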
We can use best_params_ to visualize the best values for the passed hyperparameters, and best_estimator_ to get the fine-tuned model:
# Best values
grid_search.best_params_

# Model with best values
grid_search.best_estimator_
In [52]: grid_search.best_params_
Out[52]: {'bootstrap': False, 'max_features': 50, 'n_estimators': 150}

In [53]: grid_search.best_estimator_
Out[53]: RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
                               max_features=50, max_leaf_nodes=None, min_impurity_decrease=0.0,
                               min_impurity_split=None, min_samples_leaf=1,
                               min_samples_split=2, min_weight_fraction_leaf=0.0,
                               n_estimators=150, n_jobs=None, oob_score=False,
                               random_state=None, verbose=0, warm_start=False)
# Use the fine-tuned model (taken from grid_search.best_estimator_) to predict on the test set
rf_model_final = grid_search.best_estimator_
rf_model_final.fit(X_train, y_train)
rf_final_val_predictions = rf_model_final.predict(X_test)

# Get RMSE
rf_final_val_rmse = mean_squared_error(inv_y(rf_final_val_predictions), inv_y(y_test))
np.sqrt(rf_final_val_rmse)

# Get Accuracy
rf_model_final.score(X_test, y_test) * 100
Wow! Our accuracy has gone up from about 84.8 to 87.8 while the RMSE has decreased from 31491 to 28801. This is a significant improvement!
1. Randomized Search
The grid search approach is acceptable when we are exploring relatively few combinations, but when the number of combinations of hyperparameters is large, it is often preferable to use RandomizedSearchCV. This is similar to the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations at every iteration.
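A hedged sketch of a randomized search (the distributions below are hypothetical):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(50, 300),
    'max_features': randint(10, 80),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions, n_iter=20, cv=5,
                                scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(X_train, y_train)
print(rnd_search.best_params_)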
2. Ensemble Methods
Another way to fine-tune is to try to combine the models that perform best. The group, or ensemble, will often perform better than the best individual model, just like Random Forests perform better than the individual Decision Trees they rely on, especially if the individual models make very different types of errors.
Jupyter Notebook
You can see the instructions running in the Jupyter Notebook below:
Click on the "Click to Launch" button to work with and see the code running live in the notebook.
Side Note: Say this housing example was a real project. The final performance of the model could be used to understand if an ML-based solution can be used to replace human experts in the loop. Automating these tasks is useful because it means that the experts get to have more free time, which they can dedicate to more interesting and productive tasks.
When ML models are in production, it is crucial to have monitoring in place in order to check the system's performance at regular intervals and trigger alerts when things go bananas.
Finally, you will likely need to retrain your models at regular intervals using fresh data. In order to avoid doing the same tasks over and over again, strive to automate this process as much as possible. Automating means that you can run updates at exact intervals without procrastination issues, and your system will stay up-to-date and avoid bad fluctuations over time.
Of course, these steps are not needed if you are just building a model, say for a Kaggle competition. In that case you can stop at fine-tuning!
Open Datasets
There are thousands of open datasets, ranging across all sorts of domains, just waiting for you. Here are a few popular places you can look at to get lots of open data:
• Kaggle datasets
• Amazon's AWS datasets
I would recommend you start on Kaggle because you will have a good dataset to tackle, a clear goal, and people to share your experience with.
Data science is a vast field. There is always more to learn and explore. Especially if you start browsing and looking for resources around the Internet, it won't take long before you get information overload. The key is to not become overwhelmed.
Choose one or two courses or books at a time. Read, learn, understand, and apply the concepts before jumping on to the next, new, shiny thing. It is important to apply the concepts as you learn them, as we did throughout this course. It can be tempting to buy every book and start every course, but then you will more than likely never finish any of them. So rather than going all over the place, remember to focus and to learn by practicing.
Hence, to keep it short and sweet, I am not going to give you a list of 100 resources! Here are my top two recommendations for you to go deeper and/or wider into the topics we have covered (and not covered):
You have mastered the most essential concepts in Data Science now. Obviously, there are a gazillion technologies, techniques, algorithms, and everything in between that a Data Scientist "must" know. Should you try to learn everything before you feel qualified to apply for a Data Scientist job? Definitely not! Don't get trapped in the black hole of attempting to tick every check box in the world. Chances are you will stay stuck ticking check boxes. It might eventually work, but it would be an inefficient process. There is a good reason why there is the concept of learning on the job. Let me give you two smart approaches to launch your career as a Data Scientist, based on your personality type.
If your answer is 1, pick route A. If you chose 2, pick route B. If you are in between, read both and decide based on what sounds best to you.
Route A
1. Think about your dream job as a Data Scientist. Which industry would it be in? Do you see yourself in finance, sports, health, fitness, beers, cookies, oceans? What are the things that pique your curiosity?
2. Have you identified your kick? Good! Now find a public dataset from that field and press the fast-forward button: imagine that you are giving a presentation about your awesome project. What stories from your data would you be telling your audience? How would you be providing value to the people listening? Based on this vision of your future presentation, reverse engineer the problem, formulate interesting questions, and define the end goal for your project.
3. Once you are done with your exciting project, put the bait on the hook and throw it in the water. Build a PUBLIC portfolio and MAKE SOME NOISE! Use the power of LinkedIn to reach out to recruiters and let your portfolio do the talking instead of some boring conventional CV. Go to meetups and talk to people about your projects. Use the power of networking events to find potential employers.
Your portfolio will already unlock many doors. But if, on top of that, your topic and findings are valuable to a large audience, you'll be receiving more incoming recruiting calls than you can imagine.
If it wasn't obvious enough, you will end up with a lot more than a compelling portfolio: you will have learned the latest tools and techniques in a fun, curiosity-driven, and lasting way.
Route B
I am going to give you an uncommon approach that can make you instantly stand out from the crowd and get you that high-paying job. It is a deceptively simple approach, but only suitable for those who can take on difficult challenges and do whatever it takes to reach their goals. Are you ready for it? So here it goes:
Find a company that you would like to work at. Reach out to the decision makers in the company using LinkedIn, references, or even just walk in. Offer to work for FREE for 2 months while being assessed on your performance. Yes, insist that you don't want a salary, or even a stipend. Say, “I'm here to learn and assess the suitability of this role for me.” Then buckle down to work for those 2 months and focus on providing VALUE. You will have become a very valuable resource by the end, and you will be surprised how much they'll want you to stay, with a high-paying offer, after those 2 months. That's it. Two months and the job will be yours.
Route A or Route B, you have the key to unlock that high-paying job as a Data Scientist at your dream company.
P.S. If you spend even a couple of weeks doing this, it will change your life.
Imposter Syndrome
“It's only a matter of time until I'm called out. I'm just a fraud.”
If you end up with these feelings just before or during an interview, it can BLOW UP everything that you had been working for. This is why I decided to talk about this with you. I do not want you to give up at the very last moment. I do not want you to stop asking questions for fear of being “discovered”. I do not want you to shut your mouth and stay away from speaking up because there is a voice in your head saying that you do not belong here. I want you to understand where that voice is coming from and how you can deal with it. Because let me tell you one thing: you are NOT an imposter.
I have written 11 books, but each time I think, 'Uh-oh, they're going to find out now. I've run a game on everybody, and they're going to find me out.'
—Maya Angelou
Every time I was called on in class, I was sure that I was about to embarrass myself. Every time I took a test, I was sure that it had gone badly. And every time I didn't embarrass myself — or even excelled — I believed that I had fooled everyone yet again. One day soon, the jig would be up ... This phenomenon of capable people being plagued by self-doubt has a name — the impostor syndrome. Both men and women are susceptible to the impostor syndrome, but women tend to experience it more intensely and be more limited by it.
—Sheryl Sandberg
Ultra-successful people are also plagued with these doubts and feelings. Almost no one is totally immune to the awful Imposter Syndrome. But what is really behind it?
Fields like Data Science are so vast that you can't possibly know everything. Also, the more you learn, the more things you will find to learn further. The knowledge gap will seem to widen more and more. And this can make you feel like crap; like someone who just isn't able to keep up with everything they must learn.
There is another huge factor behind it: our urge to compare ourselves with others. We feel as if all the people around us have way more knowledge than we do, they are way better than us, and they belong here while we just got the ticket by pure luck. Sounds like a devastating state of mind to be in? Well, it is.
Imposter Syndrome makes you focus on all the things you do not know, especially before an interview, and those are not the kind of thoughts you want at that moment. When you detect the red alarm, redirect your focus from the limitless possibilities of things you do not know to all the things you do know and are good at. Remind yourself that you have your own positive strengths, and you do not need to compare yourself with Tom and Harry. Tom and Harry might be great at skills “x, y, z”, but you might be awesome at “a, b, c”. We all have our own unique strengths and weaknesses.
It's good to remember that people who don't feel like impostors are no more intelligent or competent or capable than the rest of us.
A big warning here: I'm NOT telling you to adopt an attitude of arrogance. Far from it! We all have a lot to learn, and this is where the beauty of Continuous Learning and the Growth Mindset come into play. Keep learning, stay humble, and BELIEVE in yourself.
Final Thoughts
Give yourself a PAT ON THE BACK for having made it successfully to the end. Many congratulations on completing all the lessons.
Before saying goodbye, here are two things that I really want you to remember as you continue on your journey as a Data Scientist:
“The path to becoming a great Data Scientist is not a sprint, but a marathon.”
P.S. I hope you enjoyed this course. Let me know how it went. I'll wait to hear back from you. Best of luck with the next steps in your journey to becoming a great Data Scientist!