B14_LT2_07_Numpy Matplotlib Pandas
PANDAS
Lê Ngọc Hiếu
hieu.ln@ou.edu.vn
Objectives
• Understand the concepts behind the NUMPY, MATPLOTLIB, and PANDAS libraries.
• Master the basic operations of the NUMPY, MATPLOTLIB, and PANDAS libraries.
• Apply NUMPY, MATPLOTLIB, and PANDAS to data processing and data analysis problems.
Contents
1. Numpy Package
2. Matplotlib Package
3. Pandas Package
4. Exercises
Numpy Package
What is Numpy?
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open
source project and you can use it freely.
• NumPy stands for Numerical Python.
Why use Numpy?
• In Python we have lists that serve the purpose of arrays,
but they are slow to process.
• NumPy aims to provide an array object that is up to 50x
faster than traditional Python lists.
• The array object in NumPy is called ndarray. It comes with many supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where
speed and resources are very important.
Install and Start Using Numpy
• NumPy is already included in Anaconda.
• To install NumPy yourself, open a Command Prompt window (CMD) and type: pip install numpy
• To use NumPy, import the package before calling its functions, conventionally with: import numpy as np
Numpy Array
• A numpy array is a grid of values, all of the same type, and
is indexed by a tuple of nonnegative integers.
• The number of dimensions is the rank of the array.
• Syntax to get the rank of a numpy array:
<array_name>.ndim
• The shape of an array is a tuple of integers giving the size
of the array along each dimension.
• Syntax to get the shape of a numpy array:
<array_name>.shape
• We can initialize numpy arrays from nested Python lists.
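The attributes above can be tried on a small array. This is a minimal sketch; the values are illustrative and not taken from the slides.

```python
import numpy as np

# Initialize a rank-2 array from a nested Python list (illustrative values).
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.ndim)    # 2: the rank (number of dimensions)
print(a.shape)   # (2, 3): 2 rows, 3 columns
print(a[0, 1])   # 2: element at row 0, column 1
```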
Create Numpy Array
• There are several ways to create NumPy arrays.
• Consider three of them:
• Convert a Python list or tuple using the array function.
• Create an array with initial values using the ones or zeros function.
• Create a sequence of numbers using the arange or linspace function.
Create Numpy Array using array function
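Create Numpy Array using ones or zeros function
Syntax:
• <var_name> = np.zeros(shape)
• <var_name> = np.ones(shape)
• shape is an integer or a tuple with one size per dimension, e.g. (nrows, ncolumns) for a 2-D array or (nblocks, nrows, ncolumns) for a 3-D array.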
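A short sketch of zeros and ones, assuming the argument is a shape tuple with one size per dimension. The shapes are chosen only for illustration.

```python
import numpy as np

zeros_2d = np.zeros((2, 3))    # 2 rows x 3 columns, filled with 0.0
ones_3d = np.ones((2, 3, 4))   # 2 blocks, each 3 rows x 4 columns, filled with 1.0

print(zeros_2d.shape, zeros_2d.ndim)  # (2, 3) 2
print(ones_3d.shape, ones_3d.ndim)    # (2, 3, 4) 3
```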
Create Numpy array using arange or linspace function
Syntax:
• <var_name> = np.arange(start, stop, step)
• <var_name> = np.linspace(start, stop, number_of_elements)
• arange excludes stop, while linspace includes it by default.
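The difference between arange and linspace can be seen in a small sketch (illustrative values): arange takes a step and excludes the end value, while linspace takes a number of elements and includes the end value by default.

```python
import numpy as np

evens = np.arange(0, 10, 2)     # start 0, stop 10 (excluded), step 2
points = np.linspace(0, 1, 5)   # 5 evenly spaced values from 0 to 1 (included)

print(evens)    # [0 2 4 6 8]
print(points)   # [0.   0.25 0.5  0.75 1.  ]
```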
Array Indexing - Slicing
• Similar to Python lists, numpy arrays can be sliced.
• Since arrays may be multidimensional, you must specify a slice for each dimension of the array.
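A minimal slicing sketch on a 3x4 array (illustrative values). It also shows that mixing an integer index with a slice lowers the rank, while using only slices keeps it.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

b = a[:2, 1:3]       # rows 0-1, columns 1-2
row_r1 = a[1, :]     # integer index + slice: rank 1, shape (4,)
row_r2 = a[1:2, :]   # slices only: rank 2, shape (1, 4)

print(b)             # [[2 3]
                     #  [6 7]]
print(row_r1.shape, row_r2.shape)  # (4,) (1, 4)
```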
Array Indexing - Integer array indexing
• Integer array indexing allows you to construct arbitrary arrays using the data from another array.
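As a sketch of integer array indexing (illustrative values), one list of row indices and one list of column indices are paired up element by element:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Picks the elements at positions (0, 0), (1, 1) and (2, 0).
picked = a[[0, 1, 2], [0, 1, 0]]
print(picked)    # [1 4 5]

# The same element may be reused: positions (0, 1) and (0, 1).
repeated = a[[0, 0], [1, 1]]
print(repeated)  # [2 2]
```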
NumPy Data Types
• Basic Data Types in Python:
• string - used to represent text data; the text is enclosed in quotation marks, e.g. "ABCD"
• integer - used to represent whole numbers, e.g. -1, -2, -3
• float - used to represent real numbers, e.g. 1.2, 42.42
• boolean - used to represent True or False.
• complex - used to represent complex numbers, e.g. 1.0 + 2.0j, 1.5 + 2.5j
NumPy Data Types
• NumPy has some extra data types and refers to each kind of data type with a one-character code (the value reported by dtype.kind):
• i - integer
• b - boolean
• u - unsigned integer
• f - float
• c - complex float
• m - timedelta
• M - datetime
• O - object
• S - (byte) string
• U - unicode string
NumPy Data Types
• Checking the data type of an array:
• The NumPy array object has an attribute called dtype that returns the data type of the array.
• Creating arrays with a defined data type:
• The array() function takes an optional dtype argument that defines the expected data type of the array elements.
• Converting the data type of an existing array:
• The astype() method creates a copy of the array with the data type you pass as a parameter.
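The three operations can be sketched as follows. The values are illustrative, and the 'f' code used here is NumPy's shorthand for a 32-bit float.

```python
import numpy as np

arr = np.array([1.1, 2.1, 3.1])
print(arr.dtype)            # float64

# Request a data type at creation time.
as_float32 = np.array([1, 2, 3], dtype='f')
print(as_float32.dtype)     # float32

# astype() returns a converted copy; the fractional part is truncated.
as_int = arr.astype(int)
print(as_int)               # [1 2 3]
print(arr)                  # the original is unchanged: [1.1 2.1 3.1]
```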
NumPy Array Copy vs View
• The main difference between a copy and a view of an array
is that the copy is a new array, and the view is just a view
of the original array.
• The copy owns the data: changes made to the copy do not affect the original array, and changes made to the original array do not affect the copy.
• The view does not own the data and any changes made to
the view will affect the original array, and any changes
made to the original array will affect the view.
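A small sketch of the difference (illustrative values). The base attribute is None for an array that owns its data and refers to the original array for a view.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
c = arr.copy()   # new array that owns its data
v = arr.view()   # new array object sharing arr's data

arr[0] = 42      # change the original

print(c[0])      # 1: the copy is not affected
print(v[0])      # 42: the view sees the change
print(c.base is None, v.base is arr)  # True True
```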
NumPy Array Reshaping
• Reshaping means changing the shape of an array.
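As an illustrative sketch, a 12-element array can be reshaped to any shape whose sizes multiply to 12. Passing -1 is a NumPy convention that lets one dimension be inferred.

```python
import numpy as np

arr = np.arange(1, 13)          # 1-D array with 12 elements: 1..12

m = arr.reshape(4, 3)           # 4 rows x 3 columns
t = arr.reshape(2, 3, 2)        # 2 blocks of 3 rows x 2 columns
flat = m.reshape(-1)            # back to 1-D; -1 means "infer this size"

print(m.shape, t.shape, flat.shape)  # (4, 3) (2, 3, 2) (12,)
print(m[1])                          # [4 5 6]
```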
NumPy Array Iterating – Use for loop
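A minimal sketch of iterating with for loops (illustrative values): looping over a 2-D array yields its rows, and looping over a row yields its scalar elements.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

total = 0
for row in a:            # each row is a 1-D array
    for value in row:    # each value is a scalar
        total += value

print(total)  # 21
```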
NumPy Joining Array
NumPy Joining Array - concatenate
• Concatenation means joining. This function joins two or more arrays along an existing axis; the arrays must have the same shape except along that axis.
• Syntax: numpy.concatenate((array1, array2, ...), axis=0)
• If axis is not explicitly passed, it is taken as 0.
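A short sketch of concatenate along both axes of 2-D arrays (illustrative values). Note that the arrays only need to agree on the sizes of the axes that are not being joined.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

rows = np.concatenate((a, b))             # axis=0: append b as a new row
cols = np.concatenate((a, b.T), axis=1)   # axis=1: append b.T (shape (2, 1)) as a new column

print(rows)   # [[1 2]
              #  [3 4]
              #  [5 6]]
print(cols)   # [[1 2 5]
              #  [3 4 6]]
```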
NumPy Joining Array - stack
• This function joins a sequence of arrays of the same shape along a new axis.
• Syntax: numpy.stack(arrays, axis=0)
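Unlike concatenate, stack adds a dimension. A minimal sketch with two 1-D arrays (illustrative values):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

s0 = np.stack((a, b))           # new axis 0: a and b become rows, shape (2, 3)
s1 = np.stack((a, b), axis=1)   # new axis 1: a and b become columns, shape (3, 2)

print(s0)   # [[1 2 3]
            #  [4 5 6]]
print(s1)   # [[1 4]
            #  [2 5]
            #  [3 6]]
```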
NumPy Joining Array - hstack
• A stacking function related to numpy.stack that joins arrays horizontally (column-wise) into a single array.
• Syntax: numpy.hstack((array1, array2, …, arrayn))
NumPy Joining Array - vstack
• A stacking function related to numpy.stack that joins arrays vertically (row-wise) into a single array.
• Syntax: numpy.vstack((array1, array2, …, arrayn))
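A sketch comparing hstack and vstack on the same inputs (illustrative values). Both functions take the arrays as a single sequence, here a tuple.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

h = np.hstack((a, b))   # side by side: one longer 1-D array
v = np.vstack((a, b))   # on top of each other: a 2-D array with 2 rows

print(h)   # [1 2 3 4 5 6]
print(v)   # [[1 2 3]
           #  [4 5 6]]
```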
NumPy Splitting Array - numpy.split
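• Syntax: numpy.split(array, indices_or_sections, axis=0)
• indices_or_sections: an integer N splits the array into N equal parts; a sorted list of indices gives the positions at which to split.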
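Both forms of the second argument can be sketched on a 6-element array (illustrative values): an integer gives the number of equal parts, and a list gives the split positions.

```python
import numpy as np

arr = np.arange(6)                 # [0 1 2 3 4 5]

thirds = np.split(arr, 3)          # 3 equal parts
parts = np.split(arr, [2, 5])      # split before index 2 and before index 5

print([p.tolist() for p in thirds])  # [[0, 1], [2, 3], [4, 5]]
print([p.tolist() for p in parts])   # [[0, 1], [2, 3, 4], [5]]
```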
NumPy Searching Arrays - where() method
• The where() function searches an array for elements that satisfy a condition and returns the indexes of the matches.
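A minimal sketch of where() (illustrative values). With a single condition argument it returns a tuple holding one index array per dimension.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

idx = np.where(arr == 4)        # tuple with one index array (arr is 1-D)
even_idx = np.where(arr % 2 == 0)

print(idx[0])       # [3 5 6]
print(even_idx[0])  # [1 3 5 6]
```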
NumPy Searching Arrays - searchsorted() method
• The searchsorted() method performs a binary search in a sorted array and returns the index where the specified value would be inserted to maintain the sort order.
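A short sketch of searchsorted() on an already sorted array (illustrative values). The side argument controls whether the leftmost or the rightmost valid insertion index is returned.

```python
import numpy as np

arr = np.array([6, 7, 8, 9])    # must already be sorted

left = np.searchsorted(arr, 7)                  # leftmost valid position
right = np.searchsorted(arr, 7, side='right')   # rightmost valid position

print(left)    # 1
print(right)   # 2
```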
NumPy Sorting Arrays
• Syntax: numpy.sort(a, axis=-1, kind=None, order=None)
• a: Array to be sorted.
• axis: int or None, optional. Axis along which to sort. If None, the array is flattened before sorting. The default is -1, which sorts along the last axis.
• kind: {'quicksort', 'mergesort', 'heapsort', 'stable'}, optional. Sorting algorithm; the default is 'quicksort'.
• order: str or list of str, optional. When a is an array with fields defined, this argument specifies which fields to compare first, second, etc.
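The effect of the axis parameter can be sketched on a small 2-D array (illustrative values). numpy.sort returns a sorted copy and leaves the input unchanged.

```python
import numpy as np

m = np.array([[3, 2, 4],
              [5, 0, 1]])

by_row = np.sort(m)                 # default axis=-1: sort within each row
by_col = np.sort(m, axis=0)         # sort within each column
flat = np.sort(m, axis=None)        # flatten, then sort

print(by_row)   # [[2 3 4]
                #  [0 1 5]]
print(by_col)   # [[3 0 1]
                #  [5 2 4]]
print(flat)     # [0 1 2 3 4 5]
```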
Numpy Array Math
Basic Elementwise Operators
• Elementwise sum: + or numpy.add
• Elementwise difference: - or numpy.subtract
• Elementwise product: * or numpy.multiply
• Elementwise division: / or numpy.divide
• Elementwise square root: numpy.sqrt
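A sketch of the elementwise operators on two 2x2 arrays (illustrative values). Note that * multiplies element by element; it is not matrix multiplication.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[5.0, 6.0], [7.0, 8.0]])

print(x + y)        # same as np.add(x, y):      [[ 6.  8.] [10. 12.]]
print(x - y)        # same as np.subtract(x, y): [[-4. -4.] [-4. -4.]]
print(x * y)        # same as np.multiply(x, y): [[ 5. 12.] [21. 32.]]
print(x / y)        # same as np.divide(x, y)
print(np.sqrt(x))   # square root of each element
```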
Numpy Array Math
• The dot function is used to:
• Compute inner products of vectors
• Multiply a vector by a matrix
• Multiply matrices
• Syntax: given matrices/vectors a and b:
• numpy.dot(a, b)
• a.dot(b)
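The three uses of dot can be sketched with small vectors and matrices (illustrative values):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

inner = np.dot(v, w)    # inner product: 9*11 + 10*12
mat_vec = x.dot(v)      # matrix times vector
mat_mat = np.dot(x, y)  # matrix times matrix

print(inner)     # 219
print(mat_vec)   # [29 67]
print(mat_mat)   # [[19 22]
                 #  [43 50]]
```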
Full list of Numpy mathematical functions
• Link: https://numpy.org/doc/stable/reference/routines.math.html
• Categories:
• Trigonometric functions
• Hyperbolic functions
• Rounding
• Sums, products, differences
• Exponents and logarithms
• Other special functions
• Floating point routines
• Rational routines
• Arithmetic operations
• Handling complex numbers
• Extrema Finding
• Miscellaneous
Numpy - Broadcasting
• The term broadcasting refers to NumPy's ability to handle arrays of different shapes during arithmetic operations.
• Elementwise operations normally require both arrays to have the same shape.
• Operations on arrays of different shapes are still possible when the shapes are compatible: comparing dimensions from the last one backwards, each pair must be equal or one of them must be 1.
• The smaller array is broadcast to the size of the larger array so that they have compatible shapes.
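A minimal broadcasting sketch (illustrative values): a 1-D array of length 3 is broadcast across the rows of a 2x3 array, and a 2x1 column is broadcast across its columns.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])           # shape (2, 3)
v = np.array([10, 20, 30])          # shape (3,): added to every row
col = np.array([[100], [200]])      # shape (2, 1): added to every column

print(x + v)     # [[11 22 33]
                 #  [14 25 36]]
print(x + col)   # [[101 102 103]
                 #  [204 205 206]]
```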
Numpy - Broadcasting
Figure from: https://www.tutorialspoint.com/numpy/numpy_broadcasting.htm
Matplotlib Package
Reference:
https://www.w3schools.com/python/matplotlib_intr
What is Matplotlib?
• Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib Pyplot
• Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias:
• Syntax: import matplotlib.pyplot as plt
Basic Plot Type
• Line plot
• Scatter plot
Line plot
• The plot() function is used to draw points (markers) in a
diagram.
• By default, the plot() function draws a line from point to
point.
• Basic syntax: plt.plot(xpoints, ypoints)
• xpoints is an array containing the points on the x-axis.
• ypoints is an array containing the points on the y-axis.
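A minimal line plot sketch with illustrative data points. In an interactive session, calling plt.show() afterwards displays the figure.

```python
import numpy as np
import matplotlib.pyplot as plt

xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

# Draws one line connecting the points in the order given.
lines = plt.plot(xpoints, ypoints)
```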
plot() function with default X-Points
• If the points in the x-axis are not specified, they will get the
default values 0, 1, 2, 3, …
Plotting Options
• All options: https://matplotlib.org/2.1.2/api/_as_gen/matplotlib.pyplot.plot.html
Plot Label and Title
• To set a label for the x- and y-axis:
• xlabel()
• ylabel()
• To set a title for the plot:
• title()
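A sketch that labels a plot, assuming illustrative data and label texts. It also adds a legend with plt.legend(), whose loc argument accepts location strings such as 'best'.

```python
import matplotlib.pyplot as plt

plt.figure()                                   # start a fresh figure
plt.plot([1, 2, 3], [2, 4, 8], label="growth")
plt.xlabel("Day")                              # label for the x-axis
plt.ylabel("Value")                            # label for the y-axis
plt.title("Sample line plot")                  # title of the plot
plt.legend(loc="best")                         # legend built from the label= texts

ax = plt.gca()                                 # current axes, for inspection
```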
Legends
Legend Position
Location String    Location Code
'best'             0
'right'            5
'center left'      6
Legend Position - bbox_to_anchor
Scatter Plots
Customizing Markers in Scatter Plots
• Four main features of the markers used in a scatter plot
that can be customized:
• Size
• Color
• Shape (https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers)
• Transparency
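The four marker features map onto keyword arguments of plt.scatter: s (size), c (color), marker (shape), and alpha (transparency). A minimal sketch with illustrative data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([5, 7, 8, 7, 2])
y = np.array([99, 86, 87, 88, 111])
sizes = np.array([20, 50, 100, 200, 400])   # marker area, one value per point
colors = np.array([0, 25, 50, 75, 100])     # numbers mapped to colors via cmap

plt.figure()
points = plt.scatter(x, y, s=sizes, c=colors, cmap="viridis",
                     marker="^", alpha=0.5)  # triangles, 50% transparent
```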
ColorMap
Available ColorMaps:
https://www.w3schools.com/python/matplotlib_scatter.asp
plt.scatter(x, y, c=colors, cmap='Accent')
plt.scatter(x, y, c=colors, cmap='Blues')
Bar Plot
Matplotlib Multiple Bar Chart
Create Multiple Bar Chart
• Syntax: plt.bar(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
• The parameters are defined below:
• x: the x-coordinates of the bars.
• height: the height(s) of the bars.
• width: the width(s) of the bars (default 0.8).
• bottom: the y-coordinates of the bases of the bars (default 0).
• align: alignment of the bars to the x-coordinates, 'center' or 'edge'.
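One common way to draw a multiple bar chart is to call plt.bar once per series and shift each series by half a bar width. This is a sketch with hypothetical group names and values, not the example from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]          # hypothetical group names
team1 = [3, 5, 2]                 # hypothetical values
team2 = [4, 1, 6]

x = np.arange(len(labels))        # one x position per group: 0, 1, 2
width = 0.35                      # width of each bar

plt.figure()
bars1 = plt.bar(x - width / 2, team1, width, label="Team 1")  # shifted left
bars2 = plt.bar(x + width / 2, team2, width, label="Team 2")  # shifted right
plt.xticks(x, labels)             # put the group names under each pair
plt.legend()
```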
Matplotlib Histograms
A histogram is a graph showing frequency distributions.
Syntax to create a histogram plot:
matplotlib.pyplot.hist(x, bins=None, range=None, density=False,
weights=None, cumulative=False, bottom=None, histtype='bar',
align='mid', orientation='vertical', rwidth=None, log=False, color=None,
label=None, stacked=False, *, data=None, **kwargs)
Matplotlib Histograms - Options
• bins:
• If bins is an integer, it defines the number of equal-width bins
in the range.
• If bins is a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins may be unequally spaced. All bins but the last (righthand-most) are half-open.
• Example: if bins is [1, 2, 3, 4] then the first bin is [1, 2), and
the second [2, 3). The last bin is [3, 4].
• rwidth (default: None)
• The relative width of the bars as a fraction of the bin width. If
None, automatically compute the width.
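The bin-edge rule can be checked on a tiny data set (illustrative values). With bins=[1, 2, 3, 4], the value 4 falls into the last bin because that bin is closed on the right.

```python
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4]

plt.figure()
# Bins: [1, 2), [2, 3), [3, 4]; bars drawn at 90% of the bin width.
counts, edges, patches = plt.hist(data, bins=[1, 2, 3, 4],
                                  edgecolor="black", rwidth=0.9)

print(counts)   # [1. 2. 4.]
print(edges)    # [1. 2. 3. 4.]
```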
plt.hist(commutes, bins=10, edgecolor='black')
plt.hist(commutes, bins=20, edgecolor='black')
Matplotlib Pie Charts
Pandas Package
Introduction
Why Use Pandas?
• Pandas lets us analyze large data sets and draw conclusions based on statistical theory.
What Can Pandas Do?
• Pandas gives you answers about the data.
• For example:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the maximum value?
• What is the minimum value?
• Pandas can also delete rows that are irrelevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
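These questions map onto a few DataFrame methods. A minimal sketch on a small hypothetical data set (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data; the last row has an empty duration value.
df = pd.DataFrame({
    "duration": [60, 60, 45, None],
    "calories": [409.1, 479.0, 282.4, 300.0],
})

clean = df.dropna()                   # cleaning: drop rows with empty values

print(clean["duration"].mean())       # average value: 55.0
print(clean["calories"].max())        # max value: 479.0
print(clean["calories"].min())        # min value: 282.4
print(clean.corr())                   # pairwise correlation between columns
```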
Pandas Getting Started
• Install Pandas: pip install pandas
• Import Pandas:
• import pandas
• import pandas as pd
Pandas Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
Pandas Series - Labels
• If nothing else is specified, the values are labeled with their
index number.
• First value has index 0, second value has index 1 etc.
• This label can be used to access a specified value.
Output: 7
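One example consistent with the output shown above, assuming the series holds the values 1, 7, 2 (the values are an assumption; the slide's own code is not available here):

```python
import pandas as pd

s = pd.Series([1, 7, 2])   # default labels are 0, 1, 2

print(s[1])                # value with label 1: 7
```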
Pandas Series - Create Labels
• With the index argument, you can name your own labels.
Output: 7
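A sketch of custom labels consistent with the output shown above, assuming the values 1, 7, 2 and the labels "x", "y", "z" (both are assumptions for illustration):

```python
import pandas as pd

s = pd.Series([1, 7, 2], index=["x", "y", "z"])   # custom labels

print(s["y"])   # value with label "y": 7
```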
Key/Value Objects as Series
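A key/value object such as a Python dictionary can be turned into a Series, with the keys becoming the labels. A minimal sketch with hypothetical data:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}   # hypothetical data

s = pd.Series(calories)                              # keys become labels
subset = pd.Series(calories, index=["day1", "day2"]) # keep only the listed keys

print(s["day2"])        # 380
print(len(subset))      # 2
```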
Pandas DataFrames
• A Pandas DataFrame is a 2 dimensional data structure, like
a 2 dimensional array, or a table with rows and columns.
Access to DataFrame Elements
• Syntax: <dataframe_name>.loc[row_label, column_label]
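Note that loc is an attribute of a DataFrame object rather than of the pandas module. A minimal sketch on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

row = df.loc[0]                 # one row, returned as a Series
cell = df.loc[0, "calories"]    # one element: row label 0, column "calories"
rows = df.loc[[0, 1]]           # several rows, returned as a DataFrame

print(cell)         # 420
print(rows.shape)   # (2, 2)
```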
Named Indexes
• With the index argument, you can name your own indexes.
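A sketch of named indexes on a hypothetical DataFrame: the names passed through index replace the default labels 0, 1, 2 and can then be used with loc.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}  # hypothetical

df = pd.DataFrame(data, index=["day1", "day2", "day3"])

print(df.loc["day2"])               # the row labeled "day2"
print(df.loc["day2", "duration"])   # 40
```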
Pandas Read CSV
• What are CSV (comma-separated values) files?
• A simple way to store large data sets.
• CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
Load the CSV into a DataFrame
• Use read_csv() function.
• Syntax: pandas.read_csv(csv_filename)
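A minimal read_csv sketch. To stay self-contained it reads hypothetical CSV text from an in-memory buffer; with a real file you would pass the file name instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line holds the column names.
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,109,282.4\n"

df = pd.read_csv(io.StringIO(csv_text))   # or: pd.read_csv("<csv_filename>")

print(df.shape)    # (2, 3)
print(df.head())   # first rows of the DataFrame
```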