Data Pre-Processing (Pandas)
# a is an int
a = 5
print(type(a))
# b is a float
b = 1.0
print(type(b))
# Python automatically converts a to float for the division
c = a // b
print(c)
print(type(c))
Output:
<class 'int'>
<class 'float'>
5.0
<class 'float'>
In the above example, Python converts a to float automatically before
performing the division, without any user involvement.
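As a minimal sketch of where implicit conversion applies and where it stops (the variable names are illustrative only): mixing int and float widens the result to float, while mixing str and int is not converted implicitly and raises a TypeError.

```python
# Implicit conversion: the int is widened to float in mixed arithmetic
total = 5 + 1.0
print(total, type(total))

# bool is a subclass of int, so True behaves like 1
count = True + 2
print(count, type(count))

# There is no implicit conversion between str and int
try:
    "5" + 1
    raised = False
except TypeError:
    raised = True
print(raised)
```

The last case is the reason explicit conversion functions are needed at all.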
Explicit Type Conversion
In explicit type conversion, user involvement is required. The user converts
one data type to another as needed, using functions such as str(), int(),
and float(). Let’s see how the various type conversions are handled.
# Converting a number to a string
a = 10
s = str(a)
print(s)
print(type(s))
Output:
10
<class 'str'>
This can be useful when we want to print some string containing a number to
the console. Consider the below example.
Example:
Python3
s = "GFG"
n = 50
print("String: " + s + "\nNumber: " + str(n))
Output:
String: GFG
Number: 50
Converting String to Number
A string can be converted to a number using the int() or float() function.
To do this, pass a valid string containing the numerical value to either of
these functions (depending upon the need).
Note: If the string does not contain a valid numeric value, a ValueError is
raised.
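The error mentioned in the note can be caught rather than left to stop the program. The sketch below assumes a hypothetical helper named to_number (it is not a built-in) that tries int() first, then float(), and returns None when neither accepts the string.

```python
def to_number(text):
    """Return text as an int or float, or None if it is not numeric.

    to_number is a hypothetical helper written for this illustration.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            # This conversion rejected the string; try the next one
            continue
    return None


print(to_number("50"))    # converted by int()
print(to_number("50.5"))  # int() fails, float() succeeds
print(to_number("GFG"))   # neither succeeds
```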
Example:
Python3
# String containing a number
s = '50'
# Converting to int
n = int(s)
print(n)
print(type(n))
# Converting to float
f = float(s)
print(f)
print(type(f))
Output:
50
<class 'int'>
50.0
<class 'float'>
There are basically two types of numbers in Python: integers and
floating-point numbers. We are often required to change from one type to
another. A floating-point number can be converted to an integer with the
int() function, which discards the fractional part.
f = 10.0
# Converting to integer
n = int(f)
print(n)
print(type(n))
Output:
10
<class 'int'>
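The value 10.0 above has no fractional part, so it does not show what int() does with one. A short sketch comparing int() with the standard math.floor() and round() functions:

```python
import math

truncated_pos = int(10.9)       # fractional part is dropped
truncated_neg = int(-10.9)      # truncates toward zero, not downward
floored_neg = math.floor(-10.9)  # rounds down
rounded = round(10.9)           # rounds to the nearest integer

print(truncated_pos, truncated_neg, floored_neg, rounded)
```

If rounding is what is wanted, round() should be used instead of int().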
Integer to Floating Point
An integer can be converted to a float using the float() function. To do
this, pass the integer to float().
Example:
Python3
# integer to float
n = 10
# Converting to float
f = float(n)
print(f)
print(type(f))
Output:
10.0
<class 'float'>
In Python, tuples and lists can be converted to one another using the
tuple() and list() functions. See the examples below for better
understanding.
Example:
Python3
# Sample tuple and list
t = (1, 2, 3, 4)
l = [5, 6, 7, 8]
# Converting to tuple
T = tuple(l)
print(T)
print(type(T))
# Converting to list
L = list(t)
print(L)
print(type(L))
Output:
(5, 6, 7, 8)
<class 'tuple'>
[1, 2, 3, 4]
<class 'list'>
category
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
print(s)
Its output is as follows −
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
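An existing column can also be converted after the fact. The sketch below assumes a small hypothetical DataFrame with a grade column, converts it with astype("category"), and inspects the categories and the integer codes pandas stores for each value.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"grade": ["a", "b", "c", "a"]})

# Convert an existing column to the category dtype
df["grade"] = df["grade"].astype("category")

categories = list(df["grade"].cat.categories)  # the distinct labels
codes = df["grade"].cat.codes.tolist()         # integer position of each value

print(categories)
print(codes)
```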
pd.Categorical
import pandas as pd
cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     categories=['c', 'b', 'a'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument signifies the categories. Thus, any
value which is not present in the categories will be treated
as NaN.
import pandas as pd
cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     categories=['c', 'b', 'a'], ordered=True)
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, the order means that a is greater than b, and b is
greater than c.
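The practical effect of ordered=True is that comparisons, min/max, and sorting follow the declared category order instead of alphabetical order. A minimal sketch using the same c < b < a ordering (the value 'd' is left out here so that no NaN appears):

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a", "b", "c"],
                     categories=["c", "b", "a"], ordered=True)
s = pd.Series(cat)

smallest, largest = s.min(), s.max()  # follow the category order
above_b = (s > "b").tolist()          # True only where the value is 'a'
in_order = s.sort_values().tolist()   # sorted as c, b, a

print(smallest, largest)
print(above_b)
print(in_order)
```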
Description
Using the .describe() command on the categorical data, we get
similar output to a Series or DataFrame of the type string.
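import pandas as pd
import numpy as np
# Assumed sample data: the original definition of df is missing here
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
print(df.describe())
print(df["cat"].describe())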
Its output is as follows −
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
Pandas isnull() and notnull() methods are used to check and manage NULL
values in a data frame.
Pandas DataFrame isnull() Method
Syntax: pd.isnull(DataFrame) or DataFrame.isnull()
Parameters: Object to check null values for
Return Type: DataFrame of Boolean values which are True for NaN values
Example
In the following example, the Team column is checked for null values, and
the isnull() method returns a Boolean series that stores True for every
NaN value and False for every non-null value.
Python3
import pandas as pd
data = pd.read_csv("employees.csv")
bool_series = pd.isnull(data["Team"])
# filtering data
data[bool_series]
Output:
Only the rows in which Team is null are displayed.
Python3
import pandas as pd
data = pd.read_csv("employees.csv")
# notnull() stores True for every non-null value
bool_series = pd.notnull(data["Gender"])
data[bool_series]
Output:
Only the rows that have some value in Gender are displayed.
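Since employees.csv is not included here, the same filtering can be tried on a small in-memory frame. The data below is a hypothetical stand-in, not the contents of the real file; the last step also counts the null values per column, which is a common first check.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv
data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Team": ["Sales", np.nan, "Legal"],
    "Gender": [np.nan, "Male", "Female"],
})

missing_team = data[data["Team"].isnull()]   # rows where Team is null
has_gender = data[data["Gender"].notnull()]  # rows where Gender is present
null_counts = data.isnull().sum()            # null count per column

print(missing_team)
print(has_gender)
print(null_counts)
```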
Pandas dropna() method allows the user to analyze and drop
Rows/Columns with Null values in different ways.
Pandas DataFrame.dropna() Syntax
Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None,
subset=None, inplace=False)
Parameters:
axis: axis takes int or string value for rows/columns. Input can be 0 or 1
for Integer and 'index' or 'columns' for String.
how: how takes string value of two kinds only ('any' or 'all'). 'any' drops
the row/column if ANY value is Null and 'all' drops only if ALL values are
null.
thresh: thresh takes an integer value giving the minimum number of
non-null values a row/column must have in order to be kept.
subset: It’s a list of labels which limits the search for null values to
the passed rows/columns.
inplace: It is a boolean which makes the changes in the data frame itself
if True.
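The parameters above can be compared side by side on a small frame. The data below is a hypothetical sample chosen so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: row 2 is entirely null
df = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [1.0, 2.0, np.nan, np.nan],
    "C": [1.0, 2.0, np.nan, np.nan],
})

any_null = df.dropna(how="any")     # drop rows with at least one null
all_null = df.dropna(how="all")     # drop rows where every value is null
at_least_two = df.dropna(thresh=2)  # keep rows with 2 or more non-null values
subset_a = df.dropna(subset=["A"])  # only look at column A

print(any_null.index.tolist())
print(all_null.index.tolist())
print(at_least_two.index.tolist())
print(subset_a.index.tolist())
```

None of these calls changes df itself, because inplace is left at its default of False.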
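import pandas as pd
data = pd.read_csv("nba.csv")
# Drop every row that has at least one null value
new_data = data.dropna(axis=0, how="any")
print(len(data), len(new_data),
      (len(data)-len(new_data)))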
Output:
Since the difference is 94, there were 94 rows that had at least 1 Null
value in any column.
Example #1: Replacing NaN values with a Static value. Before replacing:
Python3
import pandas as pd
nba = pd.read_csv("nba.csv")
nba
Output:
After replacing: In the following example, all the null values in the
College column have been replaced with the “No college” string. First, the
data frame is imported from CSV, then the College column is selected and
the fillna() method is used on it.
Python
import pandas as pd
nba = pd.read_csv("nba.csv")
nba["College"] = nba["College"].fillna("No college")  # replace nulls in College
nba
Output:
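Since nba.csv is not included here, the sketch below uses a hypothetical three-row stand-in with the same College column plus a Salary column. It shows fillna() with a single static value on one column, and with a dict that gives a different static value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nba.csv
nba = pd.DataFrame({
    "Name": ["P1", "P2", "P3"],
    "College": ["Texas", np.nan, np.nan],
    "Salary": [100.0, np.nan, 300.0],
})

# One static value for one column
college_filled = nba["College"].fillna("No college")

# A different static value per column, passed as a dict
filled = nba.fillna({"College": "No college", "Salary": 0})

print(college_filled.tolist())
print(filled)
```

Both calls return new objects; nba keeps its null values unless the result is assigned back.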
Syntax of dataframe.replace()
Parameters:
to_replace : [str, regex, list, dict, Series, numeric, or None] the value
or pattern that we are trying to replace in the DataFrame.
value : Value to use to fill holes (e.g. 0), alternately a dict of values
specifying which value to use for each column (columns not in the dict will
not be filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : If True, perform the replacement in place. Note: this will modify
any other views on this object (e.g. a column from a DataFrame). Returns
the caller if this is True.
limit : Maximum size gap to forward or backward fill.
regex : Whether to interpret to_replace and/or value as regular
expressions. If this is True then to_replace must be a string.
Alternatively, regex itself can be a regular expression or a list, dict, or
array of regular expressions, in which case to_replace must be None.
method : Method to use for the replacement when to_replace is a list.
Returns: filled : NDFrame
Example:
Python3
import pandas as pd
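# Values reconstructed from the output shown below
df = {"Array_1": [49.50, 70], "Array_2": [65.1, 49.50]}
data = pd.DataFrame(df)
print(data.replace(49.50, 60))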
Output:
   Array_1  Array_2
0     60.0     65.1
1     70.0     60.0
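Beyond a single scalar, to_replace also accepts a dict and a regular expression, as the parameter list describes. The sketch below uses a hypothetical frame with a city and a score column to show the three forms.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "city": ["NY", "LA", "ny"],
    "score": [-1, 70, 85],
})

# dict form: map each old value to its new value
by_dict = df.replace({"NY": "New York", "LA": "Los Angeles"})

# nested dict form: only replace -1 inside the score column
by_column = df.replace({"score": {-1: 0}})

# regex form: case-insensitive match on the whole string
by_regex = df.replace(to_replace=r"(?i)^ny$", value="New York", regex=True)

print(by_dict)
print(by_column)
print(by_regex)
```

The dict form matches values exactly, so the lowercase "ny" is left alone there and is only caught by the regex form.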