
Data Pre-Processing (Pandas)

This document discusses various methods for handling missing or null data in pandas including isnull(), notnull(), dropna(), fillna(), replace(), and interpolate(). It provides examples of using isnull() and notnull() to check for null values and return boolean series. It also discusses using dropna() to drop rows or columns containing null values.

Uploaded by

shweta mishra

Performing pre-processing operations on data for data analysis in Python

How to Convert Data Types in Python


Type conversion is the process of converting one data type to another. There
are two kinds of type conversion in Python –
 Implicit Type Conversion
 Explicit Type Conversion
Implicit Type Conversion
In implicit type conversion, the Python interpreter automatically converts one
data type to another without any user involvement.
Example:
Python3

# Python program to demonstrate
# implicit type conversion

# a is an int because it is assigned an integer literal
a = 5
print(type(a))

# b is a float because it is assigned a float literal
b = 1.0
print(type(b))

# Python implicitly converts a to float for the floor
# division, so the result c is a float
c = a // b
print(c)
print(type(c))

Output:
<class 'int'>
<class 'float'>
5.0
<class 'float'>
In the above example, it can be seen that Python handles all the type
conversion automatically without any user involvement.
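As a further illustration, the sketch below shows where implicit conversion applies and where it does not. The variable names are only for illustration: numeric types are promoted automatically, but Python does not implicitly convert between strings and numbers.

```python
# Numeric types are promoted implicitly: int + float gives a float.
total = 3 + 2.5
print(total, type(total))

# bool is a subclass of int, so True behaves as 1 in arithmetic.
flag_sum = True + 1
print(flag_sum, type(flag_sum))

# There is no implicit conversion between str and numbers:
# adding a string and an int raises TypeError.
try:
    "3" + 2
    raised = False
except TypeError:
    raised = True
print("TypeError raised:", raised)
```

This is why mixing strings and numbers needs the explicit conversions described next.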
Explicit Type Conversion
In explicit type conversion, user involvement is required. The user converts
one data type to another as needed, using built-in functions such as str(),
int(), and float(). Let’s see how the various type conversions are handled.

Type Conversion with strings

A string is a sequence of zero or more characters. We are often required to
convert strings to numbers and vice versa. Let’s see each of them in detail.
Converting Numbers to String
A number can be converted to string using the str() function. To do this pass
a number or a variable containing the numeric value to this function.
Example:
Python3

# Python program to demonstrate
# type conversion of number to
# string

a = 10

# Converting number to string
s = str(a)
print(s)
print(type(s))

Output:
10
<class 'str'>
This can be useful when we want to print some string containing a number to
the console. Consider the below example.
Example:
Python3

s = "GFG"

n = 50

print("String: " + s + "\nNumber: " + str(n))

Output:
String: GFG
Number: 50
Converting String to Number
A string can be converted to a number using the int() or float() function. To
do this, pass a valid string containing the numerical value to either of these
functions (depending upon the need).
Note: If a string that does not contain a numeric value is passed, a
ValueError is raised.
Example:
Python3

# Python program to demonstrate
# type conversion of string to
# number

s = '50'

# Converting to int
n = int(s)
print(n)
print(type(n))

# Converting to float
f = float(s)
print(f)
print(type(f))

Output:
50
<class 'int'>
50.0
<class 'float'>
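The note about non-numeric strings can be sketched as follows. The helper name to_number is hypothetical and not part of Python; it is one possible way to handle the ValueError that int() and float() raise for invalid input.

```python
def to_number(text):
    """Return text as an int or float, or None if it is not numeric.

    Hypothetical helper for illustration only.
    """
    try:
        return int(text)
    except ValueError:
        pass  # not a valid integer literal, try float next
    try:
        return float(text)
    except ValueError:
        return None  # neither int() nor float() accepted the string


print(to_number("50"))
print(to_number("50.5"))
print(to_number("GFG"))
```

Returning None is only one design choice; re-raising the error or supplying a default value may suit other situations better.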

Type Conversion with Numbers

The two most commonly used kinds of numbers in Python are integers and
floating-point numbers. We are often required to change from one type to
another. Let’s see their conversion in detail.

Floating Point to Integer

A floating-point number can be converted to an integer using the int()
function. To do this, pass the floating-point value to int().
Example:
Python3

# Python program to demonstrate
# floating point to integer

f = 10.0

# Converting to integer
n = int(f)
print(n)
print(type(n))

Output:
10
<class 'int'>
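The example above uses 10.0, which has no fractional part. A short sketch with fractional values (chosen here only for illustration) shows that int() truncates toward zero, which differs from rounding and from flooring:

```python
import math

values = [10.7, -10.7]

# int() discards the fractional part (truncation toward zero).
truncated = [int(v) for v in values]

# round() goes to the nearest integer instead.
rounded = [round(v) for v in values]

# math.floor() always goes down, which differs for negatives.
floored = [math.floor(v) for v in values]

print(truncated)
print(rounded)
print(floored)
```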
Integer to Floating Point

An integer can be converted to a float using the float() function. To do this,
pass an integer to float().
Example:
Python3

# Python program to demonstrate
# integer to float

n = 10

# Converting to float
f = float(n)
print(f)
print(type(f))

Output:
10.0
<class 'float'>

Type conversion between Tuple and List

In Python, tuples and lists can be converted to one another. This is done
using the tuple() and list() constructors. See the examples below for a
better understanding.
Example:
Python3

# Python program to demonstrate
# type conversion between list
# and tuples

t = (1, 2, 3, 4)
l = [5, 6, 7, 8]

# Converting to tuple
T = tuple(l)
print(T)
print(type(T))

# Converting to list
L = list(t)
print(L)
print(type(L))

Output:
(5, 6, 7, 8)
<class 'tuple'>
[1, 2, 3, 4]
<class 'list'>

Python Pandas - Categorical Data


Real-world data often includes text columns whose values are repetitive.
Features like gender, country, and codes are typically repetitive. These are
examples of categorical data.

Categorical variables can take on only a limited, and usually fixed, number
of possible values. Besides having a fixed set of values, categorical data
might have an order, but numerical operations cannot be performed on it.
Categorical is a pandas data type.

The categorical data type is useful in the following cases −

 A string variable consisting of only a few different values.
Converting such a string variable to a categorical variable
will save some memory.
 The lexical order of a variable is not the same as the logical
order (“one”, “two”, “three”). By converting to a categorical
and specifying an order on the categories, sorting and
min/max will use the logical order instead of the lexical
order.
 As a signal to other python libraries that this column should
be treated as a categorical variable (e.g. to use suitable
statistical methods or plot types).
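The first two cases above can be sketched as follows. The sample values are made up for illustration, and exact byte counts depend on the pandas version, so only the comparison is shown.

```python
import pandas as pd

# Case 1: a string column with only a few distinct values.
s = pd.Series(["low", "high", "medium", "low"] * 1000)
cat = s.astype("category")
obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print("string column bytes:     ", obj_bytes)
print("categorical column bytes:", cat_bytes)

# Case 2: logical order differs from lexical order.
sizes = pd.Series(["two", "three", "one"])
logical = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = sizes.astype(logical)

lexical_sorted = sizes.sort_values().tolist()
logical_sorted = ordered.sort_values().tolist()
print(lexical_sorted)
print(logical_sorted)
print(ordered.min(), ordered.max())
```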
Object Creation

Categorical objects can be created in multiple ways. The different
ways are described below −

category

By specifying the dtype as "category" in pandas object creation.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print(s)
Its output is as follows −
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]

The number of elements passed to the series object is four, but
the categories are only three. Observe the same in the output
Categories.
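To inspect this programmatically, a categorical Series exposes its distinct categories and the integer code stored for each element through the .cat accessor. A minimal sketch, reusing the same four values:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct categories (three of them).
categories = s.cat.categories.tolist()

# The integer code stored for each of the four elements.
codes = s.cat.codes.tolist()

print(categories)
print(codes)
```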

pd.Categorical

Using the standard pandas Categorical constructor, we can create
a category object.

pandas.Categorical(values, categories, ordered)

Let’s take an example −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]

Let’s have another example −

import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     ['c', 'b', 'a'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument signifies the categories. Thus, any
value which is not present in the categories will be treated
as NaN.

Now, take a look at the following example −

import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     ['c', 'b', 'a'], ordered=True)
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, the order means that a is greater than b and b is
greater than c.
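The effect of this ordering can be checked with comparisons and min/max. The sketch below uses the same categories but leaves out the value 'd' so that no NaN is involved.

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'],
                     categories=['c', 'b', 'a'], ordered=True)

# With the order c < b < a, the smallest category is 'c'
# and the largest is 'a'.
smallest = cat.min()
largest = cat.max()
print(smallest, largest)

# Element-wise comparison against a category uses the same order:
# only 'a' is greater than 'b'.
greater_than_b = (cat > 'b').tolist()
print(greater_than_b)
```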
Description
Using the .describe() command on the categorical data, we get
similar output to a Series or DataFrame of the type string.
import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan],
                     categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})

print(df.describe())
print(df["cat"].describe())
Its output is as follows −
       cat s
count    3 3
unique   2 2
top      c c
freq     2 2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Working with Missing Data in Pandas


Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. To support this convention, there are several useful
functions for detecting, removing, and replacing null values in a pandas
DataFrame:
 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()

Pandas isnull() and notnull() methods are used to check for NULL values in a
data frame.
Pandas DataFrame isnull() Method
Syntax: pandas.isnull(obj) or DataFrame.isnull()
Parameters: Object to check null values for
Return Type: DataFrame (or Series) of Boolean values which are True for NaN values
Example
In the following example, the Team column is checked for NULL values and
a boolean series is returned by the isnull() method which stores True for every
NaN value and False for every non-null value.

 Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for NaN values
bool_series = pd.isnull(data["Team"])

# filtering data
# displaying data only with team = NaN
data[bool_series]

Output:
As shown in the output image, only the rows having Team=NULL are
displayed.
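Because the output image and employees.csv are not reproduced here, the same filtering can be sketched on a small in-memory frame. The rows below are invented for illustration and are not taken from that file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
data = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Team": ["Marketing", np.nan, None],
})

# True where Team is missing (both NaN and None count as null).
bool_series = data["Team"].isnull()
print(bool_series.tolist())

# Keep only the rows whose Team is missing.
missing_team = data[bool_series]
print(missing_team)

# Called on the whole frame, isnull() allows counting nulls per column.
null_counts = data.isnull().sum()
print(null_counts)
```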

Pandas DataFrame notnull() Method


Syntax: pandas.notnull(obj) or DataFrame.notnull()
Parameters: Object to check null values for
Return Type: DataFrame (or Series) of Boolean values which are False for NaN values
Example
In the following example, the Gender column is checked for NULL values
and a boolean series is returned by the notnull() method which stores True
for every non-null value and False for every null value.

 Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series False for NaN values
bool_series = pd.notnull(data["Gender"])

# displaying data only where Gender is not NaN
data[bool_series]

Output:
As shown in the output image, only the rows having some value in Gender
are displayed.
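A self-contained sketch of the same idea follows, using invented rows rather than employees.csv. It also shows that notnull() is the element-wise inverse of isnull().

```python
import pandas as pd

# Hypothetical stand-in for employees.csv.
data = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Gender": ["Male", None, "Female"],
})

# Keep only the rows that have some value in Gender.
has_gender = data[data["Gender"].notnull()]
print(has_gender)

# Count the non-null entries in each column.
non_null_counts = data.notnull().sum()
print(non_null_counts)

# notnull() is the inverse of isnull().
is_inverse = (data["Gender"].notnull() == ~data["Gender"].isnull()).all()
print(is_inverse)
```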
Pandas dropna() method allows the user to analyze and drop
Rows/Columns with Null values in different ways.
Pandas DataFrame.dropna() Syntax
Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None,
subset=None, inplace=False)
Parameters:
 axis: takes an int or string value for rows/columns. Input can be 0 or 1
for an integer, and 'index' or 'columns' for a string.
 how: takes one of two string values ('any' or 'all'). 'any' drops
the row/column if ANY value is null, and 'all' drops it only if ALL values
are null.
 thresh: takes an integer giving the minimum number of non-null values a
row/column must have to be kept; rows/columns with fewer are dropped.
 subset: a list of labels along the other axis which limits the null check
to those rows/columns.
 inplace: a boolean which makes the changes in the data frame itself if
True.
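The parameters above can be compared side by side on a small frame. The data below is invented for illustration, and each call returns a new DataFrame because inplace is left at its default of False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Team": ["X", np.nan, np.nan, "Y"],
    "Salary": [10.0, 20.0, np.nan, np.nan],
})

# how='any': drop rows with at least one null (only row 0 survives).
any_dropped = df.dropna(how="any")

# how='all': drop rows where every value is null (none here).
all_dropped = df.dropna(how="all")

# thresh=2: keep rows with at least 2 non-null values (row 2 is dropped).
thresh_kept = df.dropna(thresh=2)

# subset: only look at the Team column when deciding what to drop.
subset_kept = df.dropna(subset=["Team"])

# axis=1: drop columns that contain any null (only Name survives).
cols_dropped = df.dropna(axis=1)

print(any_dropped.index.tolist())
print(len(all_dropped))
print(thresh_kept.index.tolist())
print(subset_kept.index.tolist())
print(cols_dropped.columns.tolist())
```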

# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv")

# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')

# comparing sizes of data frames
print("Old data frame length:", len(data),
      "\nNew data frame length:", len(new_data),
      "\nNumber of rows with at least 1 NA value: ",
      (len(data) - len(new_data)))

Output:
Since the difference is 94, there were 94 rows that had at least 1 null
value in any column.

Old data frame length: 458
New data frame length: 364
Number of rows with at least 1 NA value: 94

Python | Pandas DataFrame.fillna() to replace Null values in dataframe
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False,
limit=None, downcast=None, **kwargs)
Parameters:
 value : Static value, dictionary, array, series or dataframe to fill in
place of NaN.
 method : Used if the user doesn’t pass any value. Pandas has fill methods
such as ffill, which propagates the last valid (previous) value forward, and
bfill or backfill, which fill with the next valid value.
 axis : takes an int or string value for rows/columns. Input can be 0 or 1
for an integer, and ‘index’ or ‘columns’ for a string.
 inplace : a boolean which makes the changes in the data frame itself if
True.
 limit : an integer which specifies the maximum number of consecutive NaN
values to forward/backward fill.
 downcast : takes a dict which specifies which dtype to downcast to, for
example float64 to int64.
 **kwargs : Any other keyword arguments.
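The difference between filling with a value and filling forward or backward can be sketched on a short series. The data is invented for illustration, and the sketch uses the dedicated ffill() and bfill() methods, since recent pandas versions deprecate passing method= to fillna().

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Fill every NaN with a static value.
constant = s.fillna(0)

# Forward fill: propagate the last valid value.
forward = s.ffill()

# Backward fill: use the next valid value.
backward = s.bfill()

# limit=1: fill at most one consecutive NaN.
limited = s.ffill(limit=1)

# A dict fills each column with its own value.
df = pd.DataFrame({"College": ["Texas", None],
                   "Salary": [np.nan, 100.0]})
filled = df.fillna({"College": "No College", "Salary": 0})

print(constant.tolist())
print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
print(filled)
```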

Example #1: Replacing NaN values with a Static value. Before replacing:
 Python3

# importing pandas module
import pandas as pd

# making data frame from csv file
nba = pd.read_csv("nba.csv")

nba
Output:

After replacing: In the following example, all the null values in the College
column have been replaced with the “No College” string. First, the data frame
is imported from CSV, then the College column is selected and the fillna()
method is used on it.

 Python

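# importing pandas module
import pandas as pd

# making data frame from csv file
nba = pd.read_csv("nba.csv")

# replacing na values in College with "No College";
# assigning the result back avoids relying on inplace=True
# on a selected column, which may not update nba itself
nba["College"] = nba["College"].fillna("No College")

nba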

Output:

Python | Pandas dataframe.replace()


Pandas dataframe.replace() function is used to replace a string,
regex, list, dictionary, series, number, etc. in a pandas
DataFrame in Python.

Syntax of dataframe.replace()

Syntax: DataFrame.replace(to_replace=None, value=None, inplace=False,
limit=None, regex=False, method='pad', axis=None)

Parameters:
 to_replace : [str, regex, list, dict, Series, numeric, or None] pattern that
we are trying to replace in dataframe.
 value : Value to use to fill holes (e.g. 0), alternately a dict of values
specifying which value to use for each column (columns not in the dict will
not be filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
 inplace : If True, in place. Note: this will modify any other views on this
object (e.g. a column from a DataFrame). Returns the caller if this is
True.
 limit : Maximum size gap to forward or backward fill
 regex : Whether to interpret to_replace and/or value as regular
expressions. If this is True then to_replace must be a string. Alternatively,
this parameter can itself be a regular expression or a list, dict, or array of
regular expressions, in which case to_replace must be None.
 method : Method to use for replacement when to_replace is a list.
Returns: filled : NDFrame

Example:

Here, we are replacing 49.50 with 60.

 Python3

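import pandas as pd

# dictionary of columns; the closing brace was missing
df = {
    "Array_1": [49.50, 70],
    "Array_2": [65.1, 49.50]
}

data = pd.DataFrame(df)

# replace every occurrence of 49.50 with 60
print(data.replace(49.50, 60))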

Output:
   Array_1  Array_2
0     60.0     65.1
1     70.0     60.0
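Beyond a single scalar, the to_replace parameter also accepts dicts, lists, and regular expressions, as listed in the parameters above. The sketch below uses invented data to show one example of each form.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["Boston Celtics", "boston celtics", "LA Lakers"],
    "Age": [25, -1, 30],
})

# Nested dict: in column Age, treat the placeholder -1 as missing.
cleaned = df.replace({"Age": {-1: np.nan}})

# Regex: fix a lowercase prefix in string columns.
renamed = df.replace(r"^boston", "Boston", regex=True)

# Lists: each item in the first list maps to the matching
# item in the second list.
listed = df.replace(["LA Lakers"], ["Los Angeles Lakers"])

print(cleaned)
print(renamed)
print(listed)
```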
