Data Pre-Processing (Pandas)

This document discusses various methods for handling missing or null data in pandas including isnull(), notnull(), dropna(), fillna(), replace(), and interpolate(). It provides examples of using isnull() and notnull() to check for null values and return boolean series. It also discusses using dropna() to drop rows or columns containing null values.

Uploaded by

shweta mishra


Performing pre-processing operations on data for data analysis in Python

How To Convert Data Types in Python?

Type conversion is the process of converting a value of one data type to another. Python has two kinds of type conversion –
• Implicit Type Conversion
• Explicit Type Conversion
Implicit Type Conversion
In implicit type conversion, Python automatically converts one data type to
another without any user involvement.
Example:
Python3

# Python program to demonstrate
# implicit type conversion

# a is an int literal, so its type is int
a = 5
print(type(a))

# b is a float literal, so its type is float
b = 1.0
print(type(b))

# Python implicitly converts a to float before the floor division,
# so c is a float (5.0), not an int
c = a // b
print(c)
print(type(c))

Output:
<class 'int'>
<class 'float'>
5.0
<class 'float'>
In the above example, it can be seen that Python handles all the type
conversion automatically without any user involvement.
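A few more cases may help make the rule concrete. The sketch below uses made-up variable names and only built-in behaviour: mixing an int with a float widens the result to float, a bool behaves like an int in arithmetic, and true division of two ints returns a float.

```python
# int + float: the int operand is widened to float
total = 3 + 2.5
print(total, type(total))   # 5.5 <class 'float'>

# bool is a subclass of int, so True counts as 1 in arithmetic
count = True + 4
print(count, type(count))   # 5 <class 'int'>

# true division of two ints always returns a float
ratio = 6 / 3
print(ratio, type(ratio))   # 2.0 <class 'float'>
```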
Explicit Type Conversion
In explicit type conversion, user involvement is required: the user converts
one data type to another as needed. This can be done with functions such as
str(), int(), and float(). Let’s see how various type conversions are handled.

Type Conversion with strings

A string is generally a sequence of one or more characters. We are often
required to convert strings to numbers and vice versa. Let’s see each of them
in detail.
Converting Numbers to String
A number can be converted to a string using the str() function. To do this, pass
a number or a variable containing a numeric value to this function.
Example:
Python3

# Python program to demonstrate
# type conversion of number to
# string

a = 10

# Converting number to string
s = str(a)
print(s)
print(type(s))

Output:
10
<class 'str'>
This can be useful when we want to print some string containing a number to
the console. Consider the below example.
Example:
Python3

s = "GFG"
n = 50
print("String: " + s + "\nNumber: " + str(n))

Output:
String: GFG
Number: 50
Converting String to Number
A string can be converted to a number using the int() or float() function. To do
this, pass a valid string containing the numerical value to either of these
functions (depending on the need).
Note: If a string that does not contain a valid numeric value is passed, a
ValueError is raised.
Example:
Python3

# Python program to demonstrate
# type conversion of string to
# number

s = '50'

# Converting to int
n = int(s)
print(n)
print(type(n))

# Converting to float
f = float(s)
print(f)
print(type(f))

Output:
50
<class 'int'>
50.0
<class 'float'>
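Because int() and float() raise ValueError for non-numeric text, input of unknown quality is often wrapped in try/except. The helper below is a hypothetical sketch (the name to_number is not part of Python or of this tutorial): it tries int() first, falls back to float(), and returns None when neither works.

```python
def to_number(text):
    """Return an int or float parsed from text, or None if it is not numeric."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            # this converter rejected the text; try the next one
            continue
    return None

print(to_number("50"))    # 50
print(to_number("3.14"))  # 3.14
print(to_number("abc"))   # None
```

Returning None rather than re-raising is only one possible design; code that must not silently skip bad input may prefer to let the ValueError propagate.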

Type Conversion with Numbers

There are basically two types of numbers in Python – integers and floating-point
numbers. We are often required to change from one type to another.
Let’s see their conversion in detail.

Floating Point to Integer

A floating-point number can be converted to an integer using the int() function. To do
this, pass the floating-point number to int().
Example:
Python3

# Python program to demonstrate
# floating point to integer

f = 10.0

# Converting to integer
n = int(f)
print(n)
print(type(n))

Output:
10
<class 'int'>
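The example above uses a value with no fractional part. When there is one, int() discards it, truncating toward zero rather than rounding. A small sketch comparing it with math.floor() and round():

```python
import math

print(int(10.7))          # 10  (fraction discarded)
print(int(-10.7))         # -10 (truncates toward zero)
print(math.floor(-10.7))  # -11 (always rounds down)
print(round(10.7))        # 11  (rounds to the nearest integer)
```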
Integer to Floating Point

An integer can be converted to a float using the float() function. To do this, pass
an integer to float().
Example:
Python3

# Python program to demonstrate
# integer to float

n = 10

# Converting to float
f = float(n)
print(f)
print(type(f))

Output:
10.0
<class 'float'>

Type conversion between Tuple and List

In Python, tuples and lists can be converted to one another using the
tuple() and list() functions. See the examples below for a better
understanding.
Example:
Python3

# Python program to demonstrate
# type conversion between list
# and tuples

t = (1, 2, 3, 4)
l = [5, 6, 7, 8]

# Converting to tuple
T = tuple(l)
print(T)
print(type(T))

# Converting to list
L = list(t)
print(L)
print(type(L))

Output:
(5, 6, 7, 8)
<class 'tuple'>
[1, 2, 3, 4]
<class 'list'>
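One common reason for this conversion is that tuples are immutable while lists are not. The sketch below, with illustrative variable names, converts a tuple to a list, edits it, and converts it back.

```python
point = (1, 2, 3)

# tuples do not support item assignment
try:
    point[0] = 10
except TypeError as err:
    print("cannot modify tuple:", err)

# convert to a list, modify it, then convert back to a tuple
editable = list(point)
editable[0] = 10
point = tuple(editable)
print(point)  # (10, 2, 3)
```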

Python Pandas - Categorical Data

Real-world data often includes text columns whose values are
repetitive. Features like gender, country, and codes are always
repetitive. These are examples of categorical data.

Categorical variables can take on only a limited, and usually fixed,
number of possible values. Besides the fixed number of values, categorical
data might have an order, but numerical operations cannot be
performed on it. Categorical is a pandas data type.

The categorical data type is useful in the following cases −

• A string variable consisting of only a few different values.
Converting such a string variable to a categorical variable
will save some memory.
• The lexical order of a variable is not the same as the logical
order (“one”, “two”, “three”). By converting to a categorical
and specifying an order on the categories, sorting and
min/max will use the logical order instead of the lexical
order.
• As a signal to other Python libraries that this column should
be treated as a categorical variable (e.g. to use suitable
statistical methods or plot types).
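The first two cases can be checked on a small example. The data below is made up for illustration; memory_usage(deep=True) reports bytes, and the exact numbers depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: 30,000 strings drawn from only three distinct values
sizes = pd.Series(["small", "medium", "large"] * 10_000)
as_strings = sizes.memory_usage(deep=True)
as_category = sizes.astype("category").memory_usage(deep=True)
print(as_category < as_strings)  # True

# Lexical order versus logical order
levels = pd.Series(["three", "one", "two"])
print(sorted(levels))  # ['one', 'three', 'two']

logical = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = levels.astype(logical)
print(ordered.sort_values().tolist())  # ['one', 'two', 'three']
```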
Object Creation

Categorical objects can be created in multiple ways, described below −

By specifying the dtype as "category" in pandas object creation.

import pandas as pd

s = pd.Series(["a","b","c","a"], dtype="category")
print(s)
Its output is as follows −
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]

The number of elements passed to the Series object is four, but
there are only three categories. Observe the same in the Categories
line of the output.

pd.Categorical

Using the standard pandas Categorical constructor, we can create
a category object.

pandas.Categorical(values, categories, ordered)

Let’s take an example −

import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]

Let’s have another example −

import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     categories=['c', 'b', 'a'])
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]
Here, the second argument signifies the categories. Thus, any
value which is not present in the categories will be treated
as NaN.

Now, take a look at the following example −

import pandas as pd

cat = pd.Categorical(['a','b','c','a','b','c','d'],
                     categories=['c', 'b', 'a'], ordered=True)
print(cat)
Its output is as follows −
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
Logically, the order means that, a is greater than b and b is
greater than c.
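The effect of the ordering can be inspected directly. This sketch reuses the same categories on a shorter, made-up list of values and assumes a reasonably recent pandas, where min() and max() skip missing values by default. The codes attribute holds each value's position in the category list, with -1 marking a value that is not among the categories.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "d"],
                     categories=["c", "b", "a"], ordered=True)

# position of each value in the category list; -1 marks the NaN for "d"
print(cat.codes.tolist())    # [2, 1, 0, -1]

# element-wise comparison follows the declared order c < b < a
print((cat > "b").tolist())  # [True, False, False, False]

# min and max also follow the declared order
print(cat.min(), cat.max())  # c a
```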
Description
Using the .describe() method on categorical data, we get
output similar to that of a Series or DataFrame of type string.
import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan],
                     categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})

print(df.describe())
print(df["cat"].describe())
Its output is as follows −
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Working with Missing Data in Pandas

Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in a Pandas
DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
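As a quick orientation, the functions listed above can be tried on a small hand-made Series. The values are illustrative only; note that None and np.nan are both reported as missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan, 4.0])

print(s.isnull().tolist())                 # [False, True, True, False]
print(s.notnull().sum())                   # 2
print(s.dropna().tolist())                 # [1.0, 4.0]
print(s.fillna(0).tolist())                # [1.0, 0.0, 0.0, 4.0]
print(s.fillna(0).replace(0, -1).tolist()) # [1.0, -1.0, -1.0, 4.0]
print(s.interpolate().tolist())            # [1.0, 2.0, 3.0, 4.0]
```

By default interpolate() fills gaps linearly between the neighbouring valid values, which is why the two missing entries become 2.0 and 3.0.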

Pandas isnull() and notnull() methods are used to check and manage NULL
values in a data frame.
Pandas DataFrame isnull() Method
Syntax: pd.isnull(obj) or DataFrame.isnull()
Parameters: Object to check null values for
Return Type: Boolean object of the same shape as the input (a Boolean DataFrame
for a DataFrame, a Boolean Series for a single column), True for NaN values
Example
In the following example, the Team column is checked for NULL values and
a Boolean Series is returned by the isnull() method, which stores True for every
NaN value and False for every non-null value.

Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for NaN values
bool_series = pd.isnull(data["Team"])

# filtering data
# displaying data only with team = NaN
data[bool_series]

Output:
In the output, only the rows where Team is NULL are
displayed.

Pandas DataFrame notnull() Method

Syntax: pd.notnull(obj) or DataFrame.notnull()
Parameters: Object to check null values for
Return Type: Boolean object of the same shape as the input, False for NaN values
Example
In the following example, the Gender column is checked for NULL values
and a Boolean Series is returned by the notnull() method, which stores True
for every non-null value and False for every null value.

Python3

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series False for NaN values
bool_series = pd.notnull(data["Gender"])

# displaying only the rows where Gender is not NaN
data[bool_series]

Output:
In the output, only the rows having some value in Gender
are displayed.
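Both checks can be tried without the CSV file. The frame below is a hypothetical stand-in for employees.csv with invented names and values; it filters on each Boolean Series and also counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv
data = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina"],
    "Gender": ["F", "M", None, "F"],
    "Team": ["Sales", np.nan, "Legal", np.nan],
})

# rows where Team is missing
missing_team = data[data["Team"].isnull()]
print(missing_team["Name"].tolist())  # ['Ben', 'Dina']

# rows where Gender is present
has_gender = data[data["Gender"].notnull()]
print(len(has_gender))                # 3

# number of missing values in each column
print(data.isnull().sum().to_dict())  # {'Name': 0, 'Gender': 1, 'Team': 2}
```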
Pandas dropna() method allows the user to analyze and drop
Rows/Columns with Null values in different ways.
Pandas DataFrame.dropna() Syntax
Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None,
subset=None, inplace=False)
Parameters:
• axis: takes an int or string value. 0 or 'index' drops rows; 1 or 'columns'
drops columns.
• how: takes one of two string values ('any' or 'all'). 'any' drops
the row/column if ANY value is null and 'all' drops it only if ALL values are
null.
• thresh: takes an integer giving the minimum number of non-null values a
row/column must contain to be kept; rows/columns with fewer are dropped.
• subset: a list of labels along the other axis that limits the null check to
those columns (when dropping rows) or rows (when dropping columns).
• inplace: a boolean which makes the changes in the data frame itself if True.


# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("nba.csv")

# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')

# comparing sizes of data frames
print("Old data frame length:", len(data),
      "\nNew data frame length:", len(new_data),
      "\nNumber of rows with at least 1 NA value: ",
      (len(data) - len(new_data)))

Output:
Old data frame length: 458
New data frame length: 364
Number of rows with at least 1 NA value: 94
Since the difference is 94, there were 94 rows that had at least 1 Null
value in any column.
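The how, thresh and subset parameters are easier to compare on a tiny frame than on nba.csv. The sketch below uses made-up data and prints the index labels of the rows that survive each call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, np.nan, 6.0, np.nan],
    "C": [9.0, np.nan, 11.0, 12.0],
})

# default how='any': keep only rows with no missing value
print(df.dropna().index.tolist())              # [2]

# how='all': drop only rows in which every value is missing
print(df.dropna(how="all").index.tolist())     # [0, 2, 3]

# thresh=2: keep rows with at least two non-null values
print(df.dropna(thresh=2).index.tolist())      # [0, 2]

# subset: look for missing values only in column C
print(df.dropna(subset=["C"]).index.tolist())  # [0, 2, 3]
```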

Python | Pandas DataFrame.fillna() to replace Null values in dataframe
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False,
limit=None, downcast=None, **kwargs)
Parameters:
• value: Scalar, dictionary, array, Series or DataFrame to fill in place of
NaN.
• method: Used if the user doesn’t pass any value. 'ffill' (or 'pad') propagates
the last valid value forward, while 'bfill' (or 'backfill') fills with the next
valid value. Recent pandas versions deprecate this parameter in favour of the
DataFrame.ffill() and DataFrame.bfill() methods.
• axis: takes an int or string value for rows/columns. Input can be 0 or 1 for
an integer and 'index' or 'columns' for a string.
• inplace: a boolean which makes the changes in the data frame itself if True.
• limit: an integer which specifies the maximum number of consecutive NaN
values to forward/backward fill.
• downcast: a dict which specifies which dtype to downcast to, such as
float64 to int64.
• **kwargs: any other keyword arguments.

Example #1: Replacing NaN values with a Static value. Before replacing:
Python3

# importing pandas module
import pandas as pd

# making data frame from csv file
nba = pd.read_csv("nba.csv")

nba
Output:

After replacing: In the following example, all the null values in the College
column have been replaced with the “No College” string. First, the data frame is
imported from the CSV file, then the College column is selected and the fillna()
method is used on it.

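Python3

# importing pandas module
import pandas as pd

# making data frame from csv file
nba = pd.read_csv("nba.csv")

# replacing na values in College with "No College";
# assigning the result back is more reliable than calling
# fillna(inplace=True) on a selected column, which recent
# pandas versions may not apply to the original data frame
nba["College"] = nba["College"].fillna("No College")

nba

Output: the College column no longer contains NaN; those entries now read “No College”.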
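Filling can also be done per column with a dictionary, or by carrying neighbouring values forward or backward. The following sketch uses an invented two-column frame rather than nba.csv, and the ffill() and bfill() methods that pandas provides as shortcuts for forward and backward filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "College": ["Duke", None, "Texas", None],
    "Salary": [100.0, np.nan, np.nan, 400.0],
})

# a dict gives each column its own replacement value
filled = df.fillna({"College": "No College", "Salary": 0})
print(filled["College"].tolist())  # ['Duke', 'No College', 'Texas', 'No College']
print(filled["Salary"].tolist())   # [100.0, 0.0, 0.0, 400.0]

# forward fill repeats the last valid value; backward fill uses the next one
print(df["Salary"].ffill().tolist())  # [100.0, 100.0, 100.0, 400.0]
print(df["Salary"].bfill().tolist())  # [100.0, 400.0, 400.0, 400.0]

# limit caps how many consecutive NaN values are filled
limited = df["Salary"].ffill(limit=1)
print(limited.isnull().tolist())      # [False, False, True, False]
```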

Python | Pandas dataframe.replace()

Pandas dataframe.replace() function is used to replace a string,
regex, list, dictionary, series, number, etc. from a Pandas
Dataframe in Python.

Syntax of dataframe.replace()

Syntax: DataFrame.replace(to_replace=None, value=None, inplace=False,
limit=None, regex=False, method='pad', axis=None)

Parameters:
• to_replace : [str, regex, list, dict, Series, numeric, or None] pattern that
we are trying to replace in the dataframe.
• value : Value to use to fill holes (e.g. 0), alternately a dict of values
specifying which value to use for each column (columns not in the dict will
not be filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
• inplace : If True, in place. Note: this will modify any other views on this
object (e.g. a column from a DataFrame). Returns the caller if this is
True.
• limit : Maximum size gap to forward or backward fill.
• regex : Whether to interpret to_replace and/or value as regular
expressions. If this is True then to_replace must be a string. Alternatively,
regex itself can be a regular expression or a list, dict, or array of regular
expressions, in which case to_replace must be None.
• method : Method to use for replacement when to_replace is a list.
Returns: filled : NDFrame

Example:

Here, we are replacing 49.50 with 60.

Python3

import pandas as pd

df = {
    "Array_1": [49.50, 70],
    "Array_2": [65.1, 49.50]
}

data = pd.DataFrame(df)

print(data.replace(49.50, 60))

Output:
Array_1 Array_2
0 60.0 65.1
1 70.0 60.0
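Beyond a single scalar, to_replace also accepts the list, dictionary and regular-expression forms described in the parameter list. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Boston Celtics", "N/A", "Utah Jazz"],
    "Score": [49.5, 70.0, -1.0],
})

# nested dict: replace -1.0 with 0.0 in the Score column only
scores = df.replace({"Score": {-1.0: 0.0}})
print(scores["Score"].tolist())  # [49.5, 70.0, 0.0]

# list form: each item in the first list maps to the item in the second
teams = df.replace(["N/A"], ["Unknown"])
print(teams["Team"].tolist())    # ['Boston Celtics', 'Unknown', 'Utah Jazz']

# regex form: strip a leading "Boston " from string values
short = df.replace(r"^Boston\s+", "", regex=True)
print(short["Team"].tolist())    # ['Celtics', 'N/A', 'Utah Jazz']
```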
