Steps For Effective Text Data Cleaning
Shivam Bansal, November 16, 2014
The days when one would get data in neatly tabulated spreadsheets are truly behind
us. A moment of silence for the data residing in spreadsheet pockets. Today,
more than 80% of the data is unstructured: it is either present in data silos or
scattered around digital archives. Data is being produced as we speak, from
every conversation we have on social media to every piece of content generated
by news sources. In order to produce any meaningful, actionable insight from
data, it is important to know how to work with it in its unstructured form. As a
Data Scientist at one of the fastest-growing decision-sciences firms, my bread
and butter comes from deriving meaningful insights from unstructured text data.
One of the first steps in working with text data is to pre-process it. It is an
essential step before the data is ready for analysis. The majority of available
text data is highly unstructured and noisy in nature; to achieve better insights
or to build better algorithms, it is necessary to work with clean data. Social
media data, for example, is highly unstructured: it is informal communication in
which typos, bad grammar, slang, and unwanted content like URLs, stop words,
and expressions are the usual suspects.
In this blog, I therefore discuss these possible noise elements and how you
can clean them step by step, with examples in Python.
As a typical business problem, assume you are interested in finding which
features of the iPhone are most popular among fans. You have a sample tweet
to work with:

Original tweet:
>> I luv my &lt;3 iphone &amp; youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
1. Escaping HTML characters: Data obtained from the web usually contains HTML
entities such as &lt;, &gt;, and &amp; embedded in the original text, and these
need to be converted back to their characters. One approach is to write specific
regular expressions; a simpler one is to use a standard module, such as Python 3's
html.

Snippet:
import html  # Python 3; in Python 2 use HTMLParser.HTMLParser().unescape
tweet = html.unescape(original_tweet)
Output:
>> I luv my <3 iphone & youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
2. Decoding data: Text may arrive in a mix of encodings (UTF-8, Latin-1, etc.),
so it should be decoded into a known, consistent representation before further
processing.

Snippet:
# Python 3: drop any characters that cannot be represented in ASCII
# (Python 2 equivalent: original_tweet.decode("utf8").encode("ascii", "ignore"))
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
Output:
>> I luv my <3 iphone & youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
3. Apostrophe lookup: To avoid word-sense ambiguity, apostrophe forms should
be expanded into their full words using a lookup table.

Snippet:
APOSTROPHES = {"'s": "is", "'re": "are"}  ## Need a huge dictionary
words = tweet.split()
reformed = [APOSTROPHES[word] if word in APOSTROPHES else word for word in words]
tweet = " ".join(reformed)
Outcome:
>> I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
7. Split attached words: Social text often glues words together, as in hashtags
and camel-case tokens like "DisplayIsAwesome"; such words can be separated
with simple rules and regular expressions.
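The code for this step did not survive in the post; a minimal sketch, assuming the goal is to split camel-case tokens with a regular expression (the pattern and the helper name `split_attached` are my reconstruction, not necessarily the author's original):

```python
import re

def split_attached(text):
    """Split a camel-case token such as 'DisplayIsAwesome' into
    its component words by cutting before each capital letter."""
    return " ".join(re.findall(r"[A-Z][^A-Z]*", text))

print(split_attached("DisplayIsAwesome"))  # Display Is Awesome
```

Note this simple pattern only handles capitalised boundaries; lowercase hashtag mashups would need a dictionary-based word segmenter.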
Outcome:
>> I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy http://www.apple.com
8. Slang lookup: Social media comments are full of slang words ("luv" for
"love", "awsm" for "awesome") that should be translated into standard words,
again using a dictionary lookup.

Snippet:
tweet = _slang_lookup(tweet)  # custom helper backed by a slang dictionary
Outcome:
>> I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy http://www.apple.com
9. Standardizing words: Sometimes words are stretched out of their standard
spellings, as in "sooo happppppy"; repeated characters can be collapsed with
simple rules and regular expressions.
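The snippet for this step is also missing from the post; a minimal sketch, assuming the idea is to collapse any character repeated three or more times down to two (a dictionary or spell-checker is still needed afterwards to turn "soo" into "so", as in the outcome below):

```python
import re

def standardize(text):
    """Collapse runs of 3+ identical characters to 2:
    'sooo happppppy' -> 'soo happy'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(standardize("sooo happppppy"))  # soo happy
```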
Outcome:
>> I love my <3 iphone & you are awesome apple. Display Is Awesome, so happy http://www.apple.com
10. Removal of URLs: URLs and hyperlinks in text data like comments,
reviews, and tweets should be removed.
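The post gives no snippet for this step; a minimal sketch, assuming a simple regex is acceptable (it catches http/https and www-style links, not every possible URL form):

```python
import re

def remove_urls(text):
    """Strip http(s):// and www. style links, then tidy up whitespace."""
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_urls("I love my iphone http://www.apple.com"))  # I love my iphone
```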
End Notes:
Hope you found this article helpful. These are some of the tips and tricks I
have learnt while working with a lot of text data. If you follow the above steps
to clean the data, you can drastically improve the accuracy of your results and
draw better insights. Do share your views/doubts in the comments section and I
would be happy to participate.