Steps For Effective Text Data Cleaning
Shivam Bansal, November 16, 2014
The days when one would get data in neatly tabulated spreadsheets are truly behind
us. A moment of silence for the data residing in spreadsheet pockets. Today,
more than 80% of the data is unstructured: it is either present in data silos or
scattered around digital archives. Data is being produced as we speak, from
every conversation we have on social media to every piece of content generated
by news sources. In order to produce any meaningful, actionable insight from
data, it is important to know how to work with it in its unstructured form. As a
Data Scientist at one of the fastest-growing decision-sciences firms, my bread
and butter comes from deriving meaningful insights from unstructured text data.
One of the first steps in working with text data is to pre-process it. It is an
essential step before the data is ready for analysis. The majority of available
text data is highly unstructured and noisy in nature; to achieve better insights
or to build better algorithms, it is necessary to work with clean data. Social
media data, for example, is highly unstructured: it is informal communication in
which typos, bad grammar, slang, and unwanted content like URLs, stop words,
and expressions are the usual suspects.
In this blog, I therefore discuss these possible noise elements and how you
can clean them step by step, with examples in Python.
As a typical business problem, assume you are interested in finding which
features of the iPhone are most popular among fans. You have a sample tweet
to work with:

Original tweet:
>> I luv my &lt;3 iphone &amp; youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
1. Escaping HTML characters: Data obtained from the web usually contains HTML
entities such as &lt;, &gt;, and &amp; embedded in the original text, and these
need to be converted back to their characters. One approach is to write specific
regular expressions; a simpler one is to use a standard module, such as Python 3's
html.

Snippet:
import html  # Python 3; in Python 2 use HTMLParser.HTMLParser().unescape
tweet = html.unescape(original_tweet)
Output:
>> I luv my <3 iphone & youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
2. Decoding data: Text may arrive in a mix of encodings (UTF-8, Latin-1, etc.),
so it should be decoded into a known, consistent representation before further
processing.

Snippet:
# Python 3: drop any characters that cannot be represented in ASCII
# (Python 2 equivalent: original_tweet.decode("utf8").encode("ascii", "ignore"))
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
Output:
>> I luv my <3 iphone & youre awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
3. Apostrophe lookup: To avoid word-sense ambiguity, apostrophe forms should
be expanded into their full words using a lookup table.

Snippet:
APOSTROPHES = {"'s": "is", "'re": "are"}  ## Need a huge dictionary
words = tweet.split()
reformed = [APOSTROPHES[word] if word in APOSTROPHES else word for word in words]
tweet = " ".join(reformed)
Outcome:
>> I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com
7. Split attached words: Social text often glues words together, as in hashtags
and camel-case tokens like "DisplayIsAwesome"; such words can be separated
with simple rules and regular expressions.
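The code for this step did not survive in the post; a minimal sketch, assuming the goal is to split camel-case tokens with a regular expression (the pattern and the helper name `split_attached` are my reconstruction, not necessarily the author's original):

```python
import re

def split_attached(text):
    """Split a camel-case token such as 'DisplayIsAwesome' into
    its component words by cutting before each capital letter."""
    return " ".join(re.findall(r"[A-Z][^A-Z]*", text))

print(split_attached("DisplayIsAwesome"))  # Display Is Awesome
```

Note this simple pattern only handles capitalised boundaries; lowercase hashtag mashups would need a dictionary-based word segmenter.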
Outcome:
>> I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy http://www.apple.com
8. Slang lookup: Social media comments are full of slang words ("luv" for
"love", "awsm" for "awesome") that should be translated into standard words,
again using a dictionary lookup.

Snippet:
tweet = _slang_lookup(tweet)  # custom helper backed by a slang dictionary
Outcome:
>> I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy http://www.apple.com
9. Standardizing words: Sometimes words are stretched out of their standard
spellings, as in "sooo happppppy"; repeated characters can be collapsed with
simple rules and regular expressions.
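The snippet for this step is also missing from the post; a minimal sketch, assuming the idea is to collapse any character repeated three or more times down to two (a dictionary or spell-checker is still needed afterwards to turn "soo" into "so", as in the outcome below):

```python
import re

def standardize(text):
    """Collapse runs of 3+ identical characters to 2:
    'sooo happppppy' -> 'soo happy'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(standardize("sooo happppppy"))  # soo happy
```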
Outcome:
>> I love my <3 iphone & you are awesome apple. Display Is Awesome, so happy http://www.apple.com
10. Removal of URLs: URLs and hyperlinks in text data like comments,
reviews, and tweets should be removed.
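The post gives no snippet for this step; a minimal sketch, assuming a simple regex is acceptable (it catches http/https and www-style links, not every possible URL form):

```python
import re

def remove_urls(text):
    """Strip http(s):// and www. style links, then tidy up whitespace."""
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_urls("I love my iphone http://www.apple.com"))  # I love my iphone
```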
End Notes:
Hope you found this article helpful. These are some of the tips and tricks I
have learnt while working with a lot of text data. If you follow the above steps
to clean the data, you can drastically improve the accuracy of your results and
draw better insights. Do share your views/doubts in the comments section and I
would be happy to participate.