Fake News Detection: Text Pre-Processing
With the explosion of online fake news and disinformation, it is increasingly difficult to discern fact from fiction. And as machine learning and natural language processing become more popular, fake news detection serves as a great introduction to NLP.
Google Cloud Natural Language is a great platform to use for this project. Simply upload a dataset, train the model, and use it to classify new articles.
But before we download a Kaggle dataset and get cracking on Google Cloud, it’s in our best interest to pre-process the dataset.
What is pre-processing?
To pre-process text simply means to bring it into a form that is predictable and analyzable for your task.
The goal of pre-processing is to remove noise. By removing unnecessary features from our text, we can reduce complexity and increase predictability (i.e. our model trains faster and performs better). Removing punctuation, special characters, and ‘filler’ words (the, a, etc.) does not drastically change the meaning of a text.
Approach
There are many types of text pre-processing, and their approaches vary. We will cover the following:
- Lowercase Text
- URL Removal
- Contraction Splitting
- Tokenization
- Stemming
- Lemmatization
- Stop Word Removal
We’ll be using Python due to the availability and power of its data analysis and NLP libraries. Feel free to fire up a Jupyter notebook and code along!
Lowercase & URL Removal
Before we start any of the pre-processing heavy lifting, we want to convert our text to lowercase and remove any URLs. A simple regular expression can handle this for us.
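Something like the following works; the function name and sample string are just for illustration, and any URL-matching pattern of your choosing will do:

```python
import re

def lowercase_and_strip_urls(text):
    # Lowercase first so every later step sees a consistent form.
    text = text.lower()
    # Drop anything that looks like a URL (http://, https://, or www.-style links).
    return re.sub(r"https?://\S+|www\.\S+", "", text)

print(lowercase_and_strip_urls("BREAKING: Read more at https://example.com/story NOW!"))
# breaking: read more at  now!
```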
Split Contractions
Similar to URLs, contractions can produce unintended results if left alone. The aptly named contractions Python library to the rescue! It looks for contractions and expands them into their root words.
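A quick sketch of how that looks (the sample sentence is made up):

```python
import contractions  # pip install contractions

text = "they've said he won't go because it's raining"
print(contractions.fix(text))
# they have said he will not go because it is raining
```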
Tokenization
At its simplest, tokenization is splitting text into pieces.
Tokenization breaks raw text into smaller chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing the model, since the meaning of a text can be interpreted by analyzing the sequence of its tokens.
There are a multitude of ways to implement tokenization, and their approaches vary. For our project we use RegexpTokenizer from the NLTK library. Using regular expressions, RegexpTokenizer will match either tokens or separators (i.e. include or exclude regex matches).
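For example, a tokenizer that matches runs of alphanumeric characters (the sample string is my own):

```python
from nltk.tokenize import RegexpTokenizer

# Match tokens as runs of alphanumeric characters; everything else acts as a separator.
tokenizer = RegexpTokenizer(r"\w+")

text = "this stock gained $500 in one day!"
print(tokenizer.tokenize(text))
# ['this', 'stock', 'gained', '500', 'in', 'one', 'day']
```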
Our string input is split by grouping alphanumeric characters. Notice the “$” and “!” characters do not appear in our tokenized list.
Stemming
Text normalization is the process of simplifying multiple variations or tenses of the same word. Stemming and lemmatization are two methods of text normalization, the former being the simpler of the two. To stem a word, we simply remove its suffix to reduce it to its root.
As an example, “jumping”, “jumps”, and “jumped” all are stemmed to “jump.”
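Here’s a quick check using NLTK’s PorterStemmer (one common choice; the stemmed outputs quoted later in this article are consistent with it):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumping", "jumps", "jumped"]:
    print(word, "->", stemmer.stem(word))
# jumping -> jump
# jumps -> jump
# jumped -> jump
```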
Stemming is not without its faults, however. We can run into the issue of overstemming. Overstemming is when words with different meanings are stemmed to the same root — a false positive.
Understemming is also a concern. This is when words that should stem to the same root do not — a false negative. Both failure modes are illustrated below.
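A sketch of both failure modes with the same Porter stemmer; these are standard textbook examples rather than ones from the original article:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: words with different meanings collapse to one root (a false positive).
print([stemmer.stem(w) for w in ["universal", "university", "universe"]])
# ['univers', 'univers', 'univers']

# Understemming: related forms fail to reach a common root (a false negative).
print([stemmer.stem(w) for w in ["alumnus", "alumni", "alumnae"]])
# ['alumnu', 'alumni', 'alumna']
```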
Let’s take a look at a more nuanced approach to text normalization, lemmatization.
Lemmatization
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
Lemmatization differs from stemming in that it determines a word’s part of speech by looking at surrounding words for context. For this example we use nltk.pos_tag to assign parts of speech to tokens. We then pass each token and its assigned tag into WordNetLemmatizer, which decides how to lemmatize the token.
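A minimal sketch of that approach; the helper names and sample sentence are mine, and the tag mapping follows the usual Penn Treebank to WordNet convention:

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads for the tokenizer, tagger, and WordNet data
# (newer NLTK versions may also ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng').
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag from pos_tag to the constant WordNetLemmatizer expects."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun, as the lemmatizer itself does

lemmatizer = WordNetLemmatizer()

def lemmatize_sentence(text):
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in pos_tag(tokens)]

print(lemmatize_sentence("the cats were running quickly"))
# ['the', 'cat', 'be', 'run', 'quickly']
```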
Using a short sample sentence, we can compare the results of our two approaches, stemming and lemmatization.
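The article’s original snippet isn’t reproduced here, so the sentence below is a hypothetical stand-in that happens to contain the words discussed next:

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Assumes the NLTK data downloaded in the previous snippet is available.
text = "the soldiers showed great bravery against their enemies"

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
tag_map = {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}

for token, tag in pos_tag(word_tokenize(text)):
    pos = tag_map.get(tag[0], wordnet.NOUN)
    print(f"{token:10} stem: {stemmer.stem(token):10} lemma: {lemmatizer.lemmatize(token, pos)}")
```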
Notice that ‘enemies’ was stemmed to ‘enemi’ but lemmatized to ‘enemy’. Interestingly, ‘bravery’ was stemmed to ‘braveri’ but the lemmatizer did not change the original word. In general, lemmatization is more precise, but at the expense of complexity.
Stop Word Removal
Let’s start this one off with an example. What comes to your mind when you read the following?
“quick brown fox lazy dog”
Hopefully you read that and thought of the common English pangram:
“the quick brown fox jumps over the lazy dog”
If you got it, you didn’t need the missing words to know what I was referencing. “The” doesn’t add any meaning to the sentence, and you had enough context that “jumps” and “over” weren’t necessary. In essence, I removed the stop words.
Stop words are words that do not add meaning to a sentence, and their removal will not affect the processing of the text for our purposes. They are removed from the vocabulary to reduce noise and to reduce the dimension of the feature set.
Again the NLTK library comes in clutch here. We can download a set of English words and stop words and compare that against our input tokens (see tokenization). For the purpose of our fake news detector, we return tokens that are English but aren’t stop words.
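A sketch of that filter using NLTK’s stopwords and words corpora; the token list here is a toy example of already-lemmatized tokens:

```python
import nltk
from nltk.corpus import stopwords, words

nltk.download("stopwords")
nltk.download("words")

english_vocab = set(words.words())
stop_words = set(stopwords.words("english"))

tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]

# Keep tokens that look like English words but are not stop words.
filtered = [tok for tok in tokens if tok in english_vocab and tok not in stop_words]
print(filtered)
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```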
All Together Now
Combining all our steps together (minus stemming), let’s compare the before and after. We’ll use the following Onion article as our input. If you want to try it for yourself, check out this Google Colab notebook.
LAKEWOOD, OH — Following a custom born out of cooperation and respect, local drivers reportedly pulled over to the side of the road Friday to let a pizza delivery guy through. “Gee, I hope it’s nothing serious like a big, hungry party,” said 48-year-old Rosanna Tuttle, who was just one of the dozens of drivers who quickly moved to the shoulder of the road after catching sight of the speeding pizza-delivery vehicle swerving through traffic in the rearview mirror. “It’s honestly just a reflex. Sure, it slows everyone down, but wouldn’t you want others to pull over for you if that was your pizza in there? I don’t care if I’m late; I just hope that pizza is okay. Let’s pray they get there safe.” At press time, drivers at the scene had stopped their cars again to rubberneck as the delivery guy rushed into an apartment building carrying a large stack of pizzas and mozzarella sticks.
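Putting the steps together, here is a sketch of what the combined pipeline might look like. The function and variable names are mine, it assumes the NLTK data from the earlier snippets has been downloaded, and the exact output can differ slightly from the result quoted below depending on library versions:

```python
import re
import contractions
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet, words
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
lemmatizer = WordNetLemmatizer()
english_vocab = set(words.words())
stop_words = set(stopwords.words("english"))
tag_map = {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}

def preprocess(text):
    # Lowercase, strip URLs, and expand contractions.
    text = contractions.fix(re.sub(r"https?://\S+|www\.\S+", "", text.lower()))
    # Tokenize on alphanumeric runs, dropping punctuation and special characters.
    tokens = tokenizer.tokenize(text)
    # Lemmatize each token according to its part-of-speech tag.
    lemmas = [lemmatizer.lemmatize(tok, tag_map.get(tag[0], wordnet.NOUN))
              for tok, tag in pos_tag(tokens)]
    # Keep English words that are not stop words.
    return [lem for lem in lemmas if lem in english_vocab and lem not in stop_words]

article = "..."  # paste the Onion article quoted above
print(" ".join(preprocess(article)))
```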
Running the pipeline on the article returns the following.
oh follow custom bear respect local driver reportedly pull side road let pizza delivery guy gee hope nothing serious like big hungry party say year old one dozen driver quickly move shoulder road catch sight speed pizza delivery vehicle swerve traffic mirror honestly reflex sure slow everyone would want pull pizza care I late hope pizza let us pray get safe press time driver scene stop car rubberneck delivery guy rush apartment building carry large stack pizza stick
That’s A Wrap
Natural Language Processing is a powerful tool to tackle some really interesting questions. Text pre-processing is an integral step in the process, helping us find the signal through the noise. “Garbage in, garbage out,” as they say. I hope this article answered some questions and raised a few more — this is just the tip of the iceberg!
Sources
https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4
https://simonhessner.de/lemmatize-whole-sentences-with-python-and-nltks-wordnetlemmatizer/
https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html