Fake News Detector: NLP Project

Ishant Juyal · Published in Level Up Coding · Jun 5, 2020


In this post, we will walk through how to build an NLP classifier that detects whether a piece of news is real or fake.

Nowadays, fake news has become a common trend. Even trusted media houses have been known to spread fake news and are losing their credibility. So, how can we tell whether a given piece of news is real or fake?

In this project, I have built a classifier model that can identify news as real or fake. For this purpose, I have used data from Kaggle, but you can build the same model on any dataset by following the same methods.

Dataset

Kaggle Data
train.csv: A full training dataset with the following attributes:

  • id: unique id for a news article
  • title: the title of a news article
  • author: author of the news article
  • text: the text of the article; could be incomplete
  • label: a label that marks the article as potentially unreliable, where 1 means unreliable and 0 means reliable.

Reading the data

import pandas as pd
train = pd.read_csv('train.csv')
train.head()
Here’s what the training data looks like:
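Before combining features, it is worth running a couple of quick sanity checks on the raw data (an optional step, not part of the original walkthrough): how many rows there are, which columns contain null values, and how the labels are distributed.

# Optional sanity checks on the raw data
print(train.shape)                    # number of rows and columns
print(train.isnull().sum())           # null values per column, which motivates the fillna below
print(train['label'].value_counts())  # how many reliable (0) vs. unreliable (1) articles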

We can see that the features ‘title’, ‘author’ and ‘text’ are the important ones, and all of them are in text form. So, we can combine these features into a single feature, which we will use to train the model. Let’s call this feature ‘total’.

# First, fill all the null values with a space
train = train.fillna(' ')
train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']
After adding the column ‘total’, the data looks like this:

Pre-processing/ Cleaning the Data

For preprocessing the data, we will need some libraries.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Note: you may need to run nltk.download('stopwords'), nltk.download('punkt') and nltk.download('wordnet') first.

The uses of all these libraries are explained below.

Stopwords: Stop words are common words that appear many times in a text but contribute little to the model’s understanding of it.
We don’t want these words in our data, so we remove them.

All these stopwords are stored in the nltk library for different languages.

stop_words = stopwords.words('english')
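As a quick illustration (the sentence below is just a made-up example), removing stopwords keeps only the words that carry meaning:

sample = "this is an example of a sentence with many stop words"
print([w for w in sample.split() if w not in stop_words])
# e.g. ['example', 'sentence', 'many', 'stop', 'words']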

Tokenization: Word tokenization is the process of splitting a large sample of text into words.
For example:

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

It will convert the string word_data into this:
[‘It’, ‘originated’, ‘from’, ‘the’, ‘idea’, ‘that’, ‘there’, ‘are’, ‘readers’, ‘who’, ‘prefer’, ‘learning’, ‘new’, ‘skills’, ‘from’, ‘the’, ‘comforts’, ‘of’, ‘their’, ‘drawing’, ‘rooms’]

Lemmatization: Lemmatization is the process of grouping together the different inflected forms of the same root word so that they can be analysed as a single item.
Examples of lemmatization:
swimming → swim
rocks → rock
better → good

lemmatizer = WordNetLemmatizer()
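A quick check of what the lemmatizer actually does (note that WordNetLemmatizer treats every word as a noun by default, so a mapping like better → good only happens if you pass the adjective part-of-speech tag):

print(lemmatizer.lemmatize('rocks'))              # rock
print(lemmatizer.lemmatize('swimming', pos='v'))  # swim
print(lemmatizer.lemmatize('better', pos='a'))    # good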

The code below lemmatizes our training data and removes stopwords at the same time.

for index, row in train.iterrows():
    filter_sentence = ''
    sentence = row['total']
    # Cleaning the sentence with regex (removes punctuation)
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopwords removal
    words = [w for w in words if w not in stop_words]
    # Lemmatization
    for word in words:
        filter_sentence = filter_sentence + ' ' + str(lemmatizer.lemmatize(word)).lower()
    train.loc[index, 'total'] = filter_sentence
train = train[['total', 'label']]
This is what the data looks like after pre-processing:
X_train = train['total']
Y_train = train['label']

Finally, we have pre-processed the data, but it is still in text form, and we cannot feed text directly to a machine learning model. We need numbers for that. How do we solve this problem? The answer is vectorizers.

Vectorizer

For converting this text data into numerical data, we will use two vectorizers.

  1. Count Vectorizer
    In order to use textual data for predictive modelling, the text must first be split into words (tokens), and those tokens then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization). CountVectorizer does this by counting how many times each word occurs in each document.
  2. TF-IDF Vectorizer
    TF-IDF stands for Term Frequency-Inverse Document Frequency. It is one of the most important techniques in information retrieval and represents how important a specific word or phrase is to a given document. A small toy example follows this list.
    Read more about this here.
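To build some intuition first, here is a toy example (not part of the pipeline) showing that words appearing in every document get low TF-IDF weights, while words specific to a single document get high weights:

from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the match was great", "the election was rigged", "the match was boring"]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)
# 'the' and 'was' appear in every document, so their weights are low;
# 'election' and 'rigged' appear in only one document, so their weights are high.
print(vec.get_feature_names_out())
print(weights.toarray().round(2))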
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Count how often each word occurs in each article
count_vectorizer = CountVectorizer()
freq_term_matrix = count_vectorizer.fit_transform(X_train)
# Re-weight the raw counts with TF-IDF (L2-normalised)
tfidf = TfidfTransformer(norm="l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)

The code written above will provide you with a matrix representing your text. It will be a sparse matrix with a large number of elements, stored in Compressed Sparse Row format.
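You can verify this by inspecting the matrix (the exact numbers will depend on your data). Note that TfidfVectorizer, imported above but not used here, combines CountVectorizer and TfidfTransformer into a single step; the two-step version keeps the intermediate count matrix explicit.

print(type(tf_idf_matrix))   # a SciPy sparse matrix in CSR format
print(tf_idf_matrix.shape)   # (number of articles, vocabulary size)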

Modelling

Now, we have to decide which classification model will work best for our problem.
First, we will split the data into a training set and a test set, then train each model and measure how accurate it is.

from sklearn.model_selection import train_test_split
# Hold out part of the data for evaluation (the default split is 75% train / 25% test)
X_train, X_test, y_train, y_test = train_test_split(tf_idf_matrix, Y_train, random_state=0)

We will implement three models here and compare their performance.

  1. Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Accuracy = logreg.score(X_test, y_test)

2. Naive Bayes

from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(X_train, y_train)
Accuracy = NB.score(X_test, y_test)

3. Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Accuracy = clf.score(X_test, y_test)

Performance

Performance of the models on the test set

As we can see, the decision tree classifier performed the best on the test set, with an accuracy of about 97%.

I also tested the model on a different test set that was not part of the training data, and it gave an accuracy of 96.98%, which is pretty good.
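For reference, scoring on a separate labelled file would look roughly like the sketch below. It assumes a test.csv with the same columns plus a label column, and that the new text goes through the same cleaning loop and through the vectorizers already fitted on the training data.

test = pd.read_csv('test.csv').fillna(' ')
test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']
# ... apply the same cleaning/lemmatization loop to test['total'] as above ...
test_counts = count_vectorizer.transform(test['total'])  # reuse the fitted CountVectorizer
test_tfidf = tfidf.transform(test_counts)                # reuse the fitted TfidfTransformer
print(clf.score(test_tfidf, test['label']))              # accuracy on the held-out file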

Thanks for reading. I hope you liked my article and found it helpful. If you have any questions or suggestions, feel free to write them down in the comments section. You can connect with me on LinkedIn here: Ishant Juyal
