Tweet Analysis for COVID-19 Fake News Detection

Srishti Sahni · Published in Level Up Coding · 5 min read · Dec 20, 2020

By no means is Fake News an alien concept; it has been around since man learned to communicate. But the invention of the Internet and Social Media has added another dimension to it, making it more prominent and influential than ever. An example of its influence can be found in the fake-news outbreak that followed the COVID-19 pandemic. The world stopped, and people were at home for the first time in generations, with the internet as their primary source of information, a responsibility it wasn’t ready for. The internet was soon flooded with misinformation about the lockdowns, potential treatments, and hospital-bed availability. These posts did more harm than good and managed to create a state of collective panic worldwide.

Twitter is a widely used social media platform and, hence, played a significant part in the fake news outbreak. The following article compares various classification techniques for the segregation of “fake” news posts from the real ones on Twitter during the COVID-19 pandemic.

Tweet Analysis and Preprocessing

The data-set consisted of 6,420 annotated tweets for the Training Data and 2,140 tweets for the Test Data. It was almost equally divided between the two classes, and no class imbalance was observed.

Count Plot against the Class Labels for the Training Data-set

An analysis of the word frequencies within the data-set showed that the “real” and “fake” labels each had a set of words specific to that label, while several words were shared between the two classes. These common words were judged to be of no significance to the classification outcome.

Word Clouds for collective data and “Real” and “Fake” labels

It was observed that words like “coronavirus”, “covid”, “covid19”, “new”, “people”, “death”, “state”, and “one” were common to both labels, and they were therefore removed from the data-set along with stopwords, rare words, punctuation marks, and hyperlinks.

Pre-processing steps
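A minimal sketch of these cleaning steps in Python. The stopword list below is abbreviated for illustration (the actual pipeline used a full stopword list and also dropped rare words), while the common-word set is the one identified in the frequency analysis above:

```python
import re
import string

# Abbreviated stopword list for illustration; the real pipeline used a full one.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for"}

# Label-common words identified during the word-frequency analysis.
COMMON_WORDS = {"coronavirus", "covid", "covid19", "new", "people",
                "death", "state", "one"}

def clean_tweet(tweet: str) -> str:
    """Lower-case, strip hyperlinks and punctuation, drop stop/common words."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)          # hyperlinks
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in tweet.split()
              if t not in STOPWORDS and t not in COMMON_WORDS]
    return " ".join(tokens)

print(clean_tweet("New COVID19 cases in the state https://t.co/x"))  # → cases
```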

Baseline Model Performance

The cleaned tweets from the data-set were tokenized and fed into baseline classification models after Count-Vectorization. The Passive-Aggressive Classifier performed better than any other classification model, with SVM a close second.

The results for the baseline classifiers are as follows:
1. Passive-Aggressive Classifier: 94.39%
2. SVM: 94.31%
3. Logistic Regression: 92.24%
4. Naive Bayes: 91.92%
5. Decision Tree: 85.03%
6. KNN: 69.26%

With a maximum accuracy of 94.39%, it can be concluded that Count Vectorization on its own is quite effective for Natural Language Processing when paired with a robust classifier like the Passive-Aggressive Classifier.
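The baseline pipeline can be sketched with scikit-learn; the four example tweets below are illustrative stand-ins for the cleaned data-set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the cleaned tweets; the real training set had 6,420 rows.
tweets = ["vaccine trial shows promising results",
          "miracle cure kills virus overnight",
          "hospital reports icu capacity figures",
          "drinking bleach prevents infection"]
labels = ["real", "fake", "real", "fake"]

# Count-Vectorization turns each tweet into token counts, which are then
# fed into the Passive-Aggressive Classifier.
model = make_pipeline(CountVectorizer(),
                      PassiveAggressiveClassifier(max_iter=1000))
model.fit(tweets, labels)

print(model.predict(["miracle cure overnight"]))
```

The same pipeline works for the other baselines by swapping the final estimator (e.g. `LinearSVC`, `LogisticRegression`, `MultinomialNB`).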

Deep Learning Models: RNN and BERT

Along with the baseline Machine Learning models, three Deep Learning models were constructed for tweet classification: an RNN model with GloVe embeddings, an LSTM RNN model with a trainable embedding layer, and a fine-tuned pre-trained BERT model.

The RNN with GloVe embeddings used embedding vectors of size 50, transforming each tweet into a sequence of GloVe vectors. These vectors were fed into an RNN model, which was trained for 20 epochs to avoid overfitting. The model achieved an accuracy of 88.27%, performing better than classifiers like Decision Tree and KNN but falling short of the other baseline models.
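A sketch of this architecture in Keras, with a random matrix standing in for the real GloVe vectors (in practice the rows come from a file such as glove.6B.50d.txt) and assumed values for the vocabulary size and padded tweet length:

```python
import numpy as np
import tensorflow as tf

VOCAB, DIM, SEQ_LEN = 5000, 50, 40  # assumed vocab size and padded tweet length

# Stand-in for the 50-d GloVe matrix; real rows are loaded from the GloVe file.
glove_matrix = np.random.normal(size=(VOCAB, DIM)).astype("float32")

embedding = tf.keras.layers.Embedding(VOCAB, DIM, trainable=False)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,)),
    embedding,                                       # frozen GloVe vectors
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # real vs. fake
])
embedding.set_weights([glove_matrix])                # load pre-trained vectors
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_train, train_labels, epochs=20)   # trained for 20 epochs
print(model.output_shape)
```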

Accuracy v/s Epoch for RNN with GloVe

The RNN model with trainable embeddings had an extra embedding layer that transformed the words into vectors of size 64. The remaining architecture was similar to the one with GloVe embeddings: 2 Bidirectional LSTM layers followed by 2 Dense layers, with the final layer having an output dimension of one. The RNN with trainable embeddings displayed an accuracy of 91.88% and outperformed the GloVe model by 3.61 percentage points, which is definitely an improvement but is still below the baseline.
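The trainable-embedding variant, sketched in Keras under the same assumptions about vocabulary size and sequence length (the hidden-layer widths here are illustrative):

```python
import tensorflow as tf

VOCAB, SEQ_LEN = 5000, 40  # assumed vocab size and padded tweet length

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB, 64),            # trainable embedding layer
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output dimension of one
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)
```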

Train and Test Performance of RNN with Trainable Embeddings

BERT is a powerful pre-trained NLP model that can be fine-tuned for text classification by adding a classification architecture on top of the existing model. A 2-layered classification architecture was added on top of the pre-trained BERT model made available via the Hugging Face library for tweet classification. As expected, the BERT model outperformed all the baseline models with an accuracy of 97.39%.
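A sketch of this setup with the Hugging Face transformers library. To keep the example self-contained, a tiny randomly-initialised BERT configuration stands in for the pre-trained weights that the real model loads via `from_pretrained("bert-base-uncased")`; the head widths are illustrative:

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random-init BERT so the sketch runs without downloads; the real model
# used pre-trained weights from the Hugging Face hub.
config = BertConfig(hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128)

class BertTweetClassifier(torch.nn.Module):
    """BERT encoder with a 2-layered classification head on top."""
    def __init__(self, cfg):
        super().__init__()
        self.bert = BertModel(cfg)
        self.head = torch.nn.Sequential(      # 2-layered classification head
            torch.nn.Linear(cfg.hidden_size, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 2),           # "real" vs. "fake" logits
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)

model = BertTweetClassifier(config)
logits = model(torch.randint(0, config.vocab_size, (1, 16)))
print(logits.shape)  # torch.Size([1, 2])
```

Fine-tuning then trains both the encoder and the head on the labelled tweets with a standard cross-entropy loss.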

Validation Accuracy v/s Epoch for BERT

Conclusion

The tweet data was classified as “real” and “fake” news using six baseline machine learning models and three deep learning models. A few robust baseline models, namely the Passive-Aggressive Classifier, SVM, and Logistic Regression, outperformed both the RNN models. Another significant observation was that the RNN with trainable embeddings outperformed the one with GloVe embeddings, making it evident that fitting the embeddings to the problem at hand plays a substantial part in NLP classification.
On the other hand, a pre-trained model like BERT, fine-tuned to fit the particular data-set, outperformed every other model. Hence, it is safe to conclude that BERT is so far the best model for COVID-19 Fake News detection, with an accuracy of 97.39%.

The project was conducted by Srishti Sahni, Atul Rawat, and Vijay Ponnaganti, with equal contributions by the three members, under the guidance of our professor Tanmoy Chakraborty:
1. Srishti Sahni and Atul Rawat performed the Exploratory Data Analysis and preprocessed the data.
2. Vijay Ponnaganti, Atul Rawat, and Srishti Sahni constructed the baseline machine learning models for classification.
3. Vijay Ponnaganti created the RNN model with GloVe embeddings.
4. Srishti Sahni created the RNN model with trainable embeddings.
5. Atul Rawat fine-tuned the BERT model.

The project was a submission for the CONSTRAINT’21 workshop shared challenge. The links to the data-set can be found on the Competition Page on CodaLab.
