Analyzing Donald Trump and Joe Biden Tweets using Natural Language Processing

Orly Esteban
Published in Level Up Coding
10 min read · Jun 23, 2020


In five months, US voters will head to the polls to choose their next President. The presumptive nominees, Donald Trump and Joe Biden, have been engaging with the public on social media platforms like Twitter to promote their political agendas and win votes. This post analyzes the tweets of Trump and Biden using Natural Language Processing (NLP) techniques.

The Python code notebooks I used in this post are available here and here.

Tweets can be downloaded through the Twitter API as long as you have the necessary security keys. You can apply for access at https://developer.twitter.com/en/apply-for-access to get your own keys. Once you receive the keys, plug them into this code and run it in Python to download the tweets of Donald Trump and Joe Biden.

import twitter
import json

# Note: Go to https://developer.twitter.com/en/apply-for-access to get your own secret keys.
api = twitter.Api(consumer_key='---',
                  consumer_secret='---',
                  access_token_key='---',
                  access_token_secret='---',
                  cache=None, tweet_mode='extended')
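The snippet above only creates the API client. As a rough sketch of the actual download step, something along these lines should work (note: fetch_all_tweets is my own helper, and it assumes the python-twitter library's GetUserTimeline call and the Status.full_text attribute behave as documented; the API caps each request at 200 tweets and returns roughly 3,200 per user in total):

def fetch_all_tweets(screen_name):
    # First page of up to 200 tweets
    all_tweets = api.GetUserTimeline(screen_name=screen_name, count=200)
    while all_tweets:
        # Page backwards: request tweets older than the last one we have
        batch = api.GetUserTimeline(screen_name=screen_name, count=200,
                                    max_id=all_tweets[-1].id - 1)
        if not batch:
            break
        all_tweets.extend(batch)
    return [t.full_text for t in all_tweets]

lst_donald_trump_tweets = fetch_all_tweets('realDonaldTrump')
lst_joe_biden_tweets = fetch_all_tweets('JoeBiden')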

I ran the code on June 19, 2020 and was able to retrieve around 3200 tweets posted by Donald Trump and Joe Biden. I uploaded their tweets here and here, and their Python list object pickles here and here.

Let’s load their tweets and display the first 500 characters:

import requests

candidates = ['DonaldTrump', 'JoeBiden']
data = {}
for i, c in enumerate(candidates):
    url = "https://raw.githubusercontent.com/gomachinelearning/Blogs/master/" + c + "Tweets.txt"
    req = requests.get(url)
    data[c] = req.text

# Check the sizes: count the number of characters
print("Verify the dictionary variables are not empty. Print total number of characters in the variables:\n")
print("Donald Trump: {} , Joe Biden: {}".format(len(data['DonaldTrump']), len(data['JoeBiden'])))

def print_first_n_characters(n):
    if n == -1:
        print('Printing full tweets of each candidate \n')
    else:
        print('\n\nPrinting the first {} characters of tweets of each candidate \n'.format(n))
    print('DONALD TRUMP: \n ' + data['DonaldTrump'][0:n])
    print('\n\nJOE BIDEN: \n ' + data['JoeBiden'][0:n])

print_first_n_characters(500)
Figure 1: Printing sample tweets of Trump and Biden

One interesting observation is that, overall, Joe Biden's tweets are longer than Trump's.

import pickle
import cloudpickle as cp
from urllib.request import urlopen

lst_donald_trump_tweets = cp.load(urlopen("https://raw.githubusercontent.com/gomachinelearning/Blogs/master/DonaldTrumpTweets.pickle"))
lst_joe_biden_tweets = cp.load(urlopen("https://raw.githubusercontent.com/gomachinelearning/Blogs/master/JoeBidenTweets.pickle"))

print("Average number of characters per tweet:\n")
print("Donald Trump: {} , Joe Biden: {}".format(round(len(data['DonaldTrump'])/len(lst_donald_trump_tweets)), round(len(data['JoeBiden'])/len(lst_joe_biden_tweets))))
Figure 2: On average, Biden's tweets are longer than Trump's

Visualize the Data

Wordle, also known as a word cloud or tag cloud, is a visual representation of text data in the form of tags. A tag is usually a single word whose importance within the data is visualized through font size and color.

Wordles have their share of strengths and weaknesses, and I've heard some criticisms of them. But let's use one for now just to take a quick look and get an idea of what kind of data we're dealing with.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(background_color="white", colormap="Dark2", max_font_size=150, random_state=42)

def show_word_cloud(data, candidate):
    text = data[candidate]
    wordcloud = wc.generate(text)  # use the configured WordCloud object
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

show_word_cloud(data, 'DonaldTrump')
show_word_cloud(data, 'JoeBiden')
Figure 3: Trump’s raw tweets Wordle
Figure 4: Biden’s raw tweets Wordle

Oh well, we obviously need to clean up the data. For example, "https" is very prominent in both figures, but it is not actually significant to our analysis.

Stop words

Stop words refer to the commonly used words in a language (in this case English) such as a, the, is, at, and on. More often than not, they do not provide additional insight into the data and therefore have to be removed before we begin the analysis.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')

Most Frequently Used Words

Once the stop words have been filtered out, we can use several NLP tools to count the most frequently used words. One of them is the document-term matrix (DTM).

A DTM is a matrix that records the frequency of words in a collection of text documents. In our matrix, the rows will represent the collections of tweets and the columns will represent the words.
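To make the idea concrete, here is a toy example with two made-up "documents" (the words and counts here are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

toy_docs = ["america is great", "great jobs numbers in america"]
toy_cv = CountVectorizer(stop_words='english')
toy_counts = toy_cv.fit_transform(toy_docs)

# Each row is a document, each column a word ("is" and "in" are stop words):
print(pd.DataFrame(toy_counts.toarray(), columns=toy_cv.get_feature_names()))
#    america  great  jobs  numbers
# 0        1      1     0        0
# 1        1      1     1        1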

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
dict_dtm = {}

def document_term_matrix(strTweets):
    data_cv = cv.fit_transform([strTweets])
    return pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())

dict_dtm['DonaldTrump'] = document_term_matrix(data['DonaldTrump'])
dict_dtm['JoeBiden'] = document_term_matrix(data['JoeBiden'])

print('Donald Trump Tweets Document Term Matrix:\n')
print(dict_dtm['DonaldTrump'])
print('\n\n')
print('Joe Biden Tweets Document Term Matrix:\n')
print(dict_dtm['JoeBiden'])
Figure 5: Document Term Matrix

We got rid of the stop words but the DTMs are still gibberish. We still need to do a lot of cleaning up.

Let’s print the top 20 words of each candidate:

def top_n_words(candidate, n):
    top_dict = {}
    data = dict_dtm[candidate]
    for c in data.columns:
        top_dict[c] = data[c][0]
    sort_orders = sorted(top_dict.items(), key=lambda x: x[1], reverse=True)
    for i in sort_orders[0:n]:
        print(i[0], i[1])

print('Donald Trump:\n')
top_n_words('DonaldTrump', 20)
print('\n\nJoe Biden:\n')
top_n_words('JoeBiden', 20)
Figure 6: Most frequently used words

The words are promising enough that we can actually derive some meaningful insights from them, although we still need to do some cleanup.

Data Cleanup

Here are some cleanup tasks we can perform on the tweets. We can remove: URLs, numbers, punctuation, artifacts like "amp" and "rt", line breaks, and others.

In addition, we can limit the analysis to the first 100K characters of each candidate's most recent tweets, since we want to analyze the tweets that pertain to current events.

import re
import string

def clean_tweets(tweets):
    '''Make tweets lowercase; remove URLs, text in square brackets,
    punctuation, words containing numbers, and Twitter artifacts.'''
    tweets = tweets.lower()
    # Remove URLs first, before punctuation stripping mangles them
    tweets = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)
    tweets = re.sub(r'http\S+', '', tweets)
    tweets = re.sub(r'\[.*?\]', '', tweets)
    tweets = re.sub('[%s]' % re.escape(string.punctuation), '', tweets)
    tweets = re.sub(r'\w*\d\w*', '', tweets)
    tweets = re.sub('[‘’“”…]', '', tweets)
    tweets = re.sub('\n', '', tweets)
    tweets = re.sub('\r', ' ', tweets)
    tweets = tweets.replace(' amp ', '')   # leftover from HTML-escaped '&'
    tweets = tweets.replace(' rt ', '')    # retweet markers
    tweets = tweets.replace('realdonaldtrump', '')
    return tweets

clean_data = {}
clean_data['DonaldTrump'] = clean_tweets(data['DonaldTrump'])[0:100000]
clean_data['JoeBiden'] = clean_tweets(data['JoeBiden'])[0:100000]
len(clean_data['DonaldTrump']), len(clean_data['JoeBiden'])  # (100000, 100000)

Let’s compare a sample of raw and clean data. Here’s Trump’s sample data before and after data cleansing:

print('Print 200 characters only:\n')
print('Donald Trump raw data:\n')
print(data['DonaldTrump'][0:200])
print('\n\nDonald Trump clean data:\n')
print(clean_data['DonaldTrump'][0:200])
print('\n\n')
print('Joe Biden raw data:\n')
print(data['JoeBiden'][0:200])
print('\n\nJoe Biden clean data:\n')
print(clean_data['JoeBiden'][0:200])
Figure 7: Donald Trump sample data before and after data cleansing

Biden’s sample data before and after data cleansing:

Figure 8: Joe Biden sample data before and after data cleansing

Let's show the wordles again with the clean data.
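A quick way to regenerate them is to reuse the show_word_cloud helper from earlier, this time pointed at the cleaned text (a minimal sketch, assuming the helper and clean_data defined above):

show_word_cloud(clean_data, 'DonaldTrump')
show_word_cloud(clean_data, 'JoeBiden')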

Figure 9: Donald Trump Wordle using clean data

Front and center is the word "great." This is most likely because of Trump's campaign slogans, Make America Great Again and Keep America Great. Trump does not seem to refer to Joe Biden often in his tweets.

Figure 10: Joe Biden Wordle using clean data

Joe Biden seems to refer to Donald Trump more often in his tweets, as can be seen in the wordle. Biden also often used the word crisis, most likely in reference to the coronavirus pandemic.

Let's print the most frequently used words again, this time with the clean data.

dict_dtm['DonaldTrump'] = document_term_matrix(clean_data['DonaldTrump'])
dict_dtm['JoeBiden'] = document_term_matrix(clean_data['JoeBiden'])

# top_n_words expects the candidate key, not the DTM itself
print('Donald Trump:\n')
top_n_words('DonaldTrump', 20)
print('\n\nJoe Biden:\n')
top_n_words('JoeBiden', 20)
Figure 11: Most frequently used words using clean data

This is obviously much better compared to our first run. We can see plenty of words here that are relevant to what’s currently happening in the country.

Sentiment Analysis

One of the goals of sentiment analysis is to develop machines that are capable of interpreting human emotions, mimicking the way humans experience and respond to the emotions of other people.


We are going to use TextBlob to predict the sentiment of the tweets. TextBlob is a Python library for processing textual data that provides an API for sentiment analysis. According to the TextBlob documentation, the sentiment API returns two properties for a given piece of text: polarity and subjectivity. Polarity tells whether the text is negative or positive; its value ranges from -1.0 (very negative) to 1.0 (very positive). Subjectivity tells whether the text is objective or subjective; its value ranges from 0.0 (very objective) to 1.0 (very subjective). First, let's test how TextBlob performs on this set of example statements:

from textblob import TextBlob

statements = []
statements.append("I will vote for that mayor again")
statements.append("The senator is the worst politician ever!")
statements.append("Make America Great Again")
statements.append("Keep America Great")
statements.append("OUR BEST DAYS STILL LIE AHEAD")
statements.append("The earth revolves around the sun")
statements.append("The sun revolves around the earth")

for statement in statements:
    blob = TextBlob(statement)
    print(statement + ' ' + str(blob.sentiment))
Figure 12: TextBlob evaluates the Sentiments of these sample Texts

Take note of the difference between the first and second examples. The first statement is fairly neutral, while the second, with the word "worst", is scored as very negative and also very subjective.

Donald Trump's campaign slogans, Make America Great Again and Keep America Great, are less positive and more subjective compared to Joe Biden's OUR BEST DAYS STILL LIE AHEAD. (Note: I am not 100% sure that is actually Biden's slogan, but it is what I saw on his campaign website.)

The example "The earth revolves around the sun" is neutral and very objective. However, TextBlob cannot tell us whether a statement is true or not; see the last example.

Let's run sentiment analysis on the most recent 200 tweets of Trump and Biden, as they pertain more to current events.

import pickle
import cloudpickle as cp
from urllib.request import urlopen

lst_donald_trump_tweets = cp.load(urlopen("https://raw.githubusercontent.com/gomachinelearning/Blogs/master/DonaldTrumpTweets.pickle"))
lst_joe_biden_tweets = cp.load(urlopen("https://raw.githubusercontent.com/gomachinelearning/Blogs/master/JoeBidenTweets.pickle"))
len(lst_donald_trump_tweets), len(lst_joe_biden_tweets)

lst_donald_trump_polarity = []
lst_donald_trump_subjectivity = []
lst_joe_biden_polarity = []
lst_joe_biden_subjectivity = []

# number of tweets to analyze
n = 200

def sentiment_analysis(lst_tweets, lst_polarity, lst_subjectivity):
    for tweet in lst_tweets[0:n]:
        blob = TextBlob(clean_tweets(tweet))
        lst_polarity.append(blob.polarity)
        lst_subjectivity.append(blob.subjectivity)

sentiment_analysis(lst_donald_trump_tweets, lst_donald_trump_polarity, lst_donald_trump_subjectivity)
sentiment_analysis(lst_joe_biden_tweets, lst_joe_biden_polarity, lst_joe_biden_subjectivity)
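The scatter plot in Figure 13 can be produced along these lines (a sketch; the colors and labels are my own choices, not necessarily the ones used in the figure):

import matplotlib.pyplot as plt

plt.scatter(lst_donald_trump_polarity, lst_donald_trump_subjectivity, color='red', alpha=0.5, label='Donald Trump')
plt.scatter(lst_joe_biden_polarity, lst_joe_biden_subjectivity, color='blue', alpha=0.5, label='Joe Biden')
plt.xlabel('Polarity')
plt.ylabel('Subjectivity')
plt.title('Sentiment of the Most Recent 200 Tweets')
plt.legend()
plt.show()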
Figure 13: Trump's and Biden's most recent 200 tweets

Donald Trump's most recent tweets are clustered towards the right of the graph, with increasing positivity and subjectivity, while Joe Biden's tweets are bottom-heavy: more neutral and more objective.

Let’s compare the polarity and subjectivity of their tweets side by side using box plots.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.c_[np.array(lst_donald_trump_polarity), np.array(lst_joe_biden_polarity)], columns=['Donald Trump', 'Joe Biden'])
df.plot.box(grid=True, title='Polarity of Trump and Biden Tweets')

df = pd.DataFrame(np.c_[np.array(lst_donald_trump_subjectivity), np.array(lst_joe_biden_subjectivity)], columns=['Donald Trump', 'Joe Biden'])
df.plot.box(grid=True, title='Subjectivity of Trump and Biden Tweets')

Trump has been posting more positive tweets than Biden, but Biden tends to post more extreme tweets in terms of polarity; that is, his tweets are more often extremely positive or extremely negative.

Trump's tweets also span a much wider range of subjectivity than Biden's.

Figure 14: Donald Trump and Joe Biden most recent tweets

Trump’s recent posts are more positive and more subjective compared to Biden’s.

Topic Modeling using Latent Dirichlet Allocation (LDA)

In this section we will discover abstract topics in groups of tweets from each candidate. I will split each candidate's tweets into three groups of 1,000 characters and use LDA to predict the topics, so we can see how the candidates' topics changed over time.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, walking, walked, and walk will be grouped together and analyzed as walk. We will use the WordNetLemmatizer from the Python Natural Language Toolkit (nltk) to pre-process and lemmatize the tweets.
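For example (a quick check; note that lemmatize defaults to treating words as nouns, so we pass pos='v' to lemmatize verb forms):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('walking', pos='v'))  # walk
print(lemmatizer.lemmatize('walked', pos='v'))   # walk
print(lemmatizer.lemmatize('walks', pos='v'))    # walk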

Another useful technique in LDA topic modeling is to drop every word in the documents except the nouns and adjectives. In the code below, tags starting with NN mark nouns and tags starting with JJ mark adjectives. The following line of code identifies the nouns and adjectives in the tweets:

pos[:2] == 'NN' or pos[:2] == 'JJ'
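These are Penn Treebank part-of-speech tags, so checking only the first two characters also keeps proper nouns (NNP) and comparative and superlative adjectives (JJR, JJS). A quick illustration (this assumes nltk's punkt tokenizer and averaged_perceptron_tagger resources have been downloaded):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("America is a great country")))
# [('America', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('country', 'NN')]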

Here’s the code that will process the data and generate topics.

#install gensim as needed
#!pip install gensim
from gensim import matutils, models
import scipy.sparse
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()

def get_corpus(dtm):
    sparse_counts = scipy.sparse.csr_matrix(dtm)
    corpus = matutils.Sparse2Corpus(sparse_counts)
    return corpus

# Tokenize, keep nouns and adjectives only, and lemmatize them
def nouns_adjectives(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [lemmatizer.lemmatize(word) for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)

def generate_topics(str_tweets):
    dtm = document_term_matrix(str_tweets)
    corpus = get_corpus(dtm)
    id2word = dict((v, k) for k, v in cv.vocabulary_.items())
    lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=1, passes=100, minimum_probability=0.5)
    return lda

Let's group the tweets into three groups of 1,000 characters each and print the topics within each group.

import matplotlib.gridspec as gridspec
import math
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def print_topics(candidate, index_from, index_to):
    n_topics = 1
    dtm = document_term_matrix(nouns_adjectives(clean_data[candidate][index_from:index_to]))
    corpus = get_corpus(dtm)
    id2word = dict((v, k) for k, v in cv.vocabulary_.items())
    lda = models.LdaModel(corpus, num_topics=n_topics, id2word=id2word, passes=50)
    topics = [[topic for topic, _ in lda.show_topic(topic_id, topn=5)] for topic_id in range(lda.num_topics)]
    print(topics)

print_topics('DonaldTrump', 0, 1000)
print_topics('JoeBiden', 0, 1000)
print_topics('DonaldTrump', 1000, 2000)
print_topics('JoeBiden', 1000, 2000)
print_topics('DonaldTrump', 2000, 3000)
print_topics('JoeBiden', 2000, 3000)
Figure 15: Changing topics over time. Group 1 is the most recent. Group 3 is the oldest.

Conclusion

The purpose of this post is to show the differences between President Donald Trump and former Vice President Joe Biden in terms of their tweets, using machine learning. I purposely refrained from making my own interpretation of the results since I wanted this to be as objective as possible. I'll leave it up to the readers to make their own interpretations.
