Data Harvesting and NLP for Customer Feedback Analysis

Using the Twitter API, MongoDB Atlas, and spaCy.

Om Kamath
Level Up Coding


In this tutorial, we will build a data mining pipeline to extract and analyse customer feedback using the Twitter API. This article is geared towards people who want to start their NLP journey with Python, as well as data analysts who want to improve their EDA (exploratory data analysis) process.

I have divided the pipeline into 4 stages:

  1. Fetching tweets using the Twitter API
  2. Analysing and extracting the keywords from tweets using the spaCy library for Python.
  3. Storing the tweets and keywords on the cloud using MongoDB Atlas.
  4. Creating data visualisation on MongoDB Atlas.

Fetching tweets using the Twitter API

To start, you will need a Twitter developer account to get the API credentials. You can watch this video for more details. Once you have successfully created a developer account, create a file named config.ini to keep all your credentials in one place.

[twitter]
api_key = APIKEY
api_key_secret = APIKEYSECRET
access_token = ACCESSTOKEN
access_token_secret = ACCESSTOKENSECRET
bearer_token = BEARERTOKEN

Before moving on to fetching the tweets, we need to authenticate with the API. We will use the tweepy library to make the process simpler.

import tweepy
from configparser import ConfigParser


#configuration
config = ConfigParser()
config.read('config.ini')

#twitter credentials
api_key = config['twitter']['api_key']
api_key_secret = config['twitter']['api_key_secret']
access_token = config['twitter']['access_token']
access_token_secret = config['twitter']['access_token_secret']

#authentication
def twitter_auth():
    auth = tweepy.OAuthHandler(api_key, api_key_secret)
    auth.set_access_token(access_token, access_token_secret)
    return auth

#api
def twitter_api():
    auth = twitter_auth()
    api = tweepy.API(auth)
    return api

For this project, I will be fetching tweets tagged with ‘@VFSGlobal’ (as that’s the company I am currently interning at) that contain the keywords ‘urgent’ or ‘help’, ignoring all retweets. The query will be: @VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets.
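If you want to adapt the pipeline to other accounts or keywords, the query string can be assembled from its parts. Here is a minimal sketch with a hypothetical build_query() helper (the handles and keywords are just the ones used in this tutorial):

def build_query(handles, keywords):
    # mentions of any of the handles, containing any of the keywords, excluding retweets
    mentions = ' OR '.join(handles)
    terms = ' OR '.join(keywords)
    return f'{mentions} AND {terms} -filter:retweets'

query_topic = build_query(['@VFSGlobal', '@Vfsglobalcare'], ['urgent', 'help'])
# -> '@VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets'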

Cursor() is a tweepy helper that handles pagination for us: we pass it the API method and the query parameters, and .items(n) returns up to n matching tweets.

The final function:

#fetching tweets
def fetch_df_tweets():
    api = twitter_api()
    query_topic = '@VFSGlobal OR @Vfsglobalcare__ AND urgent OR help -filter:retweets'
    tweets = tweepy.Cursor(api.search_tweets, q=query_topic, count=200,
                           tweet_mode='extended', result_type='recent').items(200)
    return converting_to_df(tweets)

converting_to_df() is the function that converts all the tweets into a pandas dataframe. A dataframe makes it easier to store and manipulate the data items. As we will be using MongoDB to store the tweets (more on that later), we can easily post the dataframe records to the collection.

converting_to_df() function:

import pandas as pd

def converting_to_df(tweets):
    columns = ['_id', 'User', 'Tweet', 'Date and Time']
    data = []
    for tweet in tweets:
        text = tweet.full_text.split() #split the tweet into words
        resultwords = filter(lambda x: x[0] != '@', text) #remove all @mentions
        result = ' '.join(resultwords) #merge the remaining words
        data.append([tweet.id, tweet.user.screen_name, result.capitalize(), tweet.created_at])
    df = pd.DataFrame(data, columns=columns)
    return df

Sample dataframe output

Analysing and extracting the keywords from tweets using the spaCy library for Python.

We will use a spaCy model to process the tweets. The NLP process consists of tokenization, stop-word removal and lemmatization, and the analysis will be based on the overall frequency of keywords.

NLP steps:

  1. Sentence Tokenization (breaking tweets into sentences).
  2. Eliminating stop-words (unimportant words).
  3. Lemmatization (the process of grouping the inflected forms of a word so they can be analysed as a single item).
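To make steps 2 and 3 concrete before we look at the full function, here is a minimal sketch of stop-word removal and lemmatization on a single made-up sentence (the sentence and the output comment are illustrative only):

import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("we are urgently waiting for our delayed passports")
# keep only non-stop-word tokens, reduced to their base (lemma) form
lemmas = [token.lemma_ for token in doc if not token.is_stop]
print(lemmas)  # content words in base form, e.g. 'wait', 'passport'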

The first step is to load the spaCy model with spacy.load("en_core_web_lg"). You may need to download the model once using spacy.cli.download.

import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import pandas as pd

# spacy.cli.download("en_core_web_lg") #download the model only once
nlp = spacy.load("en_core_web_lg")


def chunking(df):

    df = df['Tweet']
    all_sentences = []

    #sentence tokenization
    for sentence in df:
        all_sentences.append(sentence)

    #lemmatization
    lemma = []
    for line in all_sentences:
        line = re.sub(r'[^\w\s]', '', line) #remove punctuation
        if line != '':
            doc = nlp(line.lstrip().lower())
            for token in doc:
                lemma.append(token.lemma_)

    #removing all stopwords
    lemma2 = []
    custom_stop_words = ['please','try','vfs','day','need','hi','apply','visa',' ']

    for word in lemma:
        if word not in custom_stop_words:
            lexeme = nlp.vocab[word]
            if lexeme.is_stop == False:
                lemma2.append(word)

    df2 = pd.DataFrame(lemma2)

    #skipping the search keywords
    searchfor = ["urgent", "help"]
    df2 = df2[df2[0].str.contains('|'.join(searchfor)) == False]
    df2 = df2.value_counts().rename_axis('_id').reset_index(name='counts')

    print(df2)
    return df2

value_counts() counts the occurrences of each word, which gives us the keyword frequencies. The words themselves become the _id column, which MongoDB uses as the unique document key.
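As a quick illustration of what this step produces, here is a minimal sketch on a toy list of lemmas (the words are made up for the example):

import pandas as pd

toy = pd.DataFrame(['passport', 'refund', 'passport', 'appointment', 'passport'])
counts = toy.value_counts().rename_axis('_id').reset_index(name='counts')
print(counts)
# a dataframe with columns ['_id', 'counts'], sorted by frequency:
# passport appears 3 times, refund and appointment once each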

custom_stop_words is a list of words which I don’t want to count.

The overall NLP process is deliberately simple; diving deeper into it would complicate the pipeline and make this article much longer.

Storing the tweets and keywords on the cloud using MongoDB Atlas.

You will need a MongoDB Atlas account for this step. MongoDB allows you to create a shared cluster for free.

The MongoDB dashboard

Click on Connect > Connect your application > Select Python and copy the connection string.

Add the connection string to your config.ini file created earlier.

[mongodb]
connection_string = CONNECTIONSTRING

Create a database in MongoDB Atlas with two collections: count and tweets.

To connect to MongoDB, we need to use the pymongo library.

from pymongo import MongoClient, errors
from configparser import ConfigParser

#configuration
config = ConfigParser()
config.read('config.ini')

# Connect to MongoDB
def get_database():

    # Provide the MongoDB Atlas url to connect Python to MongoDB using pymongo
    CONNECTION_STRING = config['mongodb']['connection_string']

    # Create a connection using MongoClient
    client = MongoClient(CONNECTION_STRING)

    # Return the database (we will use the same database throughout the tutorial)
    return client['twitter']
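If you want to verify the connection string before inserting anything, MongoDB's ping command is a quick check. A minimal sketch, reusing the config.ini from earlier:

from configparser import ConfigParser
from pymongo import MongoClient

config = ConfigParser()
config.read('config.ini')

client = MongoClient(config['mongodb']['connection_string'])
client.admin.command('ping')  # raises an exception if the cluster is unreachable
print("Connected to MongoDB Atlas")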

We will insert the data using two functions: insert_df_tweets() and insert_df_count(). insert_df_tweets() inserts the tweets and insert_df_count() inserts the keyword counts.

def insert_df_tweets(df):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_tweets = dbname['tweets']
    # Insert the dataframe into the collection
    try:
        collection_tweets.insert_many(df.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate tweets")

def insert_df_count(df2):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_count = dbname['count']
    # Insert the dataframe into the collection
    try:
        collection_count.insert_many(df2.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate values")

Because _id must be unique within a collection, re-inserting tweets or keywords that are already stored raises a BulkWriteError. With ordered=False the remaining documents are still inserted, and we handle the error with a try/except block so duplicates are simply skipped.
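An alternative to catching the error is to upsert each document, so re-running the pipeline updates existing records instead of skipping them. Here is a minimal sketch using pymongo's bulk_write with a hypothetical upsert_df() helper (not what the code above does, just an option):

from pymongo import ReplaceOne

def upsert_df(df, collection):
    # replace the document with the same _id if it exists, otherwise insert it
    operations = [
        ReplaceOne({'_id': record['_id']}, record, upsert=True)
        for record in df.to_dict('records')
    ]
    if operations:
        collection.bulk_write(operations, ordered=False)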

Bringing everything together:

if __name__ == "__main__":
    #fetch tweets in a dataframe
    df = fetch_df_tweets()
    #insert the dataframe into mongodb
    insert_df_tweets(df)
    #chunking tweets
    df2 = chunking(df)
    #insert the dataframe into mongodb (chunks with counts)
    insert_df_count(df2)

Congratulations, you have successfully inserted all the values into your MongoDB collections.
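If you would rather confirm the insert from Python than from the Atlas UI, a quick check could look like this (a minimal sketch reusing get_database() from above):

db = get_database()
print(db['tweets'].count_documents({}))  # number of stored tweets
print(db['count'].find_one())            # one of the keyword-count documents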

Your dashboard should look like this.

Creating data visualisation on MongoDB Atlas.

This is the easiest and most fun part, thanks to the built-in data visualisation features of MongoDB Atlas.

  1. Click on Visualize Your Data in the top-right corner.
  2. Create a new dashboard by clicking on Add Dashboard.
  3. Add a new chart and select your current cluster as the data source.
  4. Select the count collection, as we will be analysing the frequency of keywords.
  5. Add _id to the X-axis and counts to the Y-axis.
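Before building the chart, you can also peek at the most frequent keywords straight from Python as a sanity check (a minimal sketch reusing get_database() from above):

db = get_database()
for doc in db['count'].find().sort('counts', -1).limit(10):
    print(doc['_id'], doc['counts'])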

Final Dashboard

Voila! You created a data visualisation in no time. There is a lot more that can be done to improve it, but I hope this tutorial gave you a basic understanding of how to build dashboards and charts using data harvested from Twitter for feedback analysis.

I hope you found this tutorial helpful. Let me know in the comments if you have any doubts or suggestions to improve the overall process.

Links:

  1. GitHub
  2. MongoDB Atlas
  3. spaCy
  4. Twitter Developer Platform

If you liked my blogs, you could buy me a coffee.
