Data Harvesting and NLP for Customer Feedback Analysis
Using the Twitter API, MongoDB Atlas, and spaCy.
In this tutorial, we will build a data mining pipeline to extract and analyse customer feedback using the Twitter API. This article is geared towards people starting their NLP journey with Python, or data analysts who want to improve their EDA (Exploratory Data Analysis) process.
I have divided the pipeline into 4 stages:
- Fetching tweets using the Twitter API
- Analysing and extracting the keywords from tweets using the spaCy library for Python.
- Storing the tweets and keywords on the cloud using MongoDB Atlas.
- Creating data visualisation on MongoDB Atlas.
Fetching tweets using the Twitter API
To start, you will need a Twitter developer account to get the API credentials. You can watch this video to get more details. Once you have successfully created a developer account, create a file named config.ini to save all your credentials in one place.
[twitter]
api_key = APIKEY
api_key_secret = APIKEYSECRET
access_token = ACCESSTOKEN
access_token_secret = ACCESSTOKENSECRET
bearer_token = BEARERTOKEN
Before moving on to fetching the tweets, we need to authenticate with the API. We will use the tweepy library to simplify the process.
import tweepy
from configparser import ConfigParser
#configuration
config = ConfigParser()
config.read('config.ini')
#twitter credentials
api_key = config['twitter']['api_key']
api_key_secret = config['twitter']['api_key_secret']
access_token = config['twitter']['access_token']
access_token_secret = config['twitter']['access_token_secret']
#authentication
def twitter_auth():
    auth = tweepy.OAuthHandler(api_key, api_key_secret)
    auth.set_access_token(access_token, access_token_secret)
    return auth

#api
def twitter_api():
    auth = twitter_auth()
    api = tweepy.API(auth)
    return api
For this project, I will be fetching tweets tagged with ‘@VFSGlobal’ (as that’s the company I am currently interning at) that contain the keywords ‘urgent’ or ‘help’, ignoring all retweets. The query for this is: @VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets. tweepy’s Cursor() handles pagination and lets us set the parameters (query, count, result type) to fetch the desired tweets.
The final function:
#fetching tweets
def fetch_df_tweets():
    api = twitter_api()
    query_topic = '@VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets'
    tweets = tweepy.Cursor(api.search_tweets, q=query_topic, count=200, tweet_mode='extended', result_type='recent').items(200)
    return converting_to_df(tweets)
converting_to_df() is the function that converts all the tweets into a pandas dataframe. A dataframe makes it easier to store and manipulate the data items, and since we will be using MongoDB to store the tweets (more on that later), we can easily post the dataframe to the collection. The converting_to_df() function:
def converting_to_df(tweets):
    columns = ['_id', 'User', 'Tweet', 'Date and Time']
    data = []
    for tweet in tweets:
        text = tweet.full_text.split() #split the tweet into words
        resultwords = filter(lambda x: x[0] != '@', text) #remove all @mentions
        result = ' '.join(resultwords) #merge the remaining words
        data.append([tweet.id, tweet.user.screen_name, result.capitalize(), tweet.created_at])
    df = pd.DataFrame(data, columns=columns)
    return df
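The @mention filter inside converting_to_df() works on whitespace-split tokens; here is a quick standalone check of that logic (the sample tweet text is made up for illustration):

```python
# Sample tweet text (made up) with the @mentions the filter should drop
text = "@VFSGlobal @Vfsglobalcare my visa application is urgent, please help!".split()

# Keep only the words that are not @mentions, then stitch them back together
result = ' '.join(w for w in text if not w.startswith('@'))
# result == "my visa application is urgent, please help!"
```

In the function itself, result.capitalize() then uppercases the first character of the remaining text (and lowercases the rest).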
Analysing and extracting the keywords from tweets using the spaCy library for Python.
We will be using a spaCy model to process the tweets. The NLP process consists of tokenization and lemmatization, and the analysis will be based on the overall frequency of keywords.
NLP steps:
- Sentence Tokenization (breaking tweets into sentences).
- Eliminating stop-words (unimportant words).
- Lemmatization (the process of grouping the inflected forms of a word so they can be analysed as a single item).
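These steps boil down to cleaning the text and counting what survives. As a rough sketch of that idea using only the standard library (the sample tweets and stop-word set below are invented for illustration; the real pipeline uses spaCy’s lemmatizer and stop-word list):

```python
import re
from collections import Counter

# Hypothetical cleaned tweets, standing in for the 'Tweet' column
tweets = [
    "Urgent help needed, my passport is stuck!",
    "Please help, the appointment was delayed again.",
]

# Hand-picked words we don't want to count (mirrors the stop-word idea)
stop_words = {"is", "my", "the", "was", "please", "urgent", "help"}

words = []
for line in tweets:
    line = re.sub(r'[^\w\s]', '', line)  # strip punctuation
    words += [w for w in line.lower().split() if w not in stop_words]

# Frequency of the remaining keywords
freq = Counter(words)
```

spaCy replaces the naive `split()` with proper tokenization and collapses inflected forms (“delayed” → “delay”) before counting.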
The first step is to load the spaCy model with spacy.load("en_core_web_lg"). If the model is not installed yet, you can download it once via spacy.cli.
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import pandas as pd
# spacy.cli.download("en_core_web_lg") #download the model only once
nlp = spacy.load("en_core_web_lg")
def chunking(df):
    df = df['Tweet']
    all_sentences = []
    #sentence tokenization
    for sentence in df:
        all_sentences.append(sentence)
    #lemmatization
    lemma = []
    for line in all_sentences:
        line = re.sub(r'[^\w\s]', '', line) #filter out @mentions and punctuation
        if line != '':
            doc = nlp(line.lstrip().lower())
            for token in doc:
                lemma.append(token.lemma_)
    #removing all stopwords
    lemma2 = []
    custom_stop_words = ['please', 'try', 'vfs', 'day', 'need', 'hi', 'apply', 'visa', ' ']
    for word in lemma:
        if word not in custom_stop_words:
            lexeme = nlp.vocab[word]
            if lexeme.is_stop == False:
                lemma2.append(word)
    df2 = pd.DataFrame(lemma2)
    #skipping the search keywords
    searchfor = ["urgent", "help"]
    df2 = df2[df2[0].str.contains('|'.join(searchfor)) == False]
    df2 = df2.value_counts().rename_axis('_id').reset_index(name='counts')
    print(df2)
    return df2
value_counts() counts the occurrence of each word, which gives us the keyword frequency. The words themselves become the _id. custom_stop_words is a list of additional words that I don’t want to count.
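To see what that value_counts() chain produces, here is a toy run on a hand-made keyword list (the words are invented; lemma2 would normally come from the spaCy step above):

```python
import pandas as pd

# Toy keyword list standing in for lemma2
df2 = pd.DataFrame(["delay", "passport", "delay", "appointment", "delay"])

# Count each word; the word column becomes '_id', the frequency column 'counts'
df2 = df2.value_counts().rename_axis('_id').reset_index(name='counts')
```

The result is sorted by frequency, so “delay” lands in the first row with a count of 3.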
The overall NLP process is deliberately simple; going deeper would complicate the pipeline and lengthen this article.
Storing the tweets and keywords on the cloud using MongoDB Atlas.
You will need a MongoDB Atlas account for this step. MongoDB allows you to create a shared cluster for free.
Click on Connect > Connect your application > Select Python and copy the connection string.
Add the connection string to the config.ini file created earlier.
[mongodb]
connection_string = CONNECTIONSTRING
To connect to MongoDB from Python, we will use the pymongo library.
from pymongo import MongoClient, errors
from configparser import ConfigParser
#configuration
config = ConfigParser()
config.read('config.ini')
# Connect to MongoDB
def get_database():
    # Provide the MongoDB Atlas URL to connect Python to MongoDB using pymongo
    CONNECTION_STRING = config['mongodb']['connection_string']
    # Create a connection using MongoClient
    client = MongoClient(CONNECTION_STRING)
    # Return the database (we will use the same database throughout the tutorial)
    return client['twitter']
We will insert the data using two functions: insert_df_tweets() and insert_df_count(). insert_df_tweets() will insert the tweets and insert_df_count() will insert the keyword counts.
def insert_df_tweets(df):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_tweets = dbname['tweets']
    # Insert the dataframe into the collection
    try:
        collection_tweets.insert_many(df.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate tweets")

def insert_df_count(df2):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_count = dbname['count']
    # Insert the dataframe into the collection
    try:
        collection_count.insert_many(df2.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate values")
To avoid re-inserting tweets that are already stored, we catch the BulkWriteError raised for duplicate _id values in a try/except block; ordered=False lets insert_many() keep inserting the remaining documents after a duplicate is hit.
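insert_many() takes a list of documents, and df.to_dict('records') produces exactly that shape, one dict per dataframe row (the values below are made up):

```python
import pandas as pd

# A miniature tweets dataframe (made-up values)
df = pd.DataFrame({'_id': [101, 102], 'User': ['alice', 'bob']})

# One dict per row -- the document format insert_many() expects
records = df.to_dict('records')
# records == [{'_id': 101, 'User': 'alice'}, {'_id': 102, 'User': 'bob'}]
```

Because _id is MongoDB’s primary key, re-running the pipeline with overlapping tweet IDs is what triggers the duplicate-key errors we handle.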
Bringing everything together:
if __name__ == "__main__":
    #fetch tweets in a dataframe
    df = fetch_df_tweets()
    #insert the dataframe into MongoDB
    insert_df_tweets(df)
    #chunk the tweets into keyword counts
    df2 = chunking(df)
    #insert the keyword counts into MongoDB
    insert_df_count(df2)
Congratulations, you have successfully inserted all the values into your MongoDB.
Creating data visualisation on MongoDB Atlas.
This is the most fun and easy part thanks to the built-in data visualisation features of MongoDB Atlas.
Click on Visualize Your Data in the top-right corner.
Create a new dashboard by clicking on Add Dashboard.
Add a new chart and select your current cluster as the data source.
Add _id to the X-axis and counts to the Y-axis.
Voila! You created a data visualization in no time. There is a lot more that can be done to improve the visualization but I hope this tutorial gave you a basic understanding of how you can build dashboards and charts using data harvested from Twitter for feedback analysis.
Hope you found this tutorial to be helpful. Let me know in the comments if you have any doubts or suggestions to improve the overall process.