Data Harvesting and NLP for Customer Feedback Analysis

Using the Twitter API, MongoDB Atlas, and spaCy.

Om Kamath
Level Up Coding


In this tutorial, we will build a data mining pipeline to extract and analyse customer feedback using the Twitter API. This article is geared towards people who want to start their NLP journey with Python, as well as data analysts who want to improve their EDA (exploratory data analysis) process.

I have divided the pipeline into 4 stages:

  1. Fetching tweets using the Twitter API
  2. Analysing and extracting the keywords from tweets using the spaCy library for Python.
  3. Storing the tweets and keywords on the cloud using MongoDB Atlas.
  4. Creating data visualisation on MongoDB Atlas.

Fetching tweets using the Twitter API

To start, you will need a Twitter developer account to get the API credentials. You can watch this video for more details. Once you have successfully created a developer account, create a file named config.ini to keep all your credentials in one place.

[twitter]
api_key = APIKEY
api_key_secret = APIKEYSECRET
access_token = ACCESSTOKEN
access_token_secret = ACCESSTOKENSECRET
bearer_token = BEARERTOKEN

Before moving on to fetching the tweets, we need to authenticate with the API. We will use the tweepy library to make the process simpler.

import tweepy
from configparser import ConfigParser


#configuration
config = ConfigParser()
config.read('config.ini')

#twitter credentials
api_key = config['twitter']['api_key']
api_key_secret = config['twitter']['api_key_secret']
access_token = config['twitter']['access_token']
access_token_secret = config['twitter']['access_token_secret']

#authentication
def twitter_auth():
    auth = tweepy.OAuthHandler(api_key, api_key_secret)
    auth.set_access_token(access_token, access_token_secret)
    return auth

#api
def twitter_api():
    auth = twitter_auth()
    api = tweepy.API(auth)
    return api

For this project, I will be fetching tweets tagged with ‘@VFSGlobal’ (as that’s the company I am currently interning at) that contain the keywords ‘urgent’ or ‘help’, ignoring all retweets. The query will be: @VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets.
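If you want to adapt the pipeline to other accounts or keywords, the query string can be assembled from its parts. Here is a minimal sketch with a hypothetical build_query() helper (the handles and keywords are just the ones used in this tutorial):

def build_query(handles, keywords):
    # mentions of any of the handles, containing any of the keywords, excluding retweets
    mentions = ' OR '.join(handles)
    terms = ' OR '.join(keywords)
    return f'{mentions} AND {terms} -filter:retweets'

query_topic = build_query(['@VFSGlobal', '@Vfsglobalcare'], ['urgent', 'help'])
# -> '@VFSGlobal OR @Vfsglobalcare AND urgent OR help -filter:retweets'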

Cursor() is a tweepy helper that handles pagination for us: we pass it the API method and the query parameters, and .items(n) returns up to n matching tweets.

The final function:

#fetching tweets
def fetch_df_tweets():
    api = twitter_api()
    query_topic = '@VFSGlobal OR @Vfsglobalcare__ AND urgent OR help -filter:retweets'
    tweets = tweepy.Cursor(api.search_tweets, q=query_topic, count=200,
                           tweet_mode='extended', result_type='recent').items(200)
    return converting_to_df(tweets)

converting_to_df() is the function that converts all the tweets into a pandas dataframe. A dataframe makes it easier to store and manipulate the data items. As we will be using MongoDB to store the tweets (more on that later), we can easily post the dataframe records to the collection.

converting_to_df() function:

import pandas as pd

def converting_to_df(tweets):
    columns = ['_id', 'User', 'Tweet', 'Date and Time']
    data = []
    for tweet in tweets:
        text = tweet.full_text.split() #split the tweet into words
        resultwords = filter(lambda x: x[0] != '@', text) #remove all @mentions
        result = ' '.join(resultwords) #merge the remaining words
        data.append([tweet.id, tweet.user.screen_name, result.capitalize(), tweet.created_at])
    df = pd.DataFrame(data, columns=columns)
    return df

Sample dataframe output

Analysing and extracting the keywords from tweets using the spaCy library for Python.

We will use a spaCy model to process the tweets. The NLP process consists of tokenization, stop-word removal and lemmatization, and the analysis will be based on the overall frequency of keywords.

NLP steps:

  1. Sentence Tokenization (breaking tweets into sentences).
  2. Eliminating stop-words (unimportant words).
  3. Lemmatization (the process of grouping the inflected forms of a word so they can be analysed as a single item).
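To make steps 2 and 3 concrete before we look at the full function, here is a minimal sketch of stop-word removal and lemmatization on a single made-up sentence (the sentence and the output comment are illustrative only):

import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("we are urgently waiting for our delayed passports")
# keep only non-stop-word tokens, reduced to their base (lemma) form
lemmas = [token.lemma_ for token in doc if not token.is_stop]
print(lemmas)  # content words in base form, e.g. 'wait', 'passport'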

The first step is to load the spaCy model with spacy.load("en_core_web_lg"). You may need to download the model once using spacy.cli.download.

import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import pandas as pd

# spacy.cli.download("en_core_web_lg") #download the model only once
nlp = spacy.load("en_core_web_lg")


def chunking(df):

    df = df['Tweet']
    all_sentences = []

    #sentence tokenization
    for sentence in df:
        all_sentences.append(sentence)

    #lemmatization
    lemma = []
    for line in all_sentences:
        line = re.sub(r'[^\w\s]', '', line) #remove punctuation
        if line != '':
            doc = nlp(line.lstrip().lower())
            for token in doc:
                lemma.append(token.lemma_)

    #removing all stopwords
    lemma2 = []
    custom_stop_words = ['please','try','vfs','day','need','hi','apply','visa',' ']

    for word in lemma:
        if word not in custom_stop_words:
            lexeme = nlp.vocab[word]
            if lexeme.is_stop == False:
                lemma2.append(word)

    df2 = pd.DataFrame(lemma2)

    #skipping the search keywords
    searchfor = ["urgent", "help"]
    df2 = df2[df2[0].str.contains('|'.join(searchfor)) == False]
    df2 = df2.value_counts().rename_axis('_id').reset_index(name='counts')

    print(df2)
    return df2

value_counts() counts the occurrences of each word, which gives us the keyword frequencies. The words themselves become the _id column, which MongoDB uses as the unique document key.
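As a quick illustration of what this step produces, here is a minimal sketch on a toy list of lemmas (the words are made up for the example):

import pandas as pd

toy = pd.DataFrame(['passport', 'refund', 'passport', 'appointment', 'passport'])
counts = toy.value_counts().rename_axis('_id').reset_index(name='counts')
print(counts)
# a dataframe with columns ['_id', 'counts'], sorted by frequency:
# passport appears 3 times, refund and appointment once each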

custom_stop_words is a list of words which I don’t want to count.

The overall NLP process is deliberately simple; diving deeper into it would complicate the pipeline and make this article much longer.

Storing the tweets and keywords on the cloud using MongoDB Atlas.

You will need a MongoDB Atlas account for this step. MongoDB allows you to create a shared cluster for free.

The MongoDB dashboard

Click on Connect > Connect your application > Select Python and copy the connection string.

Add the connection string to your config.ini file created earlier.

[mongodb]
connection_string = CONNECTIONSTRING

Create a database in MongoDB Atlas with two collections: count and tweets.

To connect to MongoDB, we need to use the pymongo library.

from pymongo import MongoClient, errors
from configparser import ConfigParser

#configuration
config = ConfigParser()
config.read('config.ini')

# Connect to MongoDB
def get_database():

    # Provide the MongoDB Atlas url to connect Python to MongoDB using pymongo
    CONNECTION_STRING = config['mongodb']['connection_string']

    # Create a connection using MongoClient
    client = MongoClient(CONNECTION_STRING)

    # Return the database (we will use the same database throughout the tutorial)
    return client['twitter']
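If you want to verify the connection string before inserting anything, MongoDB's ping command is a quick check. A minimal sketch, reusing the config.ini from earlier:

from configparser import ConfigParser
from pymongo import MongoClient

config = ConfigParser()
config.read('config.ini')

client = MongoClient(config['mongodb']['connection_string'])
client.admin.command('ping')  # raises an exception if the cluster is unreachable
print("Connected to MongoDB Atlas")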

We will insert the data using two functions: insert_df_tweets() and insert_df_count(). insert_df_tweets() inserts the tweets and insert_df_count() inserts the keyword counts.

def insert_df_tweets(df):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_tweets = dbname['tweets']
    # Insert the dataframe into the collection
    try:
        collection_tweets.insert_many(df.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate tweets")

def insert_df_count(df2):
    # Get the database
    dbname = get_database()
    # Get the collection
    collection_count = dbname['count']
    # Insert the dataframe into the collection
    try:
        collection_count.insert_many(df2.to_dict('records'), ordered=False)
    except errors.BulkWriteError:
        print("Skipping duplicate values")

Because _id must be unique within a collection, re-inserting tweets or keywords that are already stored raises a BulkWriteError. With ordered=False the remaining documents are still inserted, and we handle the error with a try/except block so duplicates are simply skipped.
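An alternative to catching the error is to upsert each document, so re-running the pipeline updates existing records instead of skipping them. Here is a minimal sketch using pymongo's bulk_write with a hypothetical upsert_df() helper (not what the code above does, just an option):

from pymongo import ReplaceOne

def upsert_df(df, collection):
    # replace the document with the same _id if it exists, otherwise insert it
    operations = [
        ReplaceOne({'_id': record['_id']}, record, upsert=True)
        for record in df.to_dict('records')
    ]
    if operations:
        collection.bulk_write(operations, ordered=False)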

Bringing everything together:

if __name__ == "__main__":
    #fetch tweets in a dataframe
    df = fetch_df_tweets()
    #insert the dataframe into mongodb
    insert_df_tweets(df)
    #chunking tweets
    df2 = chunking(df)
    #insert the dataframe into mongodb (chunks with counts)
    insert_df_count(df2)

Congratulations, you have successfully inserted all the values into your MongoDB collections.
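If you would rather confirm the insert from Python than from the Atlas UI, a quick check could look like this (a minimal sketch reusing get_database() from above):

db = get_database()
print(db['tweets'].count_documents({}))  # number of stored tweets
print(db['count'].find_one())            # one of the keyword-count documents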

Your dashboard should look like this.

Creating data visualisation on MongoDB Atlas.

This is the easiest and most fun part, thanks to the built-in data visualisation features of MongoDB Atlas.

  1. Click on Visualize Your Data in the top-right corner.
  2. Create a new dashboard by clicking on Add Dashboard.
  3. Add a new chart and select your current cluster as the data source.
  4. Select the count collection, as we will be analysing the frequency of keywords.
  5. Add _id to the X-axis and counts to the Y-axis.
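Before building the chart, you can also peek at the most frequent keywords straight from Python as a sanity check (a minimal sketch reusing get_database() from above):

db = get_database()
for doc in db['count'].find().sort('counts', -1).limit(10):
    print(doc['_id'], doc['counts'])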

Final Dashboard

Voila! You created a data visualisation in no time. There is a lot more that can be done to improve it, but I hope this tutorial gave you a basic understanding of how to build dashboards and charts using data harvested from Twitter for feedback analysis.

I hope you found this tutorial helpful. Let me know in the comments if you have any doubts or suggestions to improve the overall process.

Links:

  1. GitHub
  2. MongoDB Atlas
  3. spaCy
  4. Twitter Developer Platform

If you liked my blogs, you could buy me a coffee.
