Building a Powerful NLP Library in Python for 2024

Fareed Khan
Published in Level Up Coding
14 min read · Dec 28, 2023


Consider one of the most stressful scenarios coders face: cleaning large text data. With regex, you need to define different sets of patterns to remove unwanted text, and even then you can't be sure that new garbage data won't appear that also needs removal. Tasks like these are stressful for developers because of the time and effort involved, and there is still uncertainty about whether new data will require the same procedural coding or not.

The recent rise of Large Language Models (LLMs), whether open source or closed source, has given us a new dimension of how text data can be handled. Since LLMs can analyze text data more quickly than we can and can intelligently understand it to a considerable extent, much the way we do, why not perform our NLP tasks with LLMs? This could automate the process and make the coder's life less stressful.

The Core Concept Driving My Library

I tried out several open-source LLMs, such as Mixtral 8x7B and Llama-2-70b-chat-hf, but they never met my expectations. Although good for question answering and text generation, they fall short on NLP tasks. ChatGPT, on the other hand, exceeded my expectations, but it requires paid API access to perform NLP tasks on a custom dataset. Gemini, which is comparable to GPT-4, provides a free API with limited access. I tested it with the help of prompt engineering and found that it can solve almost any NLP task you want to tackle.

Here is a simple visual illustration of how I use the Gemini model to perform NLP tasks on my dataset:

Visual illustration of how my library works (created using Figma)

You can use this concept yourself to create your own NLP library and improve your productivity. Highlighting all the features of my library here would make this blog too long; my GitHub repository contains detailed information about each task and its usage. This blog focuses on the most important features, how to use them, and how to add your own customized NLP tasks.

Installing the Library

First, you need to clone my GitHub repository.

git clone https://github.com/FareedKhan-dev/Most-powerful-NLP-library.git

If you don’t have Git installed on your machine, you can download the repository as a ZIP file.

Downloading the repository as a ZIP file from the GitHub page

Once you have cloned the repository, you need to install the required dependencies that allow you to work with the Gemini API.

# Install the Google Generative AI library
pip install -q -U google-generativeai

Understanding File Structure

You can skip this step; it's only needed if you want to understand the library and how it works. Here is the file structure of the library.

main_directory/
|-- for_beginner/
| |-- preprocessing.ipynb
| |-- core_nlp.ipynb
|-- pre_processing.py
|-- core_nlp.py
|-- code_file.ipynb # Contains examples of each function

The for_beginner folder contains two Jupyter notebooks with code blocks for each NLP task, which will make it easier for you to understand how the library works. The two Python files are meant to be imported as modules and called for whatever task you need.

pre_processing.py contains functions used to preprocess text, such as clean_text, remove_html_tags, etc., while core_nlp.py contains functions for handling text data and performing different tasks, such as summarize_text, translate_text, etc.

Initiating the Library

In the previous step, we cloned the NLP library and installed the required dependencies. Now we need to import the library that makes the Gemini API calls and configure it with our API key.

# Import the Google Generative AI library
import google.generativeai as genai

# Configure the library with your API key
genai.configure(api_key="Your-API-key")

# Initialize the GenerativeModel with 'gemini-pro'
model = genai.GenerativeModel('gemini-pro')

You can obtain your API key from Google AI Studio. Once you have the key, proceed to the next step.

Cleaning the Text

One of the initial and most frustrating steps while handling text data is to clean it. While regex is a powerful way to clean text, here is what an LLM-based library can do.

# Import the clean_text function from the pre_processing module
from pre_processing import clean_text

# User input text
user_input = '''faree$$@$%d khan will arrive at 9:00 AM.
He will@%$ 1meet you at the airport.
He will be driving a black BMW.
His license plate is 123-456-7890.'''

# Clean the text using the specified model
cleaned_text = clean_text(user_input, model)

# Print the cleaned text
print(cleaned_text)
##### OUTPUT OF ABOVE CODE #####

Fareed Khan will arrive at 9:00 AM. He will meet you at the airport.
He will be driving a black BMW. His license plate is 123-456-7890.

##### OUTPUT OF ABOVE CODE #####

We imported the clean_text function from the pre_processing.py module. It takes two inputs: the text to clean and the model you created earlier during the initiation step. With a little prompt engineering behind the function, the LLM can easily understand what should be removed from the input.
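To give a sense of what happens behind the scenes, here is a minimal sketch of how such a prompt-engineered function can be built. This is only an illustration of the idea; the exact prompt and code live in pre_processing.py.

# Minimal sketch of a prompt-engineered cleaning function
# (illustrative only; the exact code in pre_processing.py differs)
def clean_text_sketch(input_text, model):
    # Describe the task and the expected output format in the prompt
    question = f'''Clean the following text by removing garbage characters
and symbols while keeping the meaning intact.
Return only the cleaned text, nothing else.

Text: {input_text}'''

    # Ask Gemini and return its plain-text answer
    response = model.generate_content(question)
    return response.text.strip()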

Performing Lemmatization or Stemming

Lemmatization and stemming are tricky in NLP because words come in different forms and meanings. Deciding the base or root of a word can be hard, especially with irregular words and various contexts.

Multiple libraries, like NLTK, can be used for tasks like lemmatization and stemming, but they often lack the effectiveness of an LLM approach. LLMs, trained on extensive datasets, outperform traditional libraries in capturing language nuances and providing more accurate results.
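For comparison, here is what NLTK's rule-based tools give on a similar sentence. Note that without explicit POS tags, the WordNet lemmatizer treats words as nouns and leaves verb forms like "running" untouched, which is the kind of limitation the LLM approach sidesteps:

# For comparison: NLTK's rule-based lemmatizer and stemmer
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

words = "The cats are running and playing in the gardens".split()

# Without a POS argument, lemmatize() assumes nouns, so verbs pass through unchanged
print([lemmatizer.lemmatize(w.lower()) for w in words])
print([stemmer.stem(w.lower()) for w in words])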

# Import the lemmatize_text and stem_text functions from the pre_processing module
from pre_processing import lemmatize_text, stem_text

# User input text
user_input = '''The cats are running and playing in the gardens,
while the dogs are barking loudly and chasing their tails'''

# Lemmatize the text using the specified model
lemmatized_sentence = lemmatize_text(user_input, model)

# Stem the text using the specified model
stemmed_sentence = stem_text(user_input, model)

# Print the lemmatized and stemmed sentences
print(lemmatized_sentence)
print(stemmed_sentence)
##### OUTPUT OF Lemmatized Sentence #####

The cat be run and play in the garden,
while the dog be bark loud and chase their tail

##### OUTPUT OF Stemmed Sentence #####

the cat ar run and play in the garden,
whil the dog ar bark loud and chas their tail

Both the lemmatizing and stemming functions take two inputs: the text you want to process and the model you initiated earlier. Unlike NLTK, these functions also come in handy for languages that traditional libraries may not support.

Simplifying NER Detection and POS Tagging

NER in NLP is challenging because figuring out where entities start and end can be unclear, and entities often appear in different forms. There are also many entity types, and the task gets even harder with new or evolving entities. On the other hand, LLMs work well for NER because they learn detailed patterns and context from their vast training data, making them good at recognizing many kinds of named entities.

# Import the detect_ner function from the core_nlp module
from core_nlp import detect_ner

# User input text
user_input = "I will meet you at the airport sharp 12:00 AM."

# Specify NER tags (optional, default includes 'person, location, date, number, ...')
ner_tags = 'person, location, date, number, ... cardinal'

# Detect named entities in the text using the specified model and NER tags
ner_result = detect_ner(input_text=user_input, ner_tags=ner_tags, model=model)

# Print the NER result
print(ner_result)
##### OUTPUT OF ABOVE CODE #####

airport: facility
12:00 AM: time

##### OUTPUT OF ABOVE CODE #####

This function requires three inputs: the text for which you need NER tags, the ner_tags specifying the types of entities to extract (with default values like name, organization, etc.), and the model initiated at the beginning. With zero-shot prompt engineering, there's no need to provide specific examples behind the function. Just provide a bit of detail in the prompt, and our function outputs the relevant ner_tags detected in the input.
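For intuition, a zero-shot prompt behind such a function might look like the sketch below; this is assumed wording, and the exact prompt in core_nlp.py may differ.

# Illustrative zero-shot NER prompt (assumed wording, not the library's exact prompt)
ner_tags = 'person, location, date, time, organization'
user_input = "I will meet you at the airport sharp 12:00 AM."

question = f'''Extract the named entities from the input text.
Only use these entity types: {ner_tags}.
Output one entity per line in the format "entity: type".

Input text: {user_input}'''

# The model's reply can then be printed or parsed line by line
print(model.generate_content(question).text)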

Similarly, POS tagging is challenging in NLP as it involves distinguishing between word classes, and certain words may serve multiple grammatical functions based on context. Managing slang, informal language, or domain-specific terms adds complexity to accurately assigning part-of-speech tags.

# Import the detect_pos function from the core_nlp module
from core_nlp import detect_pos

# User input text
user_input = "I will meet you at the airport sharp 12:00 AM."

# Specify POS tags (optional, default includes 'NOUN, verb, adjective, adverb, ...')
pos_tags = 'noun, verb, adjective, adverb, pronoun, ... entity_phrase'

# Detect part-of-speech in the text using the specified model and POS tags
pos_result = detect_pos(input_text=user_input, pos_tags=pos_tags, model=model)

# Print the POS result
print(pos_result)
##### OUTPUT OF ABOVE CODE #####

I: pronoun
will: verb
meet: verb
you: pronoun
at: preposition
the: determiner
airport: noun
sharp: adverb
12:00: time
AM: time
.: punctuation

##### OUTPUT OF ABOVE CODE #####

This function also takes three inputs, as you may have noticed in the code. An important point: there are more than 50 default POS tags, which is sufficient for a detailed, tag-based extraction of words. However, if you need a tag that is not among the defaults, you can simply add it to suit your specific case.
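For example, you can append a custom tag to the tag string and call the function again (date_expression here is a hypothetical tag, not one of the defaults):

# Appending a hypothetical custom tag to the POS tag string
pos_tags = 'noun, verb, adjective, adverb, pronoun, date_expression'

# The model will now also label spans that match the custom tag
pos_result = detect_pos(input_text=user_input, pos_tags=pos_tags, model=model)
print(pos_result)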

Text Pattern Matching

When extracting patterns like emails or numbers from text data, defining patterns using regex or other libraries is a common approach. However, the LLM has a significant advantage due to its training on large text data. Unlike regex, you just need to name the pattern you want to extract, such as email or phone number, making it more convenient.

# Import the extract_patterns function from the pre_processing module
from pre_processing import extract_patterns

# User input text
user_input = '''The phone number of fareed khan is 123-456-7890 and 523-456-7892. Please call for assistance and email me at x123@gmail.com'''

# Define patterns for extraction
pattern_matching = '''email, phone number, name'''

# Extract patterns from the input text using the specified model and patterns
extracted_patterns = extract_patterns(user_input, pattern_matching, model)

# Print the extracted patterns
print(extracted_patterns)
##### OUTPUT OF ABOVE CODE #####

type: python list

['123-456-7890', '523-456-7892', 'x123@gmail.com', 'fareed khan']

##### OUTPUT OF ABOVE CODE #####

While extracting emails and phone numbers may not pose significant challenges, patterns like disease codes or license plates can require extra effort, and regex written for them may break on new data. With the extract_patterns function, you only need to provide a comma-separated list of the patterns you want to identify, and Gemini handles the rest.

Although there are many text preprocessing features that I have created in this library, I recommend visiting my GitHub repository to explore the full list of functionalities I have introduced.

Text Classification

I have introduced three features for text classification tasks:

  1. Sentiment Analysis
  2. Topic Classification
  3. Spam Detection

Sentiment analysis, by default, includes three main categories: positive, neutral, and negative. However, you have the flexibility to specify more detailed categories based on your preferences.

# Import the analyze_sentiment function from the core_nlp module
from core_nlp import analyze_sentiment

# User input text
user_input = "I love to play football, but today I am feeling very sad. I do not want to play football today."

# Specify sentiment categories (optional, default includes 'positive, negative, neutral')
category = "positive, negative, neutral"

# Analyze sentiment in the text using the specified model and sentiment categories
sentiment_result = analyze_sentiment(input_text=user_input, category=category, explanation=True, model=model)

# Print the sentiment result
print(sentiment_result)
##### OUTPUT OF ABOVE CODE #####

Category: Negative

Short Explanation:
The overall sentiment of the text is negative.
The author expresses a love for football but then
goes on to say that they are feeling very sad and
do not want to play football today. This indicates
a negative sentiment towards the activity of playing football.

##### OUTPUT OF ABOVE CODE #####

You can customize the categories to your preferences and set the explanation parameter to True or False depending on whether you want a justification for the answer. The rest of the input parameters are the same as in the other functions.
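For instance, you could swap the defaults for a finer-grained scale, and skip the justification since explanation controls whether one is returned:

# Using a finer-grained, custom sentiment scale
category = "very positive, positive, neutral, negative, very negative"

# explanation=False returns only the category, without a justification
sentiment_result = analyze_sentiment(input_text=user_input, category=category, explanation=False, model=model)
print(sentiment_result)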

For topic classification, you need to set the topics yourself. The explanation parameter is used to justify the answer and explain why it fits into a particular topic category.

# Import the classify_topic function from the core_nlp module
from core_nlp import classify_topic

# User input text
user_input = "I love to play football, but today I am feeling very sad. I do not want to play football today."

# Specify topics (optional, default includes 'story, horror, comedy')
topics = "story, horror, comedy"

# Classify the topic of the text using the specified model and topics
topic_result = classify_topic(input_text=user_input, topics=topics, explanation=True, model=model)

# Print the topic result
print(topic_result)
##### OUTPUT OF ABOVE CODE #####

Category: Story

Short Explanation:
The input text is a story about a person who loves to play football
but is feeling sad and does not want to play today.
The text does not contain any elements of horror or comedy,
so the topic is classified as story.

##### OUTPUT OF ABOVE CODE #####

Similarly, spam detection has three default categories: spam, not_spam, and unknown. It also includes an explanation parameter to justify why Gemini has chosen a particular category.

# Import the detect_spam function from the core_nlp module
from core_nlp import detect_spam

# User input text
user_input = "you have just won $14000, claim this award here at this link."

# Specify spam categories (optional, default includes 'spam, not_spam, unknown')
category = 'spam, not_spam, unknown'

# Detect spam in the text using the specified model and spam categories
spam_result = detect_spam(input_text=user_input, category=category, explanation=True, model=model)

# Print the spam result
print(spam_result)
##### OUTPUT OF ABOVE CODE #####

Category: spam

Short Explanation:
The message contains the promise of a large monetary reward,
which is a classic tactic used by spammers to attract attention
and entice people to click on the link.

##### OUTPUT OF ABOVE CODE #####

Of the three text classification tasks, topic classification is the one where you will usually want to define your own topics, while the other tasks ship with default values commonly used for categorization.

Semantic Role Labeling (SRL)

Figuring out what role each word plays in a sentence, known as Semantic Role Labeling (SRL), can be tough in NLP. It gets tricky because of varied sentence structures and the different roles a word can take depending on context. Large language models (LLMs) are good at understanding these details, like who did what in a sentence.

# Import the perform_srl function from the core_nlp module
from core_nlp import perform_srl

# User input text
user_input = "tornado is approaching the city, please take shelter"

# Perform Semantic Role Labeling (SRL) on the text using the specified model
srl_result = perform_srl(user_input, model)

# Print the SRL result
print(srl_result)
##### OUTPUT OF ABOVE CODE #####

Predicate: approach
Roles:
- Agent: tornado
- Theme: city

##### OUTPUT OF ABOVE CODE #####

It identifies two important components: the predicate, which expresses what the subject is doing or being, and the roles, such as the agent and the theme. The function takes only two inputs: the text data and the model.

Intent Recognition

Intent recognition in NLP involves identifying the purpose or goal behind a user’s input, like understanding if they’re asking a question or making a request. This is crucial for enhancing user interactions with applications, as it enables systems to comprehend user intentions and respond appropriately, creating more effective and personalized user experiences.

# Import the recognize_intent function from the core_nlp module
from core_nlp import recognize_intent

# User input text
user_input = "tornado is approaching the city, please take shelter"

# Recognize intent in the text using the specified model
intent_result = recognize_intent(user_input, model)

# Print the intent result
print(intent_result)
##### OUTPUT OF ABOVE CODE #####

Intent: Emergency alert

##### OUTPUT OF ABOVE CODE #####

It accurately identifies the intent behind the text, understanding the user’s input intention. This function takes the same two inputs as seen earlier: the text data and the model.

Handling Large Data

Up until now, we've worked with relatively small text data, like short sentences. If you need to handle larger texts, one approach (not yet implemented in the library) is to break your data into chunks and process them one by one. Here's an example of how to work with a bigger dataset.

# Import the detect_ner function from the core_nlp module
from core_nlp import detect_ner

# Read the example text dataset from a file
with open("some_big_text_file.txt", "r", encoding="utf-8") as f:
    text_dataset = f.read()

# Break the text into sentences based on full stops
sentences = text_dataset.split('. ')

# some ner_tags you have defined
ner_tags = "person, organization ..."

# Apply NER to each sentence
for i, sentence in enumerate(sentences):
    print(f"Sentence {i + 1}:")
    print(detect_ner(input_text=sentence, ner_tags=ner_tags, model=model))

Another approach to handling larger data is to break it into bigger chunks, for example 500 sentences per chunk, to preserve more of the dataset's context. For a text summarization task, you can then combine the summaries of each chunk and summarize them again to generate one detailed summary of the entire text, as shown in the sketch below.

Visual Illustration of how to handle large text data

There are several ways to handle big data, but the approaches I’ve just shared are among the most common and practical.
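Here is a minimal sketch of the chunk-and-combine summarization idea, reusing the sentences list from the snippet above and assuming summarize_text takes the same (text, model) inputs as the other functions in core_nlp.py:

# Import the summarize_text function from the core_nlp module
from core_nlp import summarize_text

# Group sentences into chunks of roughly 500 sentences each
chunk_size = 500
chunks = ['. '.join(sentences[i:i + chunk_size])
          for i in range(0, len(sentences), chunk_size)]

# Summarize each chunk, then summarize the combined summaries
chunk_summaries = [summarize_text(chunk, model) for chunk in chunks]
final_summary = summarize_text(' '.join(chunk_summaries), model)
print(final_summary)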

Customizing the Library

Customizing the library means adding your own functions, and a well-crafted prompt is essential for making them work. For instance, if you want to create a paraphrase-detection function, you start with a prompt for the paraphrasing task.

# Question to be asked for determining paraphrasing
question = f'''Given the input text, determine if two sentences are paraphrases of each other.
Sentence 1: {user_input[0]}
Sentence 2: {user_input[1]}
Answer must be 'yes' or 'no'.
{explanation}
'''

When creating a customized function, it's crucial to describe the expected output in the prompt so that Gemini's answers stay consistent across runs. Defining the answer format is equally essential; for instance, in tokenization you may specify that the output should be a list and later convert the string representation into an actual list using the ast module, as shown in the sketch below. In the paraphrasing task, the rest of the prompt stays relatively constant; what changes is how many sentences you want to compare (two in this example).
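For instance, here is a minimal sketch of that conversion using ast (the response string below is made up for illustration):

# Convert the model's string reply into an actual Python list
import ast

# Example of what a tokenization reply might look like as a raw string
response_text = "['The', 'cats', 'are', 'running']"

tokens = ast.literal_eval(response_text)
print(type(tokens))  # <class 'list'>
print(tokens)        # ['The', 'cats', 'are', 'running']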

Once you create your prompt, you can build a function on top of it.

# Function for paraphrase detection
def paraphrasing_detection(input_text, explanation, model):

    # Check if an explanation is required
    explanation_text = 'short explanation: ' if explanation else 'no explanation'

    # Question to be asked for determining paraphrasing
    question = f'''Given the input text, determine if two sentences are paraphrases of each other.
Sentence 1: {input_text[0]}
Sentence 2: {input_text[1]}
Answer must be 'yes' or 'no'.
{explanation_text}
'''

    # Generate the response from Gemini
    response = model.generate_content(question)
    return response.text.strip()

You can easily call that function on top of your text data.

# Import the paraphrasing_detection function from the core_nlp module
from core_nlp import paraphrasing_detection

# User input text
user_input = ['''The sun sets in the west every evening.''', '''Every evening, the sun goes down in the west.''']

# Perform paraphrasing detection using the specified model
paraphrase_result = paraphrasing_detection(user_input, explanation=True, model=model)

# Print the paraphrasing detection result
print(paraphrase_result)
##### OUTPUT OF ABOVE CODE #####

Answer: yes
Short Explanation: Both sentences express the same idea that the sun
sets in the west every evening. They use different words to convey
the same meaning, such as "sets" and "goes down" for the verb and
"every evening" for temporal modifier.

##### OUTPUT OF ABOVE CODE #####

What’s Next

There are many more features in this library; this is just a glimpse of how LLMs can reshape NLP tasks and simplify the handling of text data. Explore the full potential by checking out my GitHub repository, which includes features like generating embeddings for cosine similarity, text summarization, and more. Feel free to adapt the library to your specific domain, whether medical or anything else. I hope you enjoyed reading this blog.
