Create a Self-Moderated Comment System with LLAMA-2 and LangChain

We’re going to create a self-moderated comment response system, using two LLAMA-2 models chained with LangChain. This way, we’ll prevent users from trolling our comment system and turning us into a meme.

Pere Martra
Level Up Coding


This article is part of a free course about Large Language Models available on GitHub.

I’ve rewritten the article and the supporting notebook to utilize the new LCEL (LangChain Expression Language) instead of Chains.

We’ll simulate responding to comments posted in a hypothetical opinion or support forum for a company.

The solution is based on separating the model responsible for posting the response from the user’s input. In other words, the model that approves and modifies the response has not read the comment it is responding to. This way, we isolate this model from possible prompt engineering attacks or direct user attacks.

The steps our LangChain chain will follow to keep our moderation system from going off the rails, or getting impolite, are as follows (a minimal sketch of the flow comes right after the list):

  • The first model reads the user’s input.
  • It generates a response.
  • A second model analyzes the response.
  • If necessary, it modifies the response and finally publishes it.
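Before diving into LangChain, here is a minimal, purely illustrative sketch of that flow in plain Python. The two functions are hypothetical stand-ins for the two LLAMA-2 calls we will wire together below.

#Purely illustrative sketch of the moderated flow; both functions are
#hypothetical stand-ins for the real model calls built later with LangChain.

def generate_draft(user_comment: str) -> str:
    #First model: the only one that reads the raw user input.
    return f"Draft answer to: {user_comment}"

def moderate(draft: str) -> str:
    #Second model: never sees the user input, only the draft response.
    return f"Polite version of: {draft}"

def respond_to_comment(user_comment: str) -> str:
    return moderate(generate_draft(user_comment))

print(respond_to_comment("Your product is terrible!"))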

The source code and the model.

The article is based on a notebook where I used the model meta-llama/Llama-2-7b-chat-hf. However, there are two more notebooks that implement the same solution. In one of them, I’m using the EleutherAI/gpt-j-6b model, available on Hugging Face, and in the other, models from OpenAI that can be accessed via API.

All three notebooks can be found in the GitHub repository of the Large Language Models course.

Using LLAMA-2 from Hugging Face.

Llama-2 is an open-source model, but Meta requires us to register to be granted access to this family of models. The request is made through a Meta web page, which can be accessed from the model’s homepage on Hugging Face.

Meta Page to request Access to LLAMA

The request is made for all models in the Llama-2 family. This means that we will be granted access to any of the model sizes. It’s mandatory to use the same email in the request as the one associated with our Hugging Face account.

Fortunately, the approval process doesn’t take too long. In my case, I received the confirmation mail in just a few minutes.

Once you’ve received confirmation of access, you will be able to access the model from the Hugging Face website. Remember that you need to be registered with the same email address that you used to request permission on the Meta page.

Install and load the necessary libraries.

The notebook is hosted on Colab, but since I am subscribed to Colab Pro, you may not be able to run it in that environment unless you also have a Colab Pro subscription.

If you encounter any issues, the notebook is ready to be run in a CUDA environment on a local machine or on a Mac with a Silicon chip.

In either case, the calls to LLAMA-2 are not as immediate as with the OpenAI API. Each model call can take several minutes, depending on your GPU.

We are going to install the necessary libraries in Colab. If you are working in your own environment and have already worked with Large Language Models, you probably have langchain and transformers installed already.

#Install the LangChain, transformers and supporting libraries.
!pip install -q langchain==0.1.4
!pip install -q transformers==4.37.1
!pip install -q accelerate==0.26.1
!pip install -q xformers==0.0.23

The transformers library is maintained by Hugging Face and provides access to a multitude of open-source models and tools to work with them. It’s an essential library that underpins the entire open-source Large Language Models revolution. Langchain, on the other hand, is a more recent addition and is the library that will allow us to link models together or with different tools.

Now we can import all the necessary libraries.

from langchain import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain_core.output_parsers import StrOutputParser

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

import torch
from torch import cuda, bfloat16

I’ve separated them into three blocks: the classes from the LangChain libraries, the ones from transformers, and finally some torch utilities. Let’s take a quick look at each of them:

  • PromptTemplate: Allows us to create a prompt template formed by a text containing variables. The PromptTemplate will replace the values of these variables with the ones it receives from us, helping us build dynamic prompts whose result will be passed to the model.
  • Transformers: The classes we import here are used to load the model and its tokenizer, and to create the pipeline. The tokenizer transforms text into tokens that the model can understand, and it also works in the reverse direction, turning the tokens returned by the model into text the user can understand. The pipeline class allows us to use the models for the specific task they were pretrained for.
  • StrOutputParser: The parser needed to transform the model’s response into plain text.
  • cuda, bfloat16: These are necessary for loading the model onto the GPU and improving its performance.

Loading LLAMA-2 from Hugging Face.

Loading LLAMA-2 is a bit special and different from most of the models available on Hugging Face. Since it’s a model for which we had to get permission to access, we need to be logged into our Hugging Face account to use it.

To log in to the Hugging Face environment, you will need an Access Token. You can obtain it from the Settings option in your Hugging Face profile, under the Access Tokens section.

%pip install huggingface_hub

hf_key = "YOUR-HF-KEY-HERE"
!huggingface-cli login --token $hf_key

With this code, we install the huggingface_hub library, which is needed to log into Hugging Face with our access token.
I’d like to remind you that the email of the Hugging Face account must be the same one used for the request on the Meta page.
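If you prefer to stay inside Python instead of using the CLI, the huggingface_hub library also exposes a login() function. A minimal sketch, reusing the hf_key variable defined above:

#Programmatic alternative to the huggingface-cli login shown above.
from huggingface_hub import login

login(token=hf_key)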

#On a Mac with Apple Silicon the device must be 'mps'
# device = torch.device('mps') #to use with MAC Silicon
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

To run on Colab or machines with NVIDIA GPUs, leave the code as it is. If you want to load the model on a Mac with a Silicon chip, you need to use the statement that loads the ‘mps’ device.
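If you want a single cell that covers all three cases, here is a small sketch that auto-detects the device, using the torch imports from earlier and assuming a PyTorch build recent enough to expose the MPS backend:

#Pick the best available device: NVIDIA GPU, Apple Silicon GPU, or CPU.
if cuda.is_available():
    device = f'cuda:{cuda.current_device()}'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(device)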

Now, let’s load the model.

#You can try with any llama model, but you will need more GPU and memory as you increase the size of the model.
model_id = "meta-llama/Llama-2-7b-chat-hf"

The model I selected is Llama-2-7b-chat-hf, which is the 7b version of Llama-2 pretrained to perform well in chat scenarios. You can try any model from the LLAMA family, just keep in mind that if you choose larger models, you’ll need more memory and preferably more processing power.
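If you do try a larger checkpoint, loading the weights in bfloat16 roughly halves the memory needed compared with full precision. A sketch of what that could look like: the 13b id is the next size up in the same family, torch_dtype is a standard from_pretrained argument, and this call would replace the from_pretrained call shown a bit further down, not run in addition to it.

#Sketch: loading a larger Llama-2 checkpoint in half precision to save memory.
bigger_model_id = "meta-llama/Llama-2-13b-chat-hf"

bigger_model = AutoModelForCausalLM.from_pretrained(
    bigger_model_id,
    device_map='auto',
    torch_dtype=bfloat16,   #bfloat16 was imported from torch earlier
    use_auth_token=hf_key
)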

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_key
)

The LLAMA family comes with a pre-configuration stored on Hugging Face, which we need to retrieve and later pass to the model loading call. Most models available on Hugging Face don’t require you to retrieve this configuration explicitly, so if you have previous experience with other models, you might not be familiar with this way of working. Don’t worry; it’s the only difference we encounter.
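Before using it, we can take a quick look at what the retrieved configuration contains. A small sketch printing some of the standard LlamaConfig fields:

#A few of the fields stored in the pre-configuration we just retrieved.
print(model_config.model_type)          #'llama'
print(model_config.hidden_size)         #4096 for the 7b model
print(model_config.num_hidden_layers)   #32 for the 7b model
print(model_config.vocab_size)          #32000, the tokenizer vocabulary size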

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',
    use_auth_token=hf_key
)
model.eval()

In the first two parameters, we specify the name of the model to load, which is stored in the variable model_id, and the configuration we retrieved earlier by calling transformers.AutoConfig.from_pretrained.

With device_map='auto', we instruct it to use the most suitable device available. If we wanted to force it to use a specific GPU, we could pass its assigned number. To check which slot our GPU occupies, we just need to print the content of device, and we’ll see its name and position.

device

‘cuda:0’

Now we have the model loaded on the GPU and stored in the variable model.
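A couple of optional sanity checks at this point: transformers can report the memory footprint of the loaded weights, and when device_map='auto' is used, accelerate records where each part of the model was placed.

#How much memory the loaded weights occupy.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

#Where accelerate placed the model; {'': 0} means everything is on GPU 0.
print(model.hf_device_map)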

The next steps are to load the tokenizer and create the pipeline. Since loading the tokenizer can be very time consuming, I prefer to load it in a separate cell in the notebook, so that I don’t need to execute it again if I want to make changes to the pipeline.

tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          use_auth_token=hf_key)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    temperature=0.3,
    repetition_penalty=1.1,
    return_full_text=True,
    device_map='auto'
)

assistant_llm = HuggingFacePipeline(pipeline=pipe)

This way, we create the pipeline that will execute the model. As you can see, we are providing it with the task: text-generation, the model, and the tokenizer. These parameters don’t need much explanation; let’s take a look at the others:

  • max_new_tokens: Indicates the maximum length of the text generated by the model. It may stop before reaching this number, but it will not exceed it.
  • temperature: This controls the randomness of the response generated by the model. The higher the value, the greater the variety. If you are using the model for code-generation tasks, you can keep it at 0, and if your use case requires highly imaginative but not necessarily accurate responses, you can try higher values like 1.0. I’ve kept it at 0.3 to allow the model to provide slightly more imaginative responses. Experiment with different values, and you’ll see how responses vary more with higher values.
  • repetition_penalty: In some cases, models can get stuck in a loop while generating a response, resulting in endless and nonsensical conversations until max_tokens is reached. This parameter prevents this by penalizing word repetition.
  • return_full_text: To work properly with LangChain, we need the model to return the complete response without truncation or cropping.

With this, assistant_llm will contain our pipeline ready to be used.
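Before chaining anything, it can be useful to call the raw transformers pipeline directly as a quick sanity check that the model generates text. A minimal sketch, with a made-up prompt:

#Quick sanity check of the raw pipeline, before wrapping it in LangChain.
raw_output = pipe("Write a one-sentence greeting for a support forum.")
print(raw_output[0]["generated_text"])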

Creating our first chain.

Our first chain will be responsible for elaborating an initial response to the user’s comment. It will contain a prompt and the pipeline we just loaded.

First, we will construct the prompt using a template and the variables we provide, and then pass it to the model, which will execute the prompt’s instructions.

# Instructions for how the LLM must respond to the comments.
assistant_template = """
[INST]<<SYS>>You are a {sentiment} assistant that responds to user comments,
using similar vocabulary to the user.
Stop answering after you answer the first user comment.<</SYS>>

User comment:{customer_request}[/INST]
assistant_response
"""

#Create the prompt template to use in the chain for the first model.
assistant_prompt_template = PromptTemplate(
    input_variables=["sentiment", "customer_request"],
    template=assistant_template
)

The text of the prompt is contained in the variable assistant_template. As you can see, it contains two parameters: sentiment and customer_request. The sentiment parameter sets the personality the assistant will adopt when creating the responses. The customer_request parameter contains the text to which the assistant should respond.

The prompt template is created using PromptTemplate, previously imported from the langchain library. This template receives the input parameters, which, along with the received text, will form the prompt to be sent to the model.
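A handy way to check exactly what the model will receive is to render the template yourself. A small sketch using PromptTemplate’s format method with a made-up comment:

#Render the template to inspect the final prompt before sending it to the model.
print(assistant_prompt_template.format(
    sentiment="nice",
    customer_request="Your product stopped working after two days."
))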

Now we can create the first chain with LangChain. As I mentioned earlier, it will only link the prompt template with the model. In other words, it will receive the parameters, use assistant_prompt_template to construct the prompt, and once constructed, pass it to the model.

#OLD CODE USING CHAIN
#assistant_chain = LLMChain(
#    llm=assistant_llm,
#    prompt=assistant_prompt_template,
#    output_key="assistant_response",
#    verbose=False
#)

#NEW CODE USING LCEL
output_parser = StrOutputParser()
assistant_chain = assistant_prompt_template | assistant_llm | output_parser

I’ve kept the code using LLMChain as a comment, for comparison with the new LCEL syntax. With LCEL, we create a chain in a manner quite similar to how pipelines are created in the Unix shell.

This chain will be the first part of our small comment system. We will use it alongside the chain containing the second model, which is responsible for moderating the responses of the first one. But we can also run it independently.
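Since LCEL chains are standard Runnables, they also expose .batch() and .stream() in addition to .invoke(). As an aside, a small sketch answering two made-up comments in a single call:

#LCEL chains also support batch execution over several inputs.
responses = assistant_chain.batch([
    {"customer_request": "The delivery arrived two weeks late.", "sentiment": "nice"},
    {"customer_request": "I love the new update, great job!", "sentiment": "nice"},
])
for response in responses:
    print(response)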

Let’s run a couple of tests by executing the assistant chain on its own. To do this, I will create a function that receives the sentiment and the user text, encapsulating the call to the chain’s invoke method.

#Support function to obtain a response to a user comment.
def create_dialog(customer_request, sentiment):
    #Calling the .invoke method of the chain created above.
    assistant_response = assistant_chain.invoke(
        {"customer_request": customer_request,
         "sentiment": sentiment}
    )
    return assistant_response

#This is the customer comment in the forum moderated by the agent.
#Feel free to update it.
customer_request = """Your product is a piece of shit. I want my money back!"""

#Our assistant working in 'nice' mode.
assistant_response = create_dialog(customer_request, "nice")
print(f"assistant response: {assistant_response}")

assistant response: “Sorry to hear that you’re not satisfied with our product! Can you tell me more about what you don’t like? Maybe we can help resolve the issue or provide a refund. Your feedback is important to us.”

#Our assistant running in rude mode.
assistant_response = create_dialog(customer_request, "rude")
print(f"assistant response: {assistant_response}")

assistant response: “Sorry to hear that you’re not satisfied with our product! Can you tell us more about what you don’t like? We value your feedback and would be happy to make it right. Please DM us for a refund or to discuss further.”

Indeed, the two responses obtained are very similar and entirely publishable. It’s evident that Llama-2 is a modern model that has been trained to provide polite responses.

However, the style of the second response is slightly more formal.

Creating the Moderator Chain.

Just like with the assistant, we need to create a prompt template for this chain. However, this time it will only receive one parameter: the response generated by the first model.

#The moderator prompt template
moderator_template = """
[INST]<<SYS>>You are the moderator of an online forum, you are strict and will not tolerate any negative comments.
You will receive an original comment and if it is impolite you must transform it into a polite one.
Try to maintain the meaning when possible.<</SYS>>

Original comment: {comment_to_moderate}[/INST]
"""

# We use the PromptTemplate class to create an instance of our template that will
# use the prompt from above and store the variables we will need to input when
# we make the prompt.
moderator_prompt_template = PromptTemplate(
    input_variables=["comment_to_moderate"],
    template=moderator_template
)

The prompt is longer, but the mechanics are the same: a text filled with parameters. In this case, the parameter is a sentence that will be the response from the first chain.

moderator_llm = assistant_llm

#We build the chain for the moderator.
#OLD CHAIN CODE
#moderator_chain = LLMChain(
# llm=moderator_llm, prompt=moderator_prompt_template, verbose=False
#)

#NEW LCEL CODE
moderator_chain = moderator_prompt_template | moderator_llm | output_parser

Now we can execute this second chain and pass it the result we obtained from running the first one.

# To run our chain we use the .invoke() method
moderator_says = moderator_chain.invoke({"comment_to_moderate": assistant_response})

print(f"moderator_says: {moderator_says}")

moderator_says: “Thank you for sharing your thoughts on our product! We appreciate your feedback and are always looking for ways to improve. Your input is invaluable to us. If you could provide more details about what you liked or disliked, we would greatly appreciate it. Thank you again!”

A response that might be a bit too formal. But it’s what we asked for, because it’s probably what the company wants ;-)

Creating the moderator by joining the two LangChain chains.

Let’s put the two chains to work together, and build our system.

assistant_moderated_chain = (
    {"comment_to_moderate": assistant_chain}
    | moderator_chain
)

If you notice, the output of the first chain is assigned to comment_to_moderate, which matches the parameter expected by the moderator’s prompt template in the second chain. This allows the result of the first chain to be passed automatically to the second one when we combine them.

To create the chain that links the two models, we need to merge both chains containing the prompts and models.
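For reference, the dict on the left of the pipe is shorthand for a RunnableParallel. An equivalent, more explicit way to write the same composition would be something like this:

#Equivalent, more explicit form of the combined chain.
from langchain_core.runnables import RunnableParallel

assistant_moderated_chain = (
    RunnableParallel(comment_to_moderate=assistant_chain)
    | moderator_chain
)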

Let’s test this chain.

# We can now run the combined chain with .invoke()
assistant_moderated_chain.invoke({"sentiment": "rude", "customer_request": customer_request})

Original comment: “Sorry to hear that you’re not satisfied with our product! Can you tell us more about what you don’t like? Maybe we can help resolve the issue or provide a refund. No need to be rude, let’s work together to find a solution.”

Finished chain. “Thank you for sharing your thoughts on our product! We appreciate your feedback and will do our best to address any concerns you may have. Your input helps us improve and provide better products in the future. Let’s work together to find a resolution!”

That’s great! The moderation has worked perfectly! In the original comment, the one from the first model, there was a slight out-of-line remark, and the model told the customer that there was no need to be rude. The second model noticed this and changed the comment without altering its meaning.

Conclusion and continuing the path.

The process of creating our automated comment moderation system has been quite straightforward, and the result is more than satisfactory.

The main components of the moderation system are the two chains, each formed by a model and a prompt template. These are very simple chains that essentially build a prompt and pass it to the model.

Once we have these two chains, we only need to combine them, making sure that the output of the first chain matches what the second one expects as input.

With these simple steps, we’ve created a system that automatically responds to users and is much safer than allowing a single model to respond without any kind of moderation.

Keep in mind that even ChatGPT can be hacked into providing responses that are not politically correct. We all have in mind the example of BadGPT, which managed to generate highly inappropriate responses.

In summary, by separating the model responsible for final responses from the user input, we significantly reduce the chances of obtaining rude or inappropriate responses from our system.

If you want to continue exploring, don’t hesitate to use the notebook and run it multiple times to see the different responses it generates. Modify the prompt, change the models used, try creating user inputs that challenge the system, adjust the pipeline parameters. Basically, play around and customize the notebook as much as you’d like.

Remember that in the Large Language Models course available on GitHub, you can find two more notebooks where I’ve implemented the same system: one with OpenAI models and another, simpler one, using a model from the Hugging Face Hub.

Thanks for reading! I hope you had a great time.

Resources:

The full course about Large Language Models is available on GitHub. To stay updated on new articles, please consider following the repository or starring it. This way, you’ll receive notifications whenever new content is added.

This article is part of a series where we explore the practical applications of Large Language Models. You can find the rest of the articles in the following list:

Large Language Models Practical Course


I write about Deep Learning and AI regularly. Consider following me on Medium to get updates about new articles. And, of course, you are welcome to connect with me on LinkedIn.

