Deploying a Production-Ready RAG Server: A Comprehensive Guide with LlamaIndex

Marco Bertelli · Published in Level Up Coding · 11 min read · Mar 27, 2024

Deploy to production

Previous Articles:

  • Intro: here
  • Choose the Model: here
  • Flask Server Setup: here
  • ChatEngine Construction: here
  • Advanced RAG performance: here
  • Dynamic Sources: here
  • Frontend Construction: here

Big News: Before publishing this update, I’ve ensured that all sample repositories are now updated to the latest version of LlamaIndex!

In our previous articles, we delved into constructing a fully production-ready chat system using LlamaIndex and Python. If you’ve missed any of those, be sure to catch up before diving into this segment.

Now, while we’ve successfully developed a robust system, it’s crucial to take the next step: deployment in the cloud. A production-ready system isn’t truly operational until it’s deployed and capable of handling real-world traffic. By “production traffic,” we mean that the code, which has so far been tested in our local environment, must perform reliably when accessed by multiple users simultaneously.

Imagine launching your fantastic Chatbot online, only to have it crash when 20 users try to engage with it concurrently. That scenario is far from ideal. We need a server infrastructure that can gracefully handle numerous user requests simultaneously without compromising performance or stability.

Our manually managed server

Fortunately, building a server from scratch is not necessary. In 2024, numerous providers offer a wide range of production-ready configurations. For this series, we’ve chosen Heroku. Why Heroku over other options?

Well, it’s simple to use and provides a solid, well-managed platform. Additionally, Heroku offers a plethora of plugins with seamless integration, requiring zero code adjustments. In my opinion, this simplicity is crucial at the outset of a project. It allows you to focus on product development rather than server management. Plus, when you’re ready, transitioning away from Heroku is straightforward.

Let’s get started. First, you’ll need a Heroku account. Once you’ve created an account and logged in, navigate to the top-right corner and click the “New” button:

The position of the “New” button

After selecting the “New” button, you’ll be prompted to choose a name for your server and the region where it will be hosted. Once you’ve made these selections, it’s time to configure your server.

Begin by navigating to the “Resources” tab. Here, you’ll choose the size of your server. For demo or development servers, opt for the “Eco” or “Basic” tiers. These options provide sufficient resources for testing and experimentation.

However, if you’re setting up a production server, it’s essential to select a higher tier based on the expected user load of your application. Consider factors such as the number of concurrent users and the complexity of your application. Choosing an appropriate tier ensures that your server can handle the demands of production traffic effectively.

Tiers

Once you’ve configured the server size, the next step is to set up Papertrail.

Papertrail allows you to view real-time logs of your application, providing the same debugging experience as your local environment. You can find Papertrail in the “Add-ons” section of the same tab. Simply search for it, and the basic plan, which offers essential features, is free!

After adding Papertrail, navigate to the “Settings” tab. Here, you’ll add the Python buildpack, which Heroku needs in order to build and run Python applications.

Additionally, you’ll find the public URL of your application in this tab. This URL is crucial as it allows you to access and test your server using tools like POSTMAN in a manner similar to your local environment. With this setup, you’ll be able to seamlessly develop and debug your application on Heroku.
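For instance, once the app is up you can hit that public URL with a short script instead of Postman. Here is a minimal sketch; the app name and the /health route are placeholders, since the real endpoints are the ones defined in our Flask server in the earlier articles:

import requests

# Quick smoke test against the deployed server (illustrative only):
# replace the app name and route with your own Heroku app and endpoints.
BASE_URL = "https://<your-app-name>.herokuapp.com"

response = requests.get(f"{BASE_URL}/health")  # hypothetical health route
print(response.status_code, response.text)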

Buildpack and URL

The Heroku Python buildpack provides a seamless deployment experience for Python web applications. Under the hood, this buildpack utilizes a specific WSGI (Web Server Gateway Interface) server.

But what exactly is a WSGI server? In essence, it’s a framework designed to handle the web server side of the WSGI interface, allowing Python web applications to run smoothly on the internet. A WSGI server comes with a multitude of pre-configured settings, making it easier to serve your code over the web.

In practical terms, when you deploy your Python application on Heroku using the Python buildpack, it automatically configures and sets up the WSGI server for you. This ensures that your application can handle HTTP requests and interact with web clients effectively.
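To make the interface concrete, here is a minimal sketch of the callable contract that WSGI defines; our real application is the Flask app object, which already implements this interface for us:

# A bare-bones WSGI application: a callable that receives the request
# environment and a start_response function, and returns the body bytes.
def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from a plain WSGI app"]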

For more detailed information on WSGI servers and how they work, you can explore the link provided previously.

WSGI basic workflow

Now that we’ve configured our Heroku environment, it’s time to deploy our code. But how do we do it? Do we need to download the Heroku CLI and push the code manually every time? Not necessarily.

Instead, we can set up a Continuous Integration and Continuous Deployment (CI/CD) pipeline to automate the deployment process. CI/CD involves a series of steps taken to deliver a new version of software seamlessly.

Fortunately, most Git online providers offer their own pipeline systems for CI/CD. For this guide, we’ll use one of the most popular ones: GitHub Actions. With GitHub Actions, we can automate the deployment of our code to Heroku every time we push changes to our repository.

By setting up this CI/CD pipeline, we ensure that our deployment process is efficient, reliable, and consistent, allowing us to focus more on developing our application rather than worrying about manual deployment tasks.

Basic CI/CD pipeline workflow

Let’s begin building our CI/CD pipeline. In your repository, add a folder named .github. Within this folder, create another folder named workflows. Finally, create a file called main.yml inside the workflows folder, and write the following code:

name: Deploy

on:
  push:
    branches:
      - master

jobs:
  heroku-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3
      - name: Deploy to Heroku
        uses: akhileshns/heroku-deploy@v3.12.12
        with:
          heroku_api_key: ${{ secrets.HEROKU_API_KEY }}
          heroku_app_name: ${{ secrets.HEROKU_APP_NAME }}
          heroku_email: marcobert37@gmail.com

The name field in our YAML file represents the name of our pipeline. This pipeline will be triggered every time we push changes to the selected branches (in our case, only the master branch). Next, we define some jobs. A job is executed every time the pipeline is triggered. We've named our job heroku-deploy, and it runs on the latest version of the Ubuntu operating system.

To deploy our code to Heroku, we’re using a pre-built pipeline (in this case, akhileshns/heroku-deploy@v3.12.12). We need to pass some environment variables to this pre-built pipeline: our Heroku API key and the name of our Heroku application, which we chose in the earlier steps.

To set up these environment variables in your GitHub repository, follow these steps:

Heroku API_KEY:

  • Go to the Heroku dashboard.
  • Click on your profile icon and navigate to the “Account settings”.
  • Select the “API Key” section and generate a new API key if you haven’t already.

GitHub Secrets:

  • Go to your GitHub repository.
  • Navigate to the “Settings” tab.
  • In the left-hand menu, select “Secrets” or “Secrets and variables” → “Actions”.
  • Click on “New repository secret”.
  • Set the name of the secret as HEROKU_API_KEY and paste your Heroku API key as the value.
  • Additionally, set another secret named HEROKU_APP_NAME and assign the name of your Heroku application as its value.

By setting up these secrets in your GitHub repository, the pipeline will be able to access them during execution and deploy your code to Heroku seamlessly.

The configured secrets

After setting up the environment variables in our GitHub repository, Heroku needs to know how to start our server once the deployment is complete. To do this, we need to create a file at the root level of our project called Procfile. This file is a Heroku standard that specifies what command to run to start our server after deploying new code.

Inside the Procfile, add the following line of code:

web: gunicorn --preload --max-requests 500 --max-requests-jitter 5 --threads 3 --worker-class gthread --timeout 120 index:app

Here’s a breakdown of each component:

  • web: This indicates the process type. In this case, it's specifying that this process type will handle incoming web requests.
  • gunicorn: This is the command to start the Gunicorn server, which serves as the WSGI server for our application.
  • index:app: With this part of the command, we're telling Gunicorn that our application is located in a file named index.py and that the variable app within that file represents our WSGI application.
  • --preload: This flag is crucial for memory usage optimization. It instructs Gunicorn to load the application code and dependencies into memory before starting the worker processes. This helps reduce memory usage by ensuring that all workers share the same memory space.
  • --max-requests 500: This parameter tells Gunicorn to automatically restart a worker process after it has served 500 requests. This is a preventive measure against memory leaks and other potential issues that may arise from long-running worker processes.
  • --max-requests-jitter 5: This adds a small random offset to that restart threshold, so that all workers don’t restart at exactly the same moment.
  • --worker-class gthread: This specifies the class of worker processes to be used by Gunicorn. In this case, gthread is chosen, indicating that Gunicorn should use the threaded worker class. Other available options include sync, eventlet, and gevent. However, in many cases, the gthread option tends to be more stable and reliable, particularly in Python web applications.
  • --threads 3: This sets the number of threads each gthread worker runs, i.e. how many requests a single worker can handle concurrently.
  • --timeout 120: This tells Gunicorn to kill and restart a worker that has been unresponsive for more than 120 seconds.

Understanding these configuration options is essential for ensuring the smooth and efficient operation of our application when deployed with Gunicorn on Heroku.
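As a reference point, this is roughly what the index:app target assumes: a file called index.py at the project root exposing a Flask instance named app. The route below is only an illustrative placeholder, not part of our real chat API:

# index.py - a minimal sketch of the module Gunicorn loads via "index:app".
from flask import Flask, jsonify

app = Flask(__name__)  # the WSGI application object Gunicorn serves

@app.route("/health")
def health():
    # Placeholder endpoint, handy for checking that a dyno booted correctly.
    return jsonify({"status": "ok"})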

A graphical overview of our workflow so far

Now that everything is set up, we’re ready to deploy our application by pushing our changes to the master branch. Let's see what happens when we do that:

The failed build

As always in the IT field, things don’t always go as expected. In this case, our pipeline failed. But why? Well, in the earlier chapters of the series, we opted to use a local embeddings module. While everything worked fine on our local machines without any limits, deploying to Heroku posed a challenge. Heroku imposes constraints on the size of the application, allowing only 500 MB, whereas our embeddings model alone consumed a hefty 2.8 GB!

So, do we need to switch to using some API for embeddings? Not necessarily. Our goal throughout this series has been to utilize open-source solutions and minimize costs wherever possible. Luckily, some brilliant individuals have developed a solution for this exact problem: FastEmbed.

But what exactly is FastEmbed?

It’s a lightweight library with minimal external dependencies. Unlike other solutions, FastEmbed doesn’t require a GPU and doesn’t entail downloading gigabytes of PyTorch dependencies. Instead, it leverages the power of ONNX Runtime. This makes FastEmbed an ideal choice for serverless runtimes or small servers like those on Heroku.

Now, you might be wondering, what is ONNX Runtime? ONNX Runtime inference provides faster customer experiences and lower costs by supporting models from various deep learning frameworks like PyTorch and TensorFlow/Keras, as well as classical machine learning libraries such as scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with different hardware, drivers, and operating systems. It achieves optimal performance by utilizing hardware accelerators where available, alongside graph optimizations and transformations.

ONNX Runtime workflow
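For context only, this is roughly how ONNX Runtime is used directly; FastEmbed handles all of this for us under the hood, and the model path and input shape below are placeholders:

# Illustrative ONNX Runtime usage: load an exported model and run inference.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")       # placeholder model file
input_name = session.get_inputs()[0].name          # first input of the graph
dummy_input = np.zeros((1, 3), dtype=np.float32)   # shape depends on the model
outputs = session.run(None, {input_name: dummy_input})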

Great! With FastEmbed, we can deploy our local embeddings model selected in the previous steps while significantly reducing the amount of space it consumes. However, to make this work seamlessly with LlamaIndex, we need to make some changes to our code. Let’s navigate to the index_manager.py file and update the way we use the embeddings:

from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    ServiceContext,
    SimpleDirectoryReader
)

from llama_index.core.indices.loading import load_index_from_storage
from utils.vector_database import build_pinecone_vector_store, build_mongo_index
from mongodb.index import getExistingLlamaIndexes

from llama_index.llms.perplexity import Perplexity
from llama_index.embeddings.fastembed import FastEmbedEmbedding

import os

llm = Perplexity(
    api_key=os.getenv("PERPLEXITY_API_KEY"), model="mixtral-8x7b-instruct", temperature=0.2
)

embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")

index_store = build_mongo_index()
vector_store = build_pinecone_vector_store()

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

Indeed, the updated code is elegantly simple, thanks to LlamaIndex’s seamless integration with FastEmbed. This integration provides us with a straightforward interface to utilize the power of FastEmbed without any hassle.
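Before redeploying, a quick local sanity check can confirm that the new embedding model downloads and behaves as expected. A small sketch (the 384-dimensional output is specific to bge-small-en-v1.5):

from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Verify locally that FastEmbed fetches the model and returns vectors.
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("Hello, production RAG server!")
print(len(vector))  # bge-small-en-v1.5 should produce 384-dimensional vectors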

Now, let’s put our changes to the test by redeploying our application. The best way to do this is by pushing the updated code to the master branch of our repository. This will trigger our CI/CD pipeline once again, allowing us to verify that our changes are successfully deployed to Heroku.

With just a simple push to the master branch, we can ensure that our application remains up-to-date and fully functional, ready to serve our users with the latest enhancements and optimizations.

The pipeline works!

This time, our pipeline worked smoothly. Before pushing any changes, ensure that you have also updated your requirements.txt file with the newly installed libraries. Heroku relies on this file to download the required packages, and any discrepancies can lead to strange dependency errors.
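For reference, the relevant entries in requirements.txt will look something like the lines below; double-check the exact package names and versions against what you actually installed, since the LlamaIndex integrations are shipped as separate packages:

llama-index
llama-index-llms-perplexity
llama-index-embeddings-fastembed
fastembed
flask
gunicorn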

But are we done yet? Not quite. We still need to take care of two last things. The first, and perhaps the easiest, is to monitor the logs. If everything is functioning correctly, you should see something like this:

Logs of a successful boot

As you can observe, all the downloads occur before the workers start due to our utilization of the preload flag. This flag, employed to conserve space and memory, ensures that all necessary downloads and setups are completed before the worker processes begin their tasks.

Another crucial aspect to understand is the concept of workers and how to configure their numbers.

In the context of Gunicorn, a worker refers to a process responsible for handling incoming requests. The most basic and default type of worker is the synchronous worker. Synchronous workers operate by handling one request at a time, sequentially processing each request before moving on to the next.

Understanding the behavior and configuration of workers is essential for optimizing the performance and scalability of our application. By appropriately adjusting the number and type of workers, we can effectively manage resource utilization and ensure smooth handling of incoming traffic.

Worker graphic explanation in the detail

Earlier, we set the worker-class flag to gthread and configured --threads 3, allowing each worker to handle up to 3 requests concurrently. Now, you might be wondering how to determine the number of workers needed. By default, Heroku dynamically adjusts the number of workers based on the memory allocation of the chosen instance type. However, you can override this default behavior using an environment variable called WEB_CONCURRENCY.

The formula to determine the number of requests that can be handled simultaneously is:

Number of workers × threads per worker.

For example, if you have 3 workers and set 3 threads per worker, you can handle 9 requests concurrently.
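If it helps to see it spelled out, here is the same calculation as a tiny snippet (the values are examples, not recommendations):

# Concurrent capacity = number of Gunicorn workers * threads per worker.
workers = 3             # e.g. controlled via WEB_CONCURRENCY
threads_per_worker = 3  # the --threads value in our Procfile
print(workers * threads_per_worker)  # 9 requests handled concurrently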

With our deployment now complete and many questions and errors resolved, we’re ready for the final step: deploying our frontend to enable a production-ready interface. You can find the code for this part on my GitHub repository here.

If you found this post helpful, please consider sharing it on social networks like LinkedIn, and don’t forget to leave a clap and a comment! Your engagement helps promote the post (yes, Medium loves interactions for its algorithm).

If you have any questions or feedback, feel free to reach out to me on Discord at sl33p1420. Special thanks to James Pham for alerting me about a bug. Your contributions and feedback are greatly appreciated!
