PandasAI: Unlocking the Power of Data with Generative AI

Meet PandasAI and speak to your data

Tirendaz AI
Level Up Coding


Image by Freepik

As you know, one of the most time-consuming stages in a data science project is data preprocessing. When it comes to data preprocessing and data cleaning, Pandas is an awesome Python library. So, do you want to go a step further and add generative AI capabilities to this library? If so, you should definitely meet PandasAI.

PandasAI is a Python tool that extends the capabilities of Pandas by leveraging generative AI models. It not only makes it easier to work with large datasets, but also helps you perform complex data manipulations. For example, you can look for hidden patterns, detect outliers, and handle missing values using this tool.

At this point, you may ask how PandasAI works under the hood. It is no secret that it takes advantage of large language models (LLMs). The good news is that PandasAI works with OpenAI models, HuggingFace models, Google PaLM, Google VertexAI, and Azure OpenAI. It also has built-in support for LangChain models. Note that PandasAI does not replace Pandas; rather, it is an AI tool that builds on the power of Pandas.

Today, we’ll explore PandasAI and cover how to perform data analysis with it. Here are the topics we’ll handle in this blog:

  • Installing PandasAI
  • Setting up PandasAI
  • Exploring data with PandasAI
  • Data visualization with PandasAI

Getting Started with PandasAI

Before getting started, you need to install PandasAI. This is very easy to do using pip, as shown in the following code:

pip install pandasai

Ok, we installed the tool. Let’s move on to importing the necessary libraries.

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

Next, we need an OpenAI API key to work with the OpenAI API. Here, we read it from the OPENAI_API_KEY environment variable using the os module, so make sure that variable is set before running the code.

import os
openai_api_key = os.environ["OPENAI_API_KEY"]
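
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. Here is a minimal sketch of a friendlier lookup, assuming you export `OPENAI_API_KEY` in your shell beforehand. The helper name `get_api_key` is hypothetical and is not part of PandasAI:

```python
import os


def get_api_key(name="OPENAI_API_KEY"):
    """Return the API key stored in the environment variable `name`."""
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running this script."
        )
    return key
```

With this helper, the assignment above could be written as `openai_api_key = get_api_key()`.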

After that, we’re going to instantiate the OpenAI class with the API key:

llm = OpenAI(api_token=openai_api_key)

Finally, we’re going to create a PandasAI object using this llm.

pandas_ai = PandasAI(llm)

In this tutorial, we’re going to use the OpenAI GPT-3.5 model. You can also use the LLM wrappers for Google PaLM or for open-source models available on HuggingFace, such as Starcoder and Falcon. Let me show you this:

from pandasai.llm.starcoder import Starcoder
from pandasai.llm.falcon import Falcon
from pandasai.llm.google_palm import GooglePalm

# GooglePalm
llm = GooglePalm(api_token="YOUR_GOOGLE_API_KEY")

# Starcoder
llm = Starcoder(api_token="YOUR_HF_API_KEY")

# Falcon
llm = Falcon(api_token="YOUR_HF_API_KEY")

So far, we’ve imported the necessary libraries and instantiated a PandasAI object. Let’s move on to how to perform data manipulation with PandasAI.

Exploring Data with PandasAI

Image by Writer with Canva

In this section, we’re going to load a dataset and explore it using PandasAI. The dataset we’re going to use is the data science salaries. Let’s load this dataset with Pandas.

df = pd.read_csv("ds_salaries.csv")
df.head()
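
Before prompting, it helps to confirm what was actually loaded. Below is a minimal sketch in plain pandas. The three-row frame is made-up toy data standing in for `ds_salaries.csv`; `job_title` and `salary_in_usd` are the column names that show up in the outputs in this post, while `experience_level` is an assumed column name:

```python
import pandas as pd

# Toy stand-in for ds_salaries.csv (values are invented for illustration).
toy = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
        "experience_level": ["SE", "MI", "EN"],  # assumed column name
        "salary_in_usd": [150000, 120000, 70000],
    }
)

n_rows, n_cols = toy.shape        # number of rows and columns
column_names = list(toy.columns)  # column labels, in order
missing = toy.isna().sum()        # count of missing values per column

print(n_rows, n_cols)
print(column_names)
print(missing)
```

On the real data, run the same three lines against `df` instead of `toy`; `df.shape` should then report the 11 columns this dataset contains.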

This dataset shows salaries for different data science roles and contains 11 columns.

Awesome, we’ve loaded the dataset. To analyze it, we pass a DataFrame and a prompt to PandasAI. To show this, let’s prompt PandasAI to list the first five job titles by salary.

response = pandas_ai.run(df, prompt='List the first 5 job titles by salary in usd')
print(response)

# Output:
['Research Scientist', 'Data Analyst', 'AI Scientist', 'Applied Machine Learning Scientist', 'Principal Data Scientist']

As you can see, the job title with the highest salary is Research Scientist, followed by Data Analyst and AI Scientist.
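
For comparison, here is one way the same question could be answered in plain pandas. This is a sketch under one reading of the prompt (sort by `salary_in_usd`, highest first, then take five titles), not necessarily the code PandasAI generates. The rows are invented toy data:

```python
import pandas as pd

# Toy data; on the real dataset, use the `df` loaded from ds_salaries.csv.
toy = pd.DataFrame(
    {
        "job_title": [
            "Data Analyst",
            "Research Scientist",
            "Data Engineer",
            "BI Developer",
            "AI Scientist",
            "Data Scientist",
        ],
        "salary_in_usd": [90000, 300000, 120000, 45000, 210000, 150000],
    }
)

# Sort rows by salary, highest first, and keep the job titles of the top five.
top5_titles = (
    toy.sort_values("salary_in_usd", ascending=False)["job_title"]
    .head(5)
    .tolist()
)
print(top5_titles)
```

Writing the query out like this makes the sort order explicit, which is handy when you want to double-check what a prompt returned.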

Nice, you’ve seen how to select rows by talking to our dataset. What a time to be alive! You can do this with Pandas too, but it’s easier with a prompt, right? Let’s go ahead and take a look at the average salary by job title using PandasAI.

response = pandas_ai.run(df, prompt='What is the average salary in usd by job titles? Make sure the output is sorted in descending order.')
print(response)

# Output
"""
job_title
Data Science Tech Lead 375000.000
Cloud Data Architect 250000.000
Data Lead 212500.000
Data Analytics Lead 211254.500
Principal Data Scientist 198171.125
...
Autonomous Vehicle Technician 26277.500
3D Computer Vision Researcher 21352.250
Staff Data Analyst 15000.000
Product Data Scientist 8000.000
Power BI Developer 5409.000
Name: salary_in_usd, Length: 93, dtype: float64
"""

Voilà! You can see the average salaries by job title here. By talking to our data, it’s easy to see average salaries, isn’t it?
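
The output above has the shape of a pandas Series indexed by `job_title` and named `salary_in_usd`, which is what a group-by followed by a mean would produce. Here is a minimal sketch of that computation on invented toy data; it is an assumption about an equivalent query, not the code PandasAI ran:

```python
import pandas as pd

# Toy data with a repeated job title so the mean is not trivial.
toy = pd.DataFrame(
    {
        "job_title": ["Data Engineer", "Data Engineer", "Data Analyst", "Data Lead"],
        "salary_in_usd": [100000, 140000, 80000, 200000],
    }
)

# Average salary per job title, sorted from highest to lowest.
avg_by_title = (
    toy.groupby("job_title")["salary_in_usd"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_title)
```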

Data Visualization with PandasAI

We often say that a picture is worth a thousand words, and it’s true. In data analysis, one of the easiest ways to understand data is to visualize it. Let’s examine the top 10 job titles.

response = pandas_ai.run(df, prompt='Plot a bar chart showing top 10 job titles, using different colors for each bar')
print(response)

As you can see from the chart, the most common job title is Data Engineer, followed by Data Scientist. Notice that it is not difficult to plot a bar chart with a simple prompt. This is the power of generative AI, which is becoming more and more popular.
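
The numbers behind a chart like this are simple frequency counts. Here is a minimal sketch of how they could be computed in plain pandas, assuming "top 10" means the ten most frequent values of `job_title`. The rows are invented toy data:

```python
import pandas as pd

# Toy data: three job titles with different frequencies.
toy = pd.DataFrame(
    {
        "job_title": ["Data Engineer"] * 3
        + ["Data Scientist"] * 2
        + ["Data Analyst"]
    }
)

# value_counts() sorts by frequency, most common first; head(10) keeps the top ten.
top_titles = toy["job_title"].value_counts().head(10)
print(top_titles)
```

With matplotlib installed, `top_titles.plot.bar()` would draw these counts as a bar chart.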

The dataset includes four levels of experience in the job: Entry-level/Junior (EN), Mid-level/Intermediate (MI), Senior-level/Expert (SE), and Executive-level/Director (EX). Let’s take a look at the average salary by experience level.

response = pandas_ai.run(df, prompt='Plot a bar chart showing average salary in usd by experience level')
print(response)

As expected, the experience level with the highest average salary is executive (director), followed by senior.
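
To check a chart like this by hand, you can compute the averages directly. The sketch below maps the four experience codes to readable labels before grouping. The column name `experience_level` is an assumption, and the rows are invented toy data:

```python
import pandas as pd

# Experience codes used in the dataset, mapped to readable labels.
LEVEL_LABELS = {
    "EN": "Entry-level",
    "MI": "Mid-level",
    "SE": "Senior-level",
    "EX": "Executive-level",
}

toy = pd.DataFrame(
    {
        "experience_level": ["EN", "MI", "SE", "EX", "SE", "EN"],  # assumed column name
        "salary_in_usd": [60000, 100000, 150000, 200000, 170000, 80000],
    }
)

# Replace codes with labels, then average salary per level, highest first.
avg_by_level = (
    toy.assign(level=toy["experience_level"].map(LEVEL_LABELS))
    .groupby("level")["salary_in_usd"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_level)
```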

Finally, let’s look at how to plot a pie chart with PandasAI. To show this, let’s use the experience level column again and examine its distribution.

response = pandas_ai.run(df, prompt='Plot a pie chart showing the experience level distribution')
print(response)

As you can see, the senior-level positions have the highest count, followed by mid-level and junior positions. There are fewer director-level positions compared to other levels.
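
A pie chart shows each category's share of the total. Here is a minimal sketch of how those shares could be computed in plain pandas. The column name `experience_level` is an assumption, and the ten rows are invented toy data:

```python
import pandas as pd

# Toy data: 10 rows, so each row is worth 10 percent.
toy = pd.DataFrame(
    {"experience_level": ["SE"] * 4 + ["MI"] * 3 + ["EN"] * 2 + ["EX"] * 1}
)

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
share = toy["experience_level"].value_counts(normalize=True).mul(100).round(1)
print(share)
```

With matplotlib installed, `share.plot.pie()` would draw these shares as a pie chart.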

Wrap-Up

LLMs are a game changer in AI. PandasAI is an awesome tool for gaining insights from datasets by leveraging the power of LLMs. Trust me, this AI tool is a big step forward in data analysis. It helps you handle complex data exploration and data manipulation tasks by talking to your data. This library also lets you bring generative AI capabilities to large datasets.

Note that PandasAI was developed for use with Pandas and does not replace it. In addition, like all AI tools, it still has limitations and cannot fully replace humans. So, the results of an analysis need to be checked by a human.
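
One way to do that check is to compare an answer against the same quantity computed by hand in pandas. The sketch below assumes the answer comes back as a pandas Series, which may not hold for every prompt. The helper name `agrees_with_reference` is hypothetical and is not part of PandasAI:

```python
import pandas as pd


def agrees_with_reference(candidate, reference, rtol=1e-6):
    """Return True if two Series have the same labels and nearly the same values."""
    try:
        pd.testing.assert_series_equal(
            candidate.sort_index(),  # ignore row order
            reference.sort_index(),
            check_names=False,       # ignore the Series name
            check_dtype=False,       # allow int vs float
            rtol=rtol,               # relative tolerance for float values
        )
    except AssertionError:
        return False
    return True


# Toy example: same values in a different order count as agreement.
reference = pd.Series({"Data Lead": 200000.0, "Data Analyst": 80000.0})
candidate = pd.Series({"Data Analyst": 80000.0, "Data Lead": 200000.0})
print(agrees_with_reference(candidate, reference))
```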

In this blog, we covered how to install, set up, and work with PandasAI. We also saw how to plot charts to explore data. That’s it. Thanks for reading.
