Data Visualization with the plot method in Pandas
Area plot, scatter plot, hexagonal bin plot, pie plot, density plot, scatter matrix with the plot method.
Data visualization is one of the most enjoyable stages of data analysis. Pandas is one of the most used Python libraries for data preprocessing and data cleaning. Libraries such as Matplotlib and Seaborn are often used to visualize data, but you can also plot a Series or DataFrame directly with Pandas.
In my last article, I showed how to use the plot method and talked about the bar, histogram and box plots with this method. In this post, I’ll cover the following topics:
- Area plot
- Scatter plot
- Hexagonal bin plot
- Pie plot
- Density plot
- Scatter matrix plot
Let’s dive in!
Area Plots
An area plot is a line plot with the space between the line and the axis filled in. Note that when area plots are stacked, each column must contain either all positive or all negative values. To show area plots, let’s import the necessary libraries.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Let me set the fivethirtyeight style as the graphic style.
plt.style.use("fivethirtyeight")
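If you are unsure which style names your Matplotlib installation accepts, a quick check like the sketch below can help. It relies only on plt.style.available. As far as I know, the seaborn-derived styles were renamed with a seaborn-v0_8- prefix in Matplotlib 3.6, so the exact names you see depend on your version.

```python
import matplotlib.pyplot as plt

# Names of all style sheets known to this Matplotlib installation
available_styles = sorted(plt.style.available)

# Seaborn-derived styles, whose names differ between Matplotlib versions
seaborn_styles = [name for name in available_styles if "seaborn" in name]
print(seaborn_styles)

# Apply the style used in this post only if it is actually available
if "fivethirtyeight" in available_styles:
    plt.style.use("fivethirtyeight")
```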
You can draw area plots with the plot.area method. To show this method, let me create a DataFrame.
df = pd.DataFrame(np.random.rand(10, 4), columns=list("ABCD"))
df.head()
Note that you can find the notebook and dataset here. Let’s draw an area plot for only one variable in this dataset.
df['A'].plot.area()
Let’s draw the area plots of all columns.
df.plot.area()
Area plots are stacked by default. To draw an unstacked plot, you can use the stacked=False parameter.
df.plot.area(stacked=False)
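The sign requirement mentioned at the start of this section applies to stacked plots. The sketch below uses a small made-up DataFrame (the name mixed is only for illustration) whose column A contains both positive and negative values. In the pandas versions I have used, the stacked call raises a ValueError, while the unstacked call works.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Made-up data: column A mixes positive and negative values
mixed = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [0.5, 1.5, 2.5]})

try:
    mixed.plot.area()  # stacked=True is the default
    stacked_ok = True
except ValueError as err:
    stacked_ok = False
    print(f"stacked area plot failed: {err}")

# Unstacked areas are drawn independently, so mixed signs are accepted
ax = mixed.plot.area(stacked=False)
```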
Note that missing values are automatically filled with zero. If you would rather fill them with a different value or drop them, you can use the fillna or dropna method before plotting.
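Here is a small sketch of handling missing values yourself before plotting. The DataFrame is made up, and the names gaps, filled and dropped are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
gaps = pd.DataFrame({"A": [0.2, np.nan, 0.6, 0.8],
                     "B": [0.5, 0.4, np.nan, 0.1]})

# Option 1: replace each missing value with the mean of its column
filled = gaps.fillna(gaps.mean())

# Option 2: drop every row that contains a missing value
dropped = gaps.dropna()

filled.plot.area()
dropped.plot.area()
```

Which option fits better depends on your data: filling keeps every row but invents values, while dropping shortens the x-axis.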
Let’s move on and use real datasets named iris and movies. You can download these datasets here. First, I’m going to load the famous iris dataset with the read_csv method.
iris=pd.read_csv("iris.data", header=None)
There are no column names in the dataset. Let’s name the columns of the dataset with the columns attribute.
iris.columns=["sepal_length","sepal_width", "petal_length",
"petal_width", "species"]
Let’s see the types of the columns of the dataset with the dtypes attribute.
iris.dtypes
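As an alternative sketch, read_csv can assign the column names while loading through its names parameter. To keep the snippet self-contained, it reads three rows in the same format as iris.data from an in-memory string instead of the file. The names sample and iris_sample are made up for this example.

```python
import io
import pandas as pd

# Three rows in the iris.data format, held in memory instead of a file
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
column_names = ["sepal_length", "sepal_width", "petal_length",
                "petal_width", "species"]

# header=None says the data has no header row; names supplies the labels
iris_sample = pd.read_csv(sample, header=None, names=column_names)

# Only the numeric columns are drawn by the plot methods used in this post
numeric_columns = iris_sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```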
The first four columns of the iris dataset are numeric and the last column is categorical. Now let’s draw the area plot of the numerical data with the plot.area method.
iris.plot.area()
Let’s draw the unstacked plot of the variables with the stacked=False parameter.
iris.plot.area(stacked=False)
Scatter Plots
A scatter plot is used to see the relationship between two numerical variables. The plot.scatter method is used to draw a scatter plot. Let’s draw a scatter plot of variables A and B in the df dataset with this method.
df.plot.scatter(x='A', y='B')
Let’s now use the IMDb dataset to show the scatter plots. First of all, let’s load this dataset with the read_csv method.
movies=pd.read_csv("imdbratings.txt")
Let’s see the first rows of this dataset with the head method.
movies.head()
Let’s see the types of columns in the dataset.
movies.dtypes
Notice that the variables star_rating and duration are numeric. Let’s draw the scatter plot of these two variables with the plot.scatter method.
movies.plot.scatter(x='star_rating', y='duration')
You can draw scatter plots of two pairs of variables on the same axes by calling the plot.scatter method twice. Let’s see the scatter plots of the sepal_length and sepal_width variables and of the petal_length and petal_width variables in the iris dataset on the same plot. To do this, let’s first assign the result of the first call to a variable named ax and then pass this variable to the second call with the ax parameter.
ax=iris.plot.scatter(x='sepal_length', y='sepal_width',
color='Blue', label='sepal')
iris.plot.scatter(x='petal_length', y='petal_width', color='red',
label='petal', ax=ax)
If you want to color each point by the value of a third variable while comparing two variables, you can pass that column name to the c parameter as follows:
iris.plot.scatter(x='sepal_length', y='sepal_width',
c='petal_length', s=100)
You can adjust the size of the points on the plot with the s parameter, which accepts either a single value or one value per point.
iris.plot.scatter(x='sepal_length', y='sepal_width',
s=iris['petal_length'] * 50)
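Raw column values are not always a good marker size. One possible approach, sketched below on a made-up DataFrame, is to rescale the column linearly to a chosen size range before passing it to s, and to choose the colors for c with the colormap parameter. The names flowers and scale_sizes are hypothetical and not part of pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd


def scale_sizes(values, smallest=20.0, largest=300.0):
    """Linearly map a numeric Series onto marker sizes in [smallest, largest]."""
    span = values.max() - values.min()
    if span == 0:
        # All values are equal, so every point gets the same size
        return pd.Series(smallest, index=values.index)
    return smallest + (values - values.min()) / span * (largest - smallest)


# Made-up measurements in the same layout as the iris columns
flowers = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 5.8],
    "sepal_width": [3.5, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 4.7, 6.0, 5.1],
})

sizes = scale_sizes(flowers["petal_length"])

# Color and size both encode petal_length here
ax = flowers.plot.scatter(x="sepal_length", y="sepal_width",
                          c="petal_length", colormap="viridis", s=sizes)
```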
Hexagonal Bin Plots
If the number of observations in your data is high, you can use a hexagonal bin plot instead of a scatter plot with the plot.hexbin method. Let’s draw the hexagonal bin plot of the star_rating and duration variables in the movies dataset.
movies.plot.hexbin(x="star_rating", y="duration", gridsize=25)
The gridsize parameter sets the number of hexagons in the x-direction. This value is 100 by default. Let’s set the gridsize to 10.
movies.plot.hexbin(x="star_rating", y="duration", gridsize=10)
Keep in mind that since we set gridsize=10, there are fewer hexagons and each one gets bigger.
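To make the effect of gridsize concrete, the sketch below counts the hexagons drawn for two grid sizes. It uses randomly generated points rather than the movies dataset, and the names points, n_coarse and n_fine are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Randomly generated points standing in for a large dataset
rng = np.random.default_rng(0)
points = pd.DataFrame({"x": rng.normal(size=2000),
                       "y": rng.normal(size=2000)})

ax_coarse = points.plot.hexbin(x="x", y="y", gridsize=10)
ax_fine = points.plot.hexbin(x="x", y="y", gridsize=25)

# The hexbin collection stores one value (the point count) per drawn hexagon
n_coarse = len(ax_coarse.collections[0].get_array())
n_fine = len(ax_fine.collections[0].get_array())
print(n_coarse, n_fine)
```

A smaller gridsize means fewer, larger bins, which smooths out noise but hides fine detail.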
Pie Plots
A pie plot is a circular statistical plot that shows how a single series of data is divided into proportions. You can use the plot.pie method to draw pie plots of a Series or DataFrame. Let’s use the iris dataset to show this plot. First, I’m going to select the petal_width variable, group it by the species variable and take the mean of each group.
iris_avg=iris["petal_width"].groupby(iris["species"]).mean()
iris_avg
Now let’s draw a pie plot with the plot.pie method.
iris_avg.plot.pie()
Now let’s draw the pie plots of two numerical variables of the iris dataset, grouped by the species variable. First, let’s create a variable named iris_avg_2.
iris_avg_2=iris[["petal_width",
"petal_length"]].groupby(iris["species"]).mean()
Now let’s draw a separate pie plot for each column of this dataset. To draw a pie plot from a DataFrame, you must either specify a column with the y parameter or use the subplots=True parameter.
iris_avg_2.plot.pie(subplots=True)
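The sketch below shows both options side by side on a small made-up DataFrame (averages is a hypothetical stand-in for iris_avg_2). It also shows that, in the pandas versions I have used, leaving out both y and subplots=True raises a ValueError.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Made-up group averages, not the real iris values
averages = pd.DataFrame(
    {"petal_width": [1.0, 3.0, 4.0], "petal_length": [2.0, 4.0, 6.0]},
    index=["a", "b", "c"],
)

# Option 1: one pie for a single column, selected with y
single_ax = averages.plot.pie(y="petal_width")

# Option 2: one pie per column, drawn side by side
axes = averages.plot.pie(subplots=True, figsize=(10, 5))

# With neither option pandas cannot tell which column to draw
try:
    averages.plot.pie()
    needs_choice = False
except ValueError:
    needs_choice = True
```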
You can also set other properties such as labels in pie plots. Let’s take the iris_avg data for instance and draw a pie plot of this data with the default values.
iris_avg.plot.pie()
Now let’s set the properties.
iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"],
colors=list("brg"), fontsize=25, figsize=(10,10))
To show the percentage of each pie slice, you can use the autopct='%.2f' parameter.
iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"],
colors=list("brg"),
autopct='%.2f',
fontsize=25,
figsize=(10,10))
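The numbers that autopct prints are each slice’s share of the total. The sketch below recomputes them by hand for a small made-up Series (means is a hypothetical stand-in for iris_avg, not the real averages) and uses the format string '%.2f%%', where the doubled percent sign prints a literal % after the number.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Made-up values chosen so the percentages are easy to check by hand
means = pd.Series([1.0, 3.0, 4.0],
                  index=["setosa", "versicolor", "virginica"],
                  name="petal_width")

# Each slice's share of the total, in percent
shares = means / means.sum() * 100
print(shares.round(2).tolist())  # [12.5, 37.5, 50.0]

# '%.2f%%' formats the share with two decimals and appends a percent sign
ax = means.plot.pie(autopct="%.2f%%", colors=list("brg"))

# The axes hold both the category labels and the percentage labels as text
slice_labels = [text.get_text() for text in ax.texts]
```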
Density Plots
Density plots allow you to visualize the distribution of a numeric variable for one or several groups. You can draw a density plot with the plot.kde method. This method can be used for both Series and DataFrame. Let’s draw density plots of the numerical variables in the iris dataset.
iris.plot.kde()
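To compare groups rather than columns, one possible approach is to draw one density curve per group on shared axes. The sketch below uses randomly generated data, and the names samples, group and value are made up. Note that plot.kde relies on SciPy, so SciPy needs to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated values for two groups with different centers
rng = np.random.default_rng(0)
samples = pd.DataFrame({
    "group": np.repeat(["g1", "g2"], 100),
    "value": np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)]),
})

fig, ax = plt.subplots()

# Draw one density curve per group on the same axes
for name, values in samples.groupby("group")["value"]:
    values.plot.kde(ax=ax, label=name)

ax.legend()
```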
Scatter Matrix
A scatter matrix, also called a scatter plot matrix, is a grid of scatter plots that shows the relationship between every pair of numeric variables in a dataset. You can draw a scatter matrix with the scatter_matrix function. Let’s first import this function from pandas.plotting.
from pandas.plotting import scatter_matrix
Now let’s see the scatter matrix of the numeric columns in the movies dataset.
scatter_matrix(movies, alpha=0.5, diagonal='kde')
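The function returns a grid of axes with one row and one column per numeric variable. The sketch below makes this visible on a made-up DataFrame (films is a hypothetical stand-in for the movies dataset) and selects the numeric columns explicitly. It uses diagonal='hist' so that SciPy is not required.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated stand-in for the movies dataset
rng = np.random.default_rng(0)
films = pd.DataFrame({
    "title": [f"film {i}" for i in range(50)],
    "star_rating": rng.uniform(7, 10, 50),
    "duration": rng.integers(80, 200, 50),
})

# Keep only the numeric columns; the text column cannot be plotted
numeric = films.select_dtypes(include="number")

# One panel per pair of variables, histograms on the diagonal
axes = scatter_matrix(numeric, alpha=0.5, diagonal="hist")
print(axes.shape)  # (2, 2)
```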
Conclusion
You can use the plot method in Pandas for data visualization. This method lets you draw plots directly from a Series or DataFrame. In this post, I talked about the area plot, scatter plot, hexagonal bin plot, pie plot and density plot with this method, as well as the scatter matrix. That’s it. I hope you enjoyed it. Thank you for reading. You can find this notebook here.