Introducing Pandas

This is the guide to get you started in the world of data science

Published in

Level Up Coding

4 min read · May 29, 2020

Pandas is a Python library for data manipulation and analysis.
It is a great library for starting your exploratory data analysis (EDA),
because it lets you read, manipulate, aggregate, and plot your data in a few
basic steps.

DataFrame

Put simply, a DataFrame is like an Excel sheet or a SQL table. It is composed of columns, rows, and an index. When we read data from a file with Pandas, the result is a DataFrame.
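To make the three parts of a DataFrame concrete, here is a minimal sketch that builds one by hand. The column names and values are invented for illustration and do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration only.
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "price": [10.0, 25.5, 7.25],
    },
    index=["a", "b", "c"],  # custom row labels; without this, pandas uses 0, 1, 2, ...
)

print(df.columns.tolist())   # the column names
print(df.index.tolist())     # the row labels (the index)
print(df.loc["b", "price"])  # one cell, looked up by row label and column name
```

In practice you rarely type data in like this; most DataFrames come from reading a file.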

Why is Pandas so popular?

The library is easy to use and makes data manipulation simple.
It is a gateway into the data science world.
In my opinion, Pandas is one of the best libraries for doing EDA. ❤

Complementary Libraries

Pandas never comes alone:

Seaborn: statistical data visualization.

NumPy: numerical arrays and math functions.

Matplotlib: general-purpose data visualization.

Scikit-Learn: machine learning for classification, clustering, and regression.

Before Starting Coding

You should set up a Python environment, such as Anaconda, to run this library. I
recommend one of these options:

Individual Edition | Anaconda

🐍 Open Source Anaconda Individual Edition is the world's most popular Python distribution platform with over 20…

www.anaconda.com

Using Python Environments in Visual Studio Code

An "environment" in Python is the context in which a Python program runs. An environment consists of an interpreter and…

code.visualstudio.com

Project Jupyter

The Jupyter Notebook is a web-based interactive computing platform. The notebook combines live code, equations…

jupyter.org

Installing Pandas

# Jupyter cell
!pip install pandas

# Terminal
pip install pandas

Importing

import pandas as pd

Reading Data Files

There are many functions for reading data, and their names normally start with pd.read_ followed by the file type, such as pd.read_csv or pd.read_excel.

CSV

# Template
df = pd.read_csv('file_path.csv', sep='separator character')

# Example
df = pd.read_csv('sales_202005.csv', sep=';')
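If you don't have a CSV file at hand, the sketch below shows the same call on an in-memory string. The file contents are made up for illustration, and `parse_dates` is an optional extra that asks Pandas to convert the `date` column to a datetime type while reading.

```python
import io

import pandas as pd

# Hypothetical file contents; a real call would pass a path such as 'sales_202005.csv'.
csv_text = "order_id;date;price\n1;2020-05-01;10.0\n2;2020-05-02;25.5\n"

# sep=';' tells pandas that the fields are separated by semicolons, not commas.
df = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["date"])

print(df)
print(df.dtypes)  # 'date' is read as a datetime column because of parse_dates
```

Having `date` as a real datetime column matters later, because accessors such as `.dt.month_name()` only work on datetime data.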

Excel

# Template
df = pd.read_excel('file_path.xlsx', sheet_name='')

# Example
df = pd.read_excel('sales_202005.xlsx', sheet_name='Jan')

Show Data

Head

df.head()

T (Transposition)

df.T

Dimensions

Returns the number of rows and columns as a tuple.

df.shape

Information

df.info()

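In the original screenshot, the highlights mark three parts of the output: **Red**, the number of non-null rows in each column. **Yellow**, the data type of each column. **Green**, the memory usage of the DataFrame.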

Descriptive Statistics

Returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.

df.describe()
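The inspection calls from this section can be tried together on a tiny DataFrame. This is only a sketch: the data is invented, and one price is deliberately missing so that the non-null counts differ between columns.

```python
import pandas as pd

# Hypothetical data with one missing price, invented for illustration.
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "price": [10.0, 20.0, None, 30.0],
    }
)

print(df.head(2))  # first two rows
print(df.T)        # rows and columns swapped
print(df.shape)    # (4, 2): four rows, two columns
df.info()          # non-null counts, dtypes, and memory usage

stats = df.describe()
print(stats.loc["count", "price"])  # 3.0, the missing value is not counted
print(stats.loc["mean", "price"])   # 20.0
```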

Working With Columns

Add new columns

# Template
df['column_name'] = value

# Example
df['month_nm'] = df['date'].dt.month_name()

Delete column

del df['column_name']
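Here is a small sketch of adding and deleting columns on invented data. One assumption worth calling out: the `.dt` accessor only works when the column has a datetime type, so the example converts the text dates with `pd.to_datetime` first. The `double_price` column is hypothetical and exists only to be removed again.

```python
import pandas as pd

# Hypothetical data; the dates start out as plain strings.
df = pd.DataFrame(
    {
        "date": ["2020-05-01", "2020-06-15"],
        "price": [10.0, 20.0],
    }
)

# Convert the strings to datetimes so that the .dt accessor is available.
df["date"] = pd.to_datetime(df["date"])

# Add new columns derived from existing ones.
df["month_nm"] = df["date"].dt.month_name()
df["double_price"] = df["price"] * 2

# Delete a column; drop() returns a new DataFrame instead of changing df in place.
df = df.drop(columns=["double_price"])

print(df)
```

`del df['column_name']` and `df.drop(columns=[...])` both remove the column. The second form is handy when you want to remove several columns at once.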

Filtering Data Frame

# One condition
df[df['column_name'] == 'XPTO']

# Multiple conditions (template)
# df[(condition_1) & (condition_2) & ...]

# Example
df[(df['date'] >= '2020-05-01') & (df['date'] <= '2020-05-31')]
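A filter is just a boolean Series used to select rows, so it can be stored in a variable and combined with other conditions. The sketch below uses invented data. Note that each condition needs its own parentheses when you combine them with `&`.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-04-30", "2020-05-10", "2020-05-31", "2020-06-01"]
        ),
        "price": [5.0, 10.0, 15.0, 20.0],
    }
)

# Each comparison produces a boolean Series; & keeps rows where both are True.
in_may = (df["date"] >= "2020-05-01") & (df["date"] <= "2020-05-31")

may_df = df[in_may]                         # rows dated in May 2020
cheap_may = df[in_may & (df["price"] < 12)]  # May rows with price below 12

print(may_df)
print(cheap_may)
```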

Pivot or Group By

Pivot

pd.pivot_table(df      # DataFrame
, index   = "day"      # Rows
, columns = "month_nm" # Columns
, values  = "price"    # Values
, aggfunc = "mean"     # Aggregation function
)
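To see what the pivot call produces, here is a minimal sketch on invented data that reuses the article's column names (`day`, `month_nm`, `price`). Day 1 of May appears twice, so its cell shows the mean of the two prices.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "day": [1, 1, 2, 2, 1],
        "month_nm": ["May", "June", "May", "June", "May"],
        "price": [10.0, 20.0, 30.0, 40.0, 20.0],
    }
)

# One row per day, one column per month, each cell holding the mean price.
pivot = pd.pivot_table(
    df, index="day", columns="month_nm", values="price", aggfunc="mean"
)

print(pivot)
print(pivot.loc[1, "May"])  # 15.0, the mean of 10.0 and 20.0
```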

Group By

df.groupby(['month_nm', 'day']).agg(
{  'price':    'mean'   # average price per group
, 'order_id': 'count'  # number of orders per group
}
).reset_index()
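The same kind of group by can be tried on invented data. This sketch uses named aggregation, an alternative form of `agg` that lets you choose the output column names. The names `avg_price` and `orders` are my own choice, not from the article.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "month_nm": ["May", "May", "May", "June"],
        "day": [1, 1, 2, 1],
        "price": [10.0, 20.0, 30.0, 40.0],
        "order_id": [1, 2, 3, 4],
    }
)

summary = (
    df.groupby(["month_nm", "day"])
    .agg(
        avg_price=("price", "mean"),  # output name = (source column, function)
        orders=("order_id", "count"),
    )
    .reset_index()  # turn the group keys back into regular columns
)

print(summary)
```

Unlike the pivot table, the result stays in long format: one row per combination of month and day.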

Visualization

We usually assign the chart to a variable so that we can
then set other properties, such as the title, the x and y axis
labels, the legend, and the colors.
BoxPlot

ax = df.boxplot(column=['price'])

Bar

ax = df.plot.bar(x='month', y='price', figsize=(16,5), rot=0)

Line

ax = df.plot.line(x='date', y='price', figsize=(16,5), marker='o', legend=True)
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.set_title('Day Over Day x Total Sales Price')
ax

Pie

ax = df.plot.pie(x='month_nm', y='price', figsize=(8,8))

Histogram

ax = df['price'].hist(figsize=(10,5))

Lmplot (Seaborn)

Shows a scatter plot with a fitted trend line. We normally use this chart to visualize a linear regression.

import seaborn as sns

# New DataFrame: average price per day and month
dfLR = (
 df.groupby(['day', 'month_nm', 'month'])
  .agg({'price': 'mean'})
  .reset_index()
)

# Chart (lmplot returns a FacetGrid, not an Axes)
g = sns.lmplot(
data  = dfLR       # DataFrame
, x   = "day"      # x-axis
, y   = "price"    # y-axis
, hue = "month_nm" # Color points by month name
, col = "month"    # One chart per month
)

Average price by day of the month, with a linear regression line for each month.

Get this code on GitHub

Keep learning Pandas

10 minutes to pandas - pandas 1.0.3 documentation

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the …

pandas.pydata.org

Cookbook - pandas 1.0.3 documentation

This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to add to…