Introducing Pandas

This is the guide to get you started in the world of data science

Lucas Ribeiro
Level Up Coding

--

Pandas is a Python library for data manipulation and analysis.
It is a great library for starting your exploratory data analysis (EDA),
because it allows you to read, manipulate, aggregate, and plot your data in
a few basic steps.

DataFrame Example

DataFrame

Put simply, a DataFrame is like an Excel sheet or a SQL table. It is composed of columns, rows, and an index. When pandas reads a data file, the result is a DataFrame.
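To make those three parts concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names (`order_id`, `date`, `price`) and the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "date": ["2020-05-01", "2020-05-02", "2020-05-03"],
        "price": [10.0, 25.5, 7.25],
    }
)

print(df.columns.tolist())  # column labels
print(df.index.tolist())    # row labels (a default 0..n-1 index here)
print(df.loc[0])            # one row, selected by its index label
```

Each dictionary key becomes a column, each list position becomes a row, and pandas adds a default integer index when none is given.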

Why is Pandas so popular?

  1. The library is easy to use and makes data manipulation simple.
  2. It is a gateway into the data science world.
  3. In my opinion, pandas is one of the best libraries for doing EDA. ❤

Complementary Libraries

Pandas never comes alone:

Seaborn, a library for statistical data visualization.

NumPy, a library for arrays and math functions.

Matplotlib, a library for data visualization.

Scikit-Learn, a machine learning library that we use for classification, clustering, and regression.

Before Starting Coding

You should set up an Anaconda environment to run this library. I
recommend one of these environments:

Installing Pandas

# jupyter cell
!pip install pandas
# Terminal
pip install pandas

Importing

import pandas as pd

Reading Data Files

There are many functions for reading your data. Their names normally start with pd.read_ followed by the file type, for example pd.read_csv or pd.read_excel.

CSV

#Template
df = pd.read_csv('file_path.csv', sep='separator character')
#Example
df = pd.read_csv('sales_202005.csv', sep=';')
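If you want to try `read_csv` without a file on disk, the sketch below reads semicolon-separated text from memory. The content of `csv_text` is invented, and `parse_dates` is an optional extra that asks pandas to convert the `date` column to datetimes while reading.

```python
import io

import pandas as pd

# Made-up file content; a real call would pass a file path instead
csv_text = "order_id;date;price\n1;2020-05-01;10.0\n2;2020-05-02;25.5\n"

# io.StringIO makes the string behave like an open file
df_csv = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["date"])

print(df_csv.dtypes)  # data type pandas inferred for each column
```

Without the right `sep`, pandas would look for commas and load each line as a single column.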

Excel

#Template
df = pd.read_excel('file_path.xlsx', sheet_name='')
#Example
df = pd.read_excel('sales_202005.xlsx', sheet_name='Jan')

Show Data

Head

df.head()

T (Transposition)

df.T
Shows rows as columns

Dimensions

Returns the number of rows and columns.

df.shape
(rows, columns)
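The three calls above can be tried together on a small table. This is only a sketch with made-up `day` and `price` values.

```python
import pandas as pd

# Six invented rows, so that head() has something to cut off
df = pd.DataFrame({"day": [1, 2, 3, 4, 5, 6], "price": [10, 20, 30, 40, 50, 60]})

first_rows = df.head()     # first 5 rows by default; df.head(3) would return 3
transposed = df.T          # rows become columns and columns become rows
n_rows, n_cols = df.shape  # shape is an attribute (a tuple), not a method

print(first_rows)
print(transposed.shape)
print(n_rows, n_cols)
```

Note that `df.T` and `df.shape` are written without parentheses, while `df.head()` is a method call.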

Information

df.info()
Red: the number of non-null rows in each column. Yellow: the data type of each column. Green: the memory usage.

Descriptive Statistics

Returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.

df.describe()
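As a quick illustration, the sketch below calls `describe()` on a single invented `price` column and then picks one statistic out of the result, which is itself a DataFrame.

```python
import pandas as pd

# Made-up prices for illustration
df = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})

stats = df.describe()

print(stats)                        # one row per statistic
print(stats.loc["mean", "price"])   # select a single value by row and column label
```

The row labels of the result are the statistic names, so they can be used with `.loc` like any other index.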

Working With Columns

Add new columns

#Template
df['column_name'] = value
#Example
df['month_nm'] = df['date'].dt.month_name()

Delete column

del df['column_name']
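One detail worth knowing: the `.dt` accessor only works when the column already holds datetimes. The sketch below, with invented data and a hypothetical `currency` column, converts a text column first, then adds and deletes columns.

```python
import pandas as pd

# Dates stored as plain text, as they often are after reading a file
df = pd.DataFrame({"date": ["2020-05-01", "2020-06-15"], "price": [10.0, 25.5]})

# Convert the text to datetimes so that .dt becomes available
df["date"] = pd.to_datetime(df["date"])

df["month_nm"] = df["date"].dt.month_name()  # new column derived from an existing one
df["currency"] = "USD"                       # a single value is repeated on every row

del df["currency"]                           # remove the column again

print(df)
```

If you prefer a method over the `del` statement, `df.drop(columns=['currency'])` returns a copy without that column.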

Filtering Data Frame

#OneCondition
df[ df['column_name'] == 'XPTO' ]
#MultipleCondition
df[ (condition 1) & (condition 2) ... ]
#Example
df[ (df['date'] >= '2020-05-01') & (df['date'] <= '2020-05-31') ]
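A filter is just a column of True/False values, one per row, that pandas uses to keep or drop rows. The sketch below, using made-up dates and prices, stores that mask in a variable so you can inspect it before applying it.

```python
import pandas as pd

# Invented data: one row before May, two in May, one after
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-04-30", "2020-05-01", "2020-05-31", "2020-06-01"]),
        "price": [5.0, 10.0, 20.0, 40.0],
    }
)

# Each condition needs its own parentheses because & binds tighter than >= and <=
in_may = (df["date"] >= "2020-05-01") & (df["date"] <= "2020-05-31")

may_sales = df[in_may]  # keep only the rows where the mask is True

print(in_may.tolist())
print(may_sales)
```

Use `&` for "and", `|` for "or", and `~` for "not"; the Python keywords `and`, `or`, and `not` do not work on whole columns.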

Pivot or Group By

Pivot

pd.pivot_table(df      #DataFrame Name
, index = "day" #Rows
, columns = "month_nm" #Columns
, values = "price" #Values
, aggfunc = "mean" #Aggregation function
)
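Here is a small, self-contained sketch of the same call on invented data, to show what the resulting table looks like. Combinations that never occur in the data (day 2 in June here) come out as NaN.

```python
import pandas as pd

# Made-up sales: two orders on day 1 of May, one on day 2 of May, one on day 1 of June
df = pd.DataFrame(
    {
        "day": [1, 1, 2, 1],
        "month_nm": ["May", "May", "May", "June"],
        "price": [10.0, 20.0, 30.0, 40.0],
    }
)

table = pd.pivot_table(
    df,
    index="day",         # one row per day
    columns="month_nm",  # one column per month name
    values="price",      # the numbers to aggregate
    aggfunc="mean",      # how to combine several prices in the same cell
)

print(table)
```

The two May orders on day 1 are averaged into a single cell, which is the job of `aggfunc`.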

Group By

df.groupby(['month_nm', 'day']).agg(
{ 'price': 'mean'     #Average price
, 'order_id': 'count' #Number of orders
}
).reset_index()
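The sketch below runs the same kind of aggregation on a few invented orders. Unlike the pivot table, the result stays in "long" form: one row per month and day combination that actually exists in the data.

```python
import pandas as pd

# Made-up orders for illustration
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "day": [1, 1, 2, 1],
        "month_nm": ["May", "May", "May", "June"],
        "price": [10.0, 20.0, 30.0, 40.0],
    }
)

summary = (
    df.groupby(["month_nm", "day"])
    .agg({"price": "mean", "order_id": "count"})  # one aggregation per column
    .reset_index()                                # turn the group keys back into columns
)

print(summary)
```

Without `reset_index()`, `month_nm` and `day` would become the index of the result instead of ordinary columns.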

Visualization

We usually assign the chart to a variable (ax) so that we can
set its other properties, such as the title, the x and y axis labels,
the legend, and the colors.


BoxPlot

ax = df.boxplot(column=['price'])

Bar

ax = df.plot.bar(x='month', y='price', figsize=(16,5), rot=0)

Line

ax = df.plot.line(x='date', y='price', figsize=(16,5), marker='o', legend=True)
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.set_title('Day Over Day x Total Sales Price')
ax
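For a version you can run outside a notebook, here is a minimal sketch with invented data. It assumes Matplotlib is installed, since pandas uses it for plotting by default, and it selects the non-interactive "Agg" backend only so that the script runs without a display; in Jupyter you can leave those two lines out.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; not needed in Jupyter

import pandas as pd

# Made-up daily prices for illustration
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
        "price": [10.0, 25.5, 7.25],
    }
)

# The plot call returns a Matplotlib Axes object, which we keep in ax
ax = df.plot.line(x="date", y="price", marker="o")

# The Axes object is what carries the title and axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Daily sales price")

print(ax.get_title())
```

To save the chart to a file, you can call `ax.get_figure().savefig(...)` with a file name of your choice.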

Pie

#Pie labels come from the index, so set month_nm as the index first
ax = df.set_index('month_nm').plot.pie(y='price', figsize=(8,8))

Histogram

ax = df['price'].hist(figsize=(10,5))

Lmplot (Seaborn)

Shows a scatter plot with a fitted trend line. We normally use this chart to visualize a linear regression.

import seaborn as sns

#New DF
dfLR = (
df.groupby(['day', 'month_nm', 'month'])
.agg({'price': 'mean'})
.reset_index()
)
#Chart
g = sns.lmplot( # lmplot returns a FacetGrid, not a single Axes
data = dfLR # DataFrame Name
, x = "day" # x-axis
, y = "price" # y-axis
, hue = "month_nm" # Points break (colors)
, col = "month" # Charts break
)
Average price by day of the month, one chart per month, with a linear regression line.
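To get a feel for what the trend line represents, the sketch below fits a straight line to a few invented points with NumPy's `polyfit`. This is a swapped-in technique for illustration, not what Seaborn runs internally; the point is only that a linear trend line is described by a slope and an intercept.

```python
import numpy as np
import pandas as pd

# Made-up average prices that rise by exactly 2 per day
dfLR = pd.DataFrame({"day": [1, 2, 3, 4], "price": [10.0, 12.0, 14.0, 16.0]})

# Least-squares fit of price = slope * day + intercept (degree 1 = straight line)
slope, intercept = np.polyfit(dfLR["day"].to_numpy(), dfLR["price"].to_numpy(), deg=1)

# Predicted price for a day that is not in the data
predicted_day_5 = slope * 5 + intercept

print(round(slope, 2), round(intercept, 2), round(predicted_day_5, 2))
```

A positive slope means the average price tends to rise during the month, and a negative slope means it tends to fall.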
