Introducing Pandas
This guide will get you started in the world of data science.
Pandas is a Python library for data manipulation and analysis.
It is a great library for starting your exploratory data analysis (EDA),
because it lets you read, manipulate, aggregate, and plot your data in a
few basic steps.
DataFrame
Put simply, a DataFrame is like an Excel sheet or a SQL table. It is composed of columns, rows, and an index. When we read data from a file with pandas, the data is loaded into a DataFrame.
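To make the three parts concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "price": [10.0, 25.5, 7.25],
    },
    index=["a", "b", "c"],  # custom row labels; the default index is 0, 1, 2, ...
)

columns = list(df.columns)  # column labels
index = list(df.index)      # row labels
n_rows, n_cols = df.shape   # (number of rows, number of columns)
```

If you leave out the `index` argument, pandas numbers the rows starting at 0.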
Why is Pandas so popular?
- The library is easy to use and makes data manipulation simple.
- It is a common entry point into the data science world.
- In my opinion, pandas is one of the best libraries for doing EDA. ❤
Complementary Libraries
Pandas rarely comes alone:
Seaborn: statistical data visualization.
NumPy: numerical computing with arrays and math functions.
Matplotlib: general-purpose data visualization.
Scikit-Learn: machine learning, which we use for classification, clustering, and regression.
Before Starting Coding
Before running this library, set up a Python environment such as
Anaconda. I recommend one of these environments:
Installing Pandas
# Jupyter cell
!pip install pandas
# Terminal
pip install pandas
Importing
import pandas as pd
Reading Data Files
There are many options for reading your data. The reader functions are normally named pd.read_[format], for example pd.read_csv and pd.read_excel.
CSV
# Template
df = pd.read_csv('file_path.csv', sep='separator character')
# Example
df = pd.read_csv('sales_202005.csv', sep=';')
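If you want to try this without a file on disk, here is a small sketch that reads semicolon-separated text from memory. The contents are made up, `io.StringIO` stands in for the file, and `parse_dates` is an optional extra (not used above) that turns the date column into real datetimes.

```python
import io

import pandas as pd

# Made-up file contents; io.StringIO behaves like an open text file.
csv_text = (
    "order_id;date;price\n"
    "1;2020-05-01;10.0\n"
    "2;2020-05-02;25.5\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",               # the separator used in the file
    parse_dates=["date"],  # optional: parse this column as datetime
)
```

Having the date column as a datetime type makes the `.dt` accessor and date filters work as expected.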
Excel
# Template
df = pd.read_excel('file_path.xlsx', sheet_name='sheet name')
# Example
df = pd.read_excel('sales_202005.xlsx', sheet_name='Jan')
Show Data
Head
df.head()
T (Transposition)
df.T
Dimensions
Returns the number of rows and columns as a tuple.
df.shape
Information
df.info()
Descriptive Statistics
Returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles, and maximum.
df.describe()
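As a quick illustration of the inspection calls above, here is a sketch on a small made-up DataFrame. The column names and values are assumptions for the example.

```python
import pandas as pd

# Invented data: eight orders with round prices.
df = pd.DataFrame(
    {
        "order_id": range(1, 9),
        "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    }
)

first_rows = df.head()     # first 5 rows by default
n_rows, n_cols = df.shape  # rows and columns
stats = df.describe()      # summary statistics for numeric columns
mean_price = stats.loc["mean", "price"]
```

The result of `describe()` is itself a DataFrame, so you can pick single values out of it with `.loc`.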
Working With Columns
Add new columns
# Template
df['column_name'] = value
# Example
df['month_nm'] = df['date'].dt.month_name()
Delete column
del df['column_name']
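Here is a small end-to-end sketch of adding and deleting columns. The data is made up, and the `currency` column is a hypothetical name used only to show a constant value. Note that `.dt` only works when the column has a datetime type, which is why the example converts the strings with `pd.to_datetime`.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-04-30", "2020-05-01", "2020-06-15"]),
        "price": [10.0, 25.5, 7.25],
    }
)

df["currency"] = "USD"                       # a single value is repeated on every row
df["month_nm"] = df["date"].dt.month_name()  # derived from the datetime column

del df["currency"]                           # remove the column again
```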
Filtering a DataFrame
# One condition
df[ df['column_name'] == 'XPTO' ]
# Multiple conditions (wrap each one in parentheses)
df[ (condition 1) & (condition 2) ... ]
# Example
df[ (df['date'] >= '2020-05-01') & (df['date'] <= '2020-05-31') ]
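A runnable sketch of these filters on made-up data follows. The `status` column is a hypothetical name chosen for the example, and `Series.between` is shown as an alternative way to write the date range (it includes both ends by default).

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-04-30", "2020-05-01", "2020-05-31", "2020-06-01"]
        ),
        "status": ["XPTO", "OK", "XPTO", "OK"],
        "price": [10.0, 25.5, 7.25, 12.0],
    }
)

# One condition
xpto = df[df["status"] == "XPTO"]

# Multiple conditions: each one in parentheses, joined with & (and) or | (or)
may = df[(df["date"] >= "2020-05-01") & (df["date"] <= "2020-05-31")]

# Same date range written with between()
may_between = df[df["date"].between("2020-05-01", "2020-05-31")]
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.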
Pivot or Group By
Pivot
pd.pivot_table(df # DataFrame name
, index = "day" # Rows
, columns = "month_nm" # Columns
, values = "price" # Values
, aggfunc = "mean" # Aggregation function
)
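To see what the pivot produces, here is a sketch on a few made-up rows. The values are invented so the means are easy to check by hand.

```python
import pandas as pd

# Invented data: day 1 of May has two orders (20 and 60).
df = pd.DataFrame(
    {
        "day": [1, 1, 2, 2, 1],
        "month_nm": ["April", "May", "April", "May", "May"],
        "price": [10.0, 20.0, 30.0, 40.0, 60.0],
    }
)

pivot = pd.pivot_table(
    df,
    index="day",         # one row per day
    columns="month_nm",  # one column per month
    values="price",      # the numbers to aggregate
    aggfunc="mean",      # how to combine several orders in one cell
)
```

Each cell holds the mean price for that day and month, so day 1 of May is the average of 20 and 60.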
Group By
df.groupby(['month_nm', 'day']).agg(
{ 'price': 'mean' # Average price per group
, 'order_id': 'count' # Number of orders per group
}
).reset_index()
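Here is a sketch of the same kind of aggregation on made-up data. It uses named aggregation, a variant that also lets you choose the output column names. `avg_price` and `n_orders` are names I picked for the example.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5],
        "day": [1, 1, 2, 2, 1],
        "month_nm": ["April", "May", "April", "May", "May"],
        "price": [10.0, 20.0, 30.0, 40.0, 60.0],
    }
)

summary = (
    df.groupby(["month_nm", "day"])
    .agg(
        avg_price=("price", "mean"),    # new column name = (source column, function)
        n_orders=("order_id", "count"),
    )
    .reset_index()  # turn the group keys back into regular columns
)

# Row for day 1 of May, which has two orders
may_day1 = summary[(summary["month_nm"] == "May") & (summary["day"] == 1)].iloc[0]
```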
Visualization
We usually assign the chart to a variable so that we can set
other properties, such as the title, the x and y axis labels, the legend,
and the colors.
BoxPlot
ax = df.boxplot(column=['price'])
Bar
ax = df.plot.bar(x='month', y='price', figsize=(16,5), rot=0)
Line
ax = df.plot.line(x='date', y='price', figsize=(16,5), marker='o', legend=True)
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.set_title('Day Over Day x Total Sales Price')
ax
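A self-contained sketch of the line chart on made-up data follows, assuming Matplotlib is installed (pandas uses it for plotting). The `matplotlib.use("Agg")` line is only there so the example runs as a plain script without a display; leave it out in a Jupyter notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter

import pandas as pd

# Invented data for illustration.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
        "price": [100.0, 140.0, 90.0],
    }
)

# The plot call returns a Matplotlib Axes object, which we keep in ax
ax = df.plot.line(x="date", y="price", figsize=(16, 5), marker="o", legend=True)
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Total Sales Price per Day")

# To save the chart to a file, use the figure that owns the axes:
# ax.get_figure().savefig("sales_per_day.png")
```

Because `ax` is a regular Matplotlib Axes, any Axes method works on it, not only the three setters shown here.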
Pie
# Pie labels come from the index, so set month_nm as the index first
ax = df.set_index('month_nm').plot.pie(y='price', figsize=(8,8))
Histogram
ax = df['price'].hist(figsize=(10,5))
Lmplot (Seaborn)
Shows a scatter plot with a fitted trend line. We normally use this chart when exploring a linear regression.
import seaborn as sns

# New DataFrame: mean price per day and month
dfLR = (
df.groupby(['day', 'month_nm', 'month'])
.agg({'price': 'mean'})
.reset_index()
)
# Chart (lmplot returns a Seaborn FacetGrid, not a Matplotlib Axes)
g = sns.lmplot(
data = dfLR # DataFrame name
, x = "day" # X axis
, y = "price" # Y axis
, hue = "month_nm" # Color of the points
, col = "month" # One chart per month
)