SQL v. Pandas: JOIN

shotin
Level Up Coding
Published in
5 min readSep 27, 2020

--

I’ve worked with SQL for several years before I started using Pandas in Python. I eventually got used to it but I struggled with the syntax differences between them when I started.

As you learn pandas, you eventually wonder how you can implement join in pandas like in SQL. I’m going to write down about it in this article.

Before starting, I gathered basic syntax and comparison in this article. Please see also:

Dataset

import pandas as pddf_team = [
[1, "Arizona Diamondbacks"]
, [2, "Colorado Rockies"]
, [3, "Los Angeles Dodgers"]
, [4, "San Diego Padres"]
, [5, "San Francisco Giants"]
]
team_columns = ["Id", "Name"]
teams = pd.DataFrame(data=df_team, columns=team_columns)
teams
df_player = [
[1, "Cody Bellinger", 3, "OF"]
, [2, "Mookie Betts", 3, "OF"]
, [3, "Nolan Arenado", 2, "3B"]
, [4, "Trevor Story", 2, "SS"]
, [5, "Fernando Tatis Jr.", 4, "SS"]
, [6, "Buster Posey", 5, "C"]
, [7, "Starling Marte", 1, "OF"]
, [8, "Manny Machado", 4, "3B"]
]
player_columns = ["Id", "Name", "TeamId", "Position"]
players = pd.DataFrame(data=df_player, columns=player_columns)
players
#To make example easier, only 3 players
df_homerun = [
[2, 16]
, [5, 15]
, [8, 16]
]
homerun_columns = ["PlayerId", "Homerun"]
homeruns = pd.DataFrame(data=df_homerun, columns=homerun_columns)
homeruns

What is JOIN?

JOIN is SQL syntax which merges 2+ datasets (tables) into a single dataset. For example, as above, how do you define the players who belong to the Dodgers?

Firstly, you search in teams dataset and find Dodgers of Id is 3. And then, you search in players dataset with TeamId: 3.

This is not hard, but how about the home runs of the players on the Dodgers? You find all of the players in dodgers and search in the homeruns dataset with all of PlayerIds.

JOIN syntax makes this process skipped. You can display team name, player name and home run all at once.

JOIN has broadly two patterns: INNER JOIN and OUTER JOIN. Let’s see how to write each JOIN pattern in Pandas.

INNER JOIN

INNER JOIN is

  • Merge 2 datasets by the column you specified.
  • Remain some rows that have a column value in both tables.

In SQL, we can write as follow:

SELECT
p.Id
,p.Name
,h.Homerun
FROM players p
INNER JOIN homeruns h
ON p.Id = h.PlayerId

OR

SELECT
p.Id
,p.Name
,h.Homerun
FROM players p, homeruns h
where p.Id = h.PlayerId

In Pandas:

player_homerun = pd.merge(players, homeruns, left_on="Id", right_on="PlayerId", how="inner")
player_homerun[[“Id”, “Name”, “Homerun”]]

If tables have several keys, you set the “on” parameter as list.

player_homerun = pd.merge(players, homeruns, left_on=["Id", "xx", ...], right_on=["PlayerId", "xx", ...], how="inner")

But how can we query the same name columns in different datasets such as “Name” of player and “Name” of the team at the same time?

You can use the suffix for each column.

In Pandas:

player_team = pd.merge(players, teams, left_on="TeamId", right_on="Id", how="inner", suffixes=["_player", "_team"])
player_team[[“Id_player”, “Name_player”, “Name_team”]]

Of course you can rename column later.

player_team = player_team.rename(
columns={
"Id_player": "PlayerId"
,"Name_player": "PlayerName"
,"Name_team": "TeamName"
}
)
player_team[[“PlayerId”, “PlayerName”, “TeamName”]]

LEFT / RIGHT OUTER JOIN

LEFT (RIGHT) JOIN is one of OUTER JOIN. What is LEFT (RIGHT) JOIN?

In the earlier example, we can only query the players who have records in the homeruns dataset. But how can we remain all of the players in players dataset and also query home run value? That’s where the LEFT JOIN comes in.

You can choose the left or right table, and even if some records only exist in the table you choose, these records appear, unlike withINNER JOIN.

In SQL:

SELECT
p.Id
,p.Name
,h.Homerun
FROM players p
LEFT JOIN homeruns h
ON p.Id = h.PlayerId

In Pandas, set the join type as “how” parameter:

player_homerun = pd.merge(players, homeruns, left_on="Id", right_on="PlayerId", how="left")
player_homerun[[“Id”, “Name”, “Homerun”]]

If there is no record correspond to left table, value is NULL in SQL and NaN in Pandas.

FULL OUTER JOIN

By the way, how can we query all of the records in both tables? For example, the homeruns dataset is like this:

homeruns

The player whose id is 9 doesn’t exist in the players dataset. If you want to query all the players and all of the homeruns, how can you do this? Use FULL JOIN.

In SQL:

SELECT
p.Id
,p.Name
,h.Homerun
FROM players p
FULL JOIN homeruns h
ON p.Id = h.PlayerId

In Pandas:

player_homerun = pd.merge(players, homeruns, left_on="Id", right_on="PlayerId", how="outer")
player_homerun[[“Id”, “PlayerId”, “Name”, “Homerun”]]

You can find Id is NaN for PlayerId:9.

That’s all for basic syntax comparison about JOIN between SQL and Pandas. Thank you for reading.

--

--