Bonus Lesson 2: Merging DataFrames in Pandas
Objectives
Illustrate how to concatenate DataFrames using
pd.concat()
and understand its applications.Explain database-style joins with the
merge()
function and illustrate the different join types (inner, outer, left, * right).
Instructor note
This lesson is not intended to cover in live session
Lesson material is available only for the course participants to follow on their own time
In real-world data analysis, you rarely have all the data you need in a single file or table.
Combining data from multiple sources is a fundamental data wrangling skill.
Pandas gives us two main approaches: stacking data (concatenation) and joining data on common keys (merging)
Concatenation: Stacking similar datasets (vertically or horizontally)
Merging/Joining: Combining related datasets based on common keys
Concatenation with pd.concat()
Parameter |
Purpose |
Common Values |
---|---|---|
|
List of DataFrames |
|
|
Direction |
|
Demo
Vertical concatenation (stacking)
Horizontal concatenation (side by side)
import pandas as pd
# Create sample DataFrames
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
}, index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
}, index=['K3', 'K4', 'K5'])
df3 = pd.DataFrame({
'C': ['C6', 'C7', 'C8'],
'D': ['D6', 'D7', 'D8']
}, index=['K6', 'K7', 'K8'])
# Vertical concatenation (stacking)
result1 = pd.concat([df1, df2])
print("1. Vertical concatenation (stacking)")
print(result1)
# Horizontal concatenation (side by side)
result2 = pd.concat([df1, df3], axis=1)
print("2. Horizontal concatenation (side by side)")
print(result2)
Output
1. Vertical concatenation (stacking)
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 B3
K4 A4 B4
K5 A5 B5
2. Horizontal concatenation (side by side)
A B C D
K0 A0 B0 NaN NaN
K1 A1 B1 NaN NaN
K2 A2 B2 NaN NaN
K6 NaN NaN C6 D6
K7 NaN NaN C7 D7
K8 NaN NaN C8 D8
Note
Concatenation is like stacking datasets either vertically (axis=0) or horizontally (axis=1).
Vertical concatenation is useful when you have data from different dataframes in the same format.
Horizontal concatenation is useful when you have different features for the same observations
Use
ignore_index=True
when original indices aren’t meaningful after combininge.g.,
pd.concat([df1, df2], ignore_index=True)
More info
Additional info
Database-Style Joins with merge():
Key Parameters for pd.merge():
Parameter |
Purpose |
Common Values |
---|---|---|
|
DataFrames to merge |
DataFrame objects |
|
Type of join |
|
|
Column(s) to join on |
Column name or list of names |
|
Column(s) when names differ |
Column name(s) for each DataFrame |
|
For duplicate columns |
Tuple of strings, default |
Basic Merge Examples
Demo
Inner join (default) - only matching rows (intersect)
Left join - keep all rows from left DataFrame
Right join - keep all rows from right DataFrame
Outer join - keep all rows from both DataFrames (union)
# Create sample DataFrames for merging
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'department_id': [101, 102, 101, 103, 102]
})
print("Original dataframe")
print(employees)
departments = pd.DataFrame({
'department_id': [101, 102, 103, 104],
'department_name': ['HR', 'Engineering', 'Finance', 'Marketing'],
'location': ['New York', 'San Francisco', 'Chicago', 'Boston']
})
print("Original dataframe")
print(departments)
# Inner join (default) - only matching rows
inner_merge = pd.merge(employees, departments, on='department_id')
print("1. Inner join (default) - only matching rows")
print(inner_merge)
# Left join - keep all rows from left DataFrame
left_merge = pd.merge(employees, departments, on='department_id', how='left')
print("2. Left join - keep all rows from left DataFrame")
print(left_merge)
# Right join - keep all rows from right DataFrame
right_merge = pd.merge(employees, departments, on='department_id', how='right')
print("3. Right join - keep all rows from right DataFrame")
print(right_merge)
# Outer join - keep all rows from both DataFrames
outer_merge = pd.merge(employees, departments, on='department_id', how='outer')
print("4. Outer join - keep all rows from both DataFrames")
print(outer_merge)
Output
Original dataframe
employee_id name department_id
0 1 Alice 101
1 2 Bob 102
2 3 Charlie 101
3 4 David 103
4 5 Eva 102
Original dataframe
department_id department_name location
0 101 HR New York
1 102 Engineering San Francisco
2 103 Finance Chicago
3 104 Marketing Boston
1. Inner join (default) - only matching rows
employee_id name department_id department_name location
0 1 Alice 101 HR New York
1 3 Charlie 101 HR New York
2 2 Bob 102 Engineering San Francisco
3 5 Eva 102 Engineering San Francisco
4 4 David 103 Finance Chicago
2. Left join - keep all rows from left DataFrame
employee_id name department_id department_name location
0 1 Alice 101 HR New York
1 2 Bob 102 Engineering San Francisco
2 3 Charlie 101 HR New York
3 4 David 103 Finance Chicago
4 5 Eva 102 Engineering San Francisco
3. Right join - keep all rows from right DataFrame
employee_id name department_id department_name location
0 1.0 Alice 101 HR New York
1 3.0 Charlie 101 HR New York
2 2.0 Bob 102 Engineering San Francisco
3 5.0 Eva 102 Engineering San Francisco
4 4.0 David 103 Finance Chicago
5 NaN NaN 104 Marketing Boston
4. Outer join - keep all rows from both DataFrames
employee_id name department_id department_name location
0 1.0 Alice 101 HR New York
1 3.0 Charlie 101 HR New York
2 2.0 Bob 102 Engineering San Francisco
3 5.0 Eva 102 Engineering San Francisco
4 4.0 David 103 Finance Chicago
5 NaN NaN 104 Marketing Boston
Notes:
The
merge()
function is modeled after SQL joins and is used to combine DataFrames based on common columns or indices.The ‘how’ parameter determines which rows to keep: ‘inner’ (matching rows only), ‘left’ (all left rows), ‘right’ (all right rows), or ‘outer’ (all rows).
You can merge on multiple columns by passing lists to ‘on’, ‘left_on’, or ‘right_on’.
When merging on indices instead of columns, use ‘left_index=True’ and/or ‘right_index=True’.
Merging is a powerful way to enrich your dataset with additional information from other sources.
More info
Additional info
Demo
Handling Duplicate Columns:
When merging DataFrames with overlapping column names, you need to handle duplicates:
Default behavior adds
_x
and_y
suffixes orAdd custom suffixes
# DataFrames with overlapping column names
df_a = pd.DataFrame({
'key': ['K0', 'K1', 'K2'],
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df_b = pd.DataFrame({
'key': ['K0', 'K2', 'K3'],
'B': ['B3', 'B4', 'B5'],
'C': ['C3', 'C4', 'C5']
})
# Default behavior adds _x and _y suffixes
merged_default = pd.merge(df_a, df_b, on='key')
print("1. Default behavior adds _x and _y suffixes")
print(merged_default)
# Custom suffixes
merged_custom = pd.merge(df_a, df_b, on='key', suffixes=('_left', '_right'))
print("2. Custom suffixes")
print(merged_custom)
Output
1. Default behavior adds _x and _y suffixes
key A B_x B_y C
0 K0 A0 B0 B3 C3
1 K2 A2 B2 B4 C4
2. Custom suffixes
key A B_left B_right C
0 K0 A0 B0 B3 C3
1 K2 A2 B2 B4 C4
Notes:
When merging DataFrames with duplicate column names, Pandas adds suffixes to make them unique.
You can customize these suffixes to make the result more readable.
Another approach is to select only the columns you need, either before or after merging
Duplicate indices can cause unexpected behavior, especially with joins.
Be careful with duplicate values in join keys - each duplicate in one DataFrame will join with all matches in the other DataFrame, potentially creating many rows.
Always inspect the result after merging, especially checking the row count to ensure it matches expectations.
Key Takeaways:
Join Types:
inner
: Keep only matching data (default)outer
: Keep everythingleft
/right
: Keep all rows from one DataFrame
Best Practices:
Clean join keys before merging
Handle duplicate columns with suffixes or column selection
Always verify the shape of the result
Key Takeaways
Keypoints
Concatenation vs. Merging:
Use concatenation for stacking similar data
Use merging for combining related data on keys