Bonus Lesson 1: Handling Missing Data in Pandas
Objectives
- Identify missing values in DataFrames using the isna() and notna() methods.
- Choose strategies for filling missing values with fillna() and interpolation methods tailored to the scenario.
- Demonstrate how to drop missing data with dropna() and discuss various strategies for handling missing data.
Instructor note
This lesson is not intended to be covered in the live session.
The lesson material is available for course participants to follow on their own time.
Missing data is ubiquitous in real-world datasets.
Pandas represents missing values as NaN (Not a Number); proper handling of missing values is critical for accurate analysis.
We’ll focus on the three most essential operations: detecting, filling, and dropping missing data.
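As a minimal illustration, both NumPy's np.nan and Python's None are treated as missing by pandas' detection functions:

```python
import numpy as np
import pandas as pd

# Both NaN and None count as missing in pandas
s = pd.Series([1.0, np.nan, None])
print(s.isna().tolist())  # [False, True, True]
```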
Detecting Missing Values
Key Methods for Finding Missing Data
Method | Returns | Purpose
---|---|---
isna() | Boolean mask (True where missing) | Identify missing values
notna() | Boolean mask (True where not missing) | Identify non-missing values
isna().sum() | Count of missing values per column | Quantify missing data
isna().any() | Whether any value is missing per column | Quick presence check
Detecting Missing Values:
Demo
Pandas provides a consistent interface for detecting missing values with isna() and notna().
- Detect missing values with isna()
- Count missing values per column
- Count total missing values
import pandas as pd
import numpy as np
# Create a DataFrame with different types of missing values
data = {
'A': [1, 2, np.nan, 4, 5, 8],
'B': [6, np.nan, 8, 9, np.nan, 15],
'C': [10, 11, 12, np.nan, 14, 20],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
'E': ['a', 'b', None, 'd', 'e', 'h'] # None also represents missing value
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)
# Detect missing values with isna()
print("\n1. Boolean mask of missing values:")
print(df.isna()) # True where values are missing
# Count missing values per column
print("\n2. Count of missing values per column:")
print(df.isna().sum(axis=0))
# Count total missing values
total_missing = df.isna().sum().sum()
print(f"\n3. Total missing values in DataFrame: {total_missing}")
# Filter rows with any missing value
rows_with_missing = df.isna().any(axis=1)
rows_with_any_missing = df.loc[rows_with_missing]
print("\n4. Rows with any missing value:")
print(rows_with_any_missing)
# Filter rows with no missing values
## Boolean masks generated by `isna()` or `notna()` can be used directly for filtering
row_without_missing = df.notna().all(axis=1)
complete_rows = df.loc[row_without_missing]
print("\n5. Rows with no missing values:")
print(complete_rows)
Output
DataFrame with missing values:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
1. Boolean mask of missing values:
A B C D E
0 False False False True False
1 False True False True False
2 True False False True True
3 False False True True False
4 False True False True False
5 False False False False False
2. Count of missing values per column:
A 1
B 2
C 1
D 5
E 1
dtype: int64
3. Total missing values in DataFrame: 10
4. Rows with any missing value:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5. Rows with no missing values:
A B C D E
5 8.0 15.0 20.0 15.0 h
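Beyond raw counts, it often helps to express missingness as a fraction of each column; a small sketch using isna().mean() (averaging a boolean mask gives the share of True entries):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, 8],
    'D': [np.nan] * 5 + [15],
})

# Fraction of missing values per column: 1/6 for A, 5/6 for D
missing_share = df.isna().mean()
print(missing_share)
```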
Filling Missing Values
Most Common Fill Strategies:
Method | Purpose | Key Parameters
---|---|---
fillna(value) | Replace with specific value | value (scalar, dict, or Series), limit
ffill() | Forward fill (use previous value) | limit
bfill() | Backward fill (use next value) | limit
Basic Fill Methods
Demo
- Fill all missing values with a single value: fillna(value)
- Fill with different values for each column: fillna(dict)
data = {
'A': [1, 2, np.nan, 4, 5, 8],
'B': [6, np.nan, 8, 9, np.nan, 15],
'C': [10, 11, 12, np.nan, 14, 20],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
'E': ['a', 'b', None, 'd', 'e', 'h'] # None also represents missing value
}
df = pd.DataFrame(data)
# Original DataFrame
print("Original DataFrame:")
print(df)
# 1. Fill all missing values with a single value
df_filled = df.fillna(0)
print("\n1. Fill all missing values with 0:")
print(df_filled)
# 2. Fill numeric columns with their means
df_mean = df.copy()
df_mean.fillna(df_mean.select_dtypes(include=['number']).mean(), inplace=True)
print("\n2. Fill numeric columns with their means:")
print(df_mean)
## The select_dtypes() method is specifically designed to filter columns based on their data types
## The parameter include=['number'] tells pandas to keep only columns whose data type falls under the "number" category,
## which encompasses both integers and floating-point numbers
## The fillna() method replaces all NaN/missing values in the dataframe
## When you pass a Series as its argument (which is what we're doing with the means),
## pandas does something clever: it matches column names between the dataframe and the Series,
## and only fills values in columns that exist in both.
## String columns remain untouched since no means will be calculated for them
## df_mean["E"] = df_mean["E"].ffill() # Fills string column
Output
Original DataFrame:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
1. Fill all missing values with 0:
A B C D E
0 1.0 6.0 10.0 0.0 a
1 2.0 0.0 11.0 0.0 b
2 0.0 8.0 12.0 0.0 0
3 4.0 9.0 0.0 0.0 d
4 5.0 0.0 14.0 0.0 e
5 8.0 15.0 20.0 15.0 h
2. Fill numeric columns with their means:
A B C D E
0 1.0 6.0 10.0 15.0 a
1 2.0 9.5 11.0 15.0 b
2 4.0 8.0 12.0 15.0 None
3 4.0 9.0 13.4 15.0 d
4 5.0 9.5 14.0 15.0 e
5 8.0 15.0 20.0 15.0 h
More info
ffill() and bfill():
- Forward fill propagates the last valid observation forward; backward fill uses the next valid observation.
- fillna() also accepts a dict to fill each column with its own value, as the demo below shows.
# 1. Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': 0, 'E': 'missing'}
df_custom_fill = df.fillna(fill_values)
print("\n1. Fill with custom values per column:")
print(df_custom_fill)
# 2. Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\n2. Forward fill (ffill):")
print(df_ffill)
Output
1. Fill with custom values per column:
A B C D E
0 1.0 6.0 10.0 0.0 a
1 2.0 -1.0 11.0 0.0 b
2 0.0 8.0 12.0 0.0 missing
3 4.0 9.0 999.0 0.0 d
4 5.0 -1.0 14.0 0.0 e
5 8.0 15.0 20.0 15.0 h
2. Forward fill (ffill):
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 6.0 11.0 NaN b
2 2.0 8.0 12.0 NaN b
3 4.0 9.0 12.0 NaN d
4 5.0 9.0 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
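The fill-strategies table above also lists backward fill. bfill() mirrors ffill() but pulls the next valid observation backward, so leading gaps get filled while trailing gaps remain (a quick sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# bfill() uses the next valid value, so the trailing NaN stays missing
filled = s.bfill()
print(filled)
```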
Note
The choice of fill method should be based on the nature of your data and the analysis you’re conducting.
Forward fill and backward fill are especially useful for time series when you want to carry values forward or backward.
Statistical measures like mean, median, or mode are common choices for filling numeric data.
Remember that filling modifies your data - it’s important to document your approach and consider its impact on analysis.
For categorical data, consider whether a missing value has meaning before deciding how to fill it.
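The objectives mention interpolation: for numeric data where values change gradually (e.g. sensor readings), interpolate() estimates each gap from its neighbors instead of copying a single value; a minimal sketch using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Linear interpolation estimates each missing value from the surrounding ones
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```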
Dropping Missing Data
Key Parameters for dropna():
Parameter | Purpose | Values
---|---|---
axis | Specify rows or columns | 0 for rows, 1 for columns
how | Condition for dropping | ‘any’ (default) or ‘all’
thresh | Minimum non-NaN values to keep | integer
subset | Columns to consider | list of column names
Demo
Dropping Rows and Columns:
- Drop rows with any missing values
- Drop columns with any missing values
- Drop columns where all values are missing
- Keep rows with at least n non-NaN values
- Drop rows with missing values in specific columns
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [6, np.nan, 8, 9, np.nan],
'C': [10, 11, 12, np.nan, 14],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan],
'E': ['a', 'b', None, 'd', 'e'] # None also represents missing value
}
df = pd.DataFrame(data)
# Original DataFrame
print("Original DataFrame:")
print(df)
# 1. Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n1. After dropping rows with any missing values:")
print(df_drop_rows)
# 2. Drop columns with any missing values
## In filtering/dropping, axis=0 means "operate on the 0-axis (rows)"
df_drop_cols = df.dropna(axis=1)
print("\n2. After dropping columns with any missing values:")
print(df_drop_cols)
## In aggregations: axis specifies the axis to collapse/operate along
# 3. Drop columns where all values are missing
df_drop_all_na_cols = df.dropna(axis=1, how='all')
print("\n3. After dropping columns where all values are missing:")
print(df_drop_all_na_cols)
# 4. Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\n4. Keep rows with at least 3 non-NaN values:")
print(df_thresh)
# 5. Drop rows with missing values in specific columns
df_subset = df.dropna(subset=['A', 'C'])
print("\n5. Drop rows with NaN in columns A or C:")
print(df_subset)
Output
Original DataFrame:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
1. After dropping rows with any missing values:
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []
2. After dropping columns with any missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
3. After dropping columns where all values are missing:
A B C E
0 1.0 6.0 10.0 a
1 2.0 NaN 11.0 b
2 NaN 8.0 12.0 None
3 4.0 9.0 NaN d
4 5.0 NaN 14.0 e
4. Keep rows with at least 3 non-NaN values:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5. Drop rows with NaN in columns A or C:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
4 5.0 NaN 14.0 NaN e
More info
Dropping is the simplest strategy, but it can lead to data loss - use it cautiously.
The how parameter is critical - ‘any’ is much more restrictive than ‘all’.
The thresh parameter lets you keep rows with a minimum amount of data.
The subset parameter allows targeted dropping based on specific columns.
Dropping is often appropriate when missing values are rare, or when you need a complete dataset for specific analyses.
Be aware of potential bias introduced by dropping - if missing data correlates with certain characteristics, dropping can skew results.
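One way to check for that bias is to compare summary statistics before and after dropping; if they shift noticeably, the values were probably not missing at random. An illustrative sketch with made-up numbers, where income is missing only for older rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35, 60, 65, 70],
    'income': [40, 45, 50, np.nan, np.nan, np.nan],
})

# Dropping rows with missing income silently removes all older rows,
# shifting the mean age from 47.5 down to 30.0
print("mean age before:", df['age'].mean())
print("mean age after :", df.dropna()['age'].mean())
```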
Additional reading - Python Data Science Handbook
Key Takeaways
Keypoints
Always detect first: Understand the pattern and extent of missing values before taking action
Choose the right strategy:
Fill when you can reasonably infer values or need to preserve data
Drop when missing data is minimal or would introduce bias if filled
Document your approach: Your handling choices affect analysis results
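The takeaways above can be combined into a short end-to-end sketch (column names and fill choices here are illustrative, not prescriptive): detect first, then decide per column whether to fill or drop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': ['x', 'y', None, 'z'],
})

# 1. Detect: quantify missingness before acting
print(df.isna().sum())

# 2. Decide per column: fill numeric A with its median,
#    then drop rows still missing the categorical B
df['A'] = df['A'].fillna(df['A'].median())
df = df.dropna(subset=['B'])
print(df)
```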