Bonus Lesson 1: Handling Missing Data in Pandas
Objectives
- Identify missing values in DataFrames using the isna() and notna() methods.
- Choose strategies for filling missing values with fillna() and interpolation methods tailored to the scenario.
- Demonstrate how to drop missing data with dropna() and discuss various strategies for handling missing data.
Instructor note
This lesson is not intended to be covered in the live session.
The lesson material is available for course participants to follow on their own time.
Missing data is ubiquitous in real-world datasets.
Pandas represents missing values as NaN (Not a Number); proper handling of missing values is critical for accurate analysis.
We’ll focus on the three most essential operations: detecting, filling, and dropping missing data.
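As a minimal illustration, both NumPy's np.nan and Python's None are treated as missing by pandas' detection functions:

```python
import numpy as np
import pandas as pd

# Both NaN and None count as missing in pandas
s = pd.Series([1.0, np.nan, None])
print(s.isna().tolist())  # [False, True, True]
```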
Detecting Missing Values
Key Methods for Finding Missing Data
Method | Returns | Purpose
---|---|---
isna() | Boolean mask (True where missing) | Identify missing values
notna() | Boolean mask (True where not missing) | Identify non-missing values
isna().sum() | Count of missing values per column | Quantify missing data
isna().any() | Whether any value is missing per column | Quick presence check
Detecting Missing Values:
Demo
Pandas provides a consistent interface for detecting missing values with isna() and notna().
- Detect missing values with isna()
- Count missing values per column
- Count total missing values
import pandas as pd
import numpy as np
# Create a DataFrame with different types of missing values
data = {
'A': [1, 2, np.nan, 4, 5, 8],
'B': [6, np.nan, 8, 9, np.nan, 15],
'C': [10, 11, 12, np.nan, 14, 20],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
'E': ['a', 'b', None, 'd', 'e', 'h'] # None also represents missing value
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)
# Detect missing values with isna()
print("\n1. Boolean mask of missing values:")
print(df.isna()) # True where values are missing
# Count missing values per column
print("\n2. Count of missing values per column:")
print(df.isna().sum(axis=0))
# Count total missing values
total_missing = df.isna().sum().sum()
print(f"\n3. Total missing values in DataFrame: {total_missing}")
# Filter rows with any missing value
rows_with_missing = df.isna().any(axis=1)
rows_with_any_missing = df.loc[rows_with_missing]
print("\n4. Rows with any missing value:")
print(rows_with_any_missing)
# Filter rows with no missing values
## Boolean masks generated by `isna()` or `notna()` can be used directly for filtering
row_without_missing = df.notna().all(axis=1)
complete_rows = df.loc[row_without_missing]
print("\n5. Rows with no missing values:")
print(complete_rows)
Output
DataFrame with missing values:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
1. Boolean mask of missing values:
A B C D E
0 False False False True False
1 False True False True False
2 True False False True True
3 False False True True False
4 False True False True False
5 False False False False False
2. Count of missing values per column:
A 1
B 2
C 1
D 5
E 1
dtype: int64
3. Total missing values in DataFrame: 10
4. Rows with any missing value:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5. Rows with no missing values:
A B C D E
5 8.0 15.0 20.0 15.0 h
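Beyond raw counts, it often helps to express missingness as a fraction of each column; a small sketch using isna().mean() (averaging a boolean mask gives the share of True entries):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5, 8],
    'D': [np.nan] * 5 + [15],
})

# Fraction of missing values per column: 1/6 for A, 5/6 for D
missing_share = df.isna().mean()
print(missing_share)
```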
Filling Missing Values
Most Common Fill Strategies:
Method | Purpose | Key Parameters
---|---|---
fillna(value) | Replace with specific value | value (scalar, dict, or Series), limit
ffill() | Forward fill (use previous value) | limit
bfill() | Backward fill (use next value) | limit
Basic Fill Methods
Demo
- Fill all missing values with a single value: fillna(value)
- Fill with different values for each column: fillna(dict)
data = {
'A': [1, 2, np.nan, 4, 5, 8],
'B': [6, np.nan, 8, 9, np.nan, 15],
'C': [10, 11, 12, np.nan, 14, 20],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
'E': ['a', 'b', None, 'd', 'e', 'h'] # None also represents missing value
}
df = pd.DataFrame(data)
# Original DataFrame
print("Original DataFrame:")
print(df)
# 1. Fill all missing values with a single value
df_filled = df.fillna(0)
print("\n1. Fill all missing values with 0:")
print(df_filled)
# 2. Fill numeric columns with their means
df_mean = df.copy()
df_mean.fillna(df_mean.select_dtypes(include=['number']).mean(), inplace=True)
print("\n2. Fill numeric columns with their means:")
print(df_mean)
## The select_dtypes() method is specifically designed to filter columns based on their data types
## The parameter include=['number'] tells pandas to keep only columns whose data type falls under the "number" category,
## which encompasses both integers and floating-point numbers
## The fillna() method replaces all NaN/missing values in the dataframe
## When you pass a Series as its argument (which is what we're doing with the means),
## pandas does something clever: it matches column names between the dataframe and the Series,
## and only fills values in columns that exist in both.
## String columns remain untouched since no means will be calculated for them
## df_mean["E"] = df_mean["E"].ffill() # Fills string column
Output
Original DataFrame:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
1. Fill all missing values with 0:
A B C D E
0 1.0 6.0 10.0 0.0 a
1 2.0 0.0 11.0 0.0 b
2 0.0 8.0 12.0 0.0 0
3 4.0 9.0 0.0 0.0 d
4 5.0 0.0 14.0 0.0 e
5 8.0 15.0 20.0 15.0 h
2. Fill numeric columns with their means:
A B C D E
0 1.0 6.0 10.0 15.0 a
1 2.0 9.5 11.0 15.0 b
2 4.0 8.0 12.0 15.0 None
3 4.0 9.0 13.4 15.0 d
4 5.0 9.5 14.0 15.0 e
5 8.0 15.0 20.0 15.0 h
More info
ffill() and bfill():
- Forward fill propagates the last valid observation forward; backward fill uses the next valid observation.
- fillna() also accepts a dict to fill each column with its own value, as the demo below shows.
# 1. Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': 0, 'E': 'missing'}
df_custom_fill = df.fillna(fill_values)
print("\n1. Fill with custom values per column:")
print(df_custom_fill)
# 2. Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\n2. Forward fill (ffill):")
print(df_ffill)
Output
1. Fill with custom values per column:
A B C D E
0 1.0 6.0 10.0 0.0 a
1 2.0 -1.0 11.0 0.0 b
2 0.0 8.0 12.0 0.0 missing
3 4.0 9.0 999.0 0.0 d
4 5.0 -1.0 14.0 0.0 e
5 8.0 15.0 20.0 15.0 h
2. Forward fill (ffill):
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 6.0 11.0 NaN b
2 2.0 8.0 12.0 NaN b
3 4.0 9.0 12.0 NaN d
4 5.0 9.0 14.0 NaN e
5 8.0 15.0 20.0 15.0 h
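The fill-strategies table above also lists backward fill. bfill() mirrors ffill() but pulls the next valid observation backward, so leading gaps get filled while trailing gaps remain (a quick sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# bfill() uses the next valid value, so the trailing NaN stays missing
filled = s.bfill()
print(filled)
```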
Note
The choice of fill method should be based on the nature of your data and the analysis you’re conducting.
Forward fill and backward fill are especially useful for time series when you want to carry values forward or backward.
Statistical measures like mean, median, or mode are common choices for filling numeric data.
Remember that filling modifies your data - it’s important to document your approach and consider its impact on analysis.
For categorical data, consider whether a missing value has meaning before deciding how to fill it.
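The objectives mention interpolation: for numeric data where values change gradually (e.g. sensor readings), interpolate() estimates each gap from its neighbors instead of copying a single value; a minimal sketch using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Linear interpolation estimates each missing value from the surrounding ones
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```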
Dropping Missing Data
Key Parameters for dropna():
Parameter | Purpose | Values
---|---|---
axis | Specify rows or columns | 0 for rows, 1 for columns
how | Condition for dropping | ‘any’ (default) or ‘all’
thresh | Minimum non-NaN values to keep | integer
subset | Columns to consider | list of column names
Demo
Dropping Rows and Columns:
- Drop rows with any missing values
- Drop columns with any missing values
- Drop columns where all values are missing
- Keep rows with at least n non-NaN values
- Drop rows with missing values in specific columns
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [6, np.nan, 8, 9, np.nan],
'C': [10, 11, 12, np.nan, 14],
'D': [np.nan, np.nan, np.nan, np.nan, np.nan],
'E': ['a', 'b', None, 'd', 'e'] # None also represents missing value
}
df = pd.DataFrame(data)
# Original DataFrame
print("Original DataFrame:")
print(df)
# 1. Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n1. After dropping rows with any missing values:")
print(df_drop_rows)
# 2. Drop columns with any missing values
## In filtering/dropping, axis=0 means "operate on the 0-axis (rows)"
df_drop_cols = df.dropna(axis=1)
print("\n2. After dropping columns with any missing values:")
print(df_drop_cols)
## In aggregations: axis specifies the axis to collapse/operate along
# 3. Drop columns where all values are missing
df_drop_all_na_cols = df.dropna(axis=1, how='all')
print("\n3. After dropping columns where all values are missing:")
print(df_drop_all_na_cols)
# 4. Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\n4. Keep rows with at least 3 non-NaN values:")
print(df_thresh)
# 5. Drop rows with missing values in specific columns
df_subset = df.dropna(subset=['A', 'C'])
print("\n5. Drop rows with NaN in columns A or C:")
print(df_subset)
Output
Original DataFrame:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
2 NaN 8.0 12.0 NaN None
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
1. After dropping rows with any missing values:
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []
2. After dropping columns with any missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
3. After dropping columns where all values are missing:
A B C E
0 1.0 6.0 10.0 a
1 2.0 NaN 11.0 b
2 NaN 8.0 12.0 None
3 4.0 9.0 NaN d
4 5.0 NaN 14.0 e
4. Keep rows with at least 3 non-NaN values:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
3 4.0 9.0 NaN NaN d
4 5.0 NaN 14.0 NaN e
5. Drop rows with NaN in columns A or C:
A B C D E
0 1.0 6.0 10.0 NaN a
1 2.0 NaN 11.0 NaN b
4 5.0 NaN 14.0 NaN e
More info
Dropping is the simplest strategy, but it can lead to data loss - use it cautiously.
The how parameter is critical - ‘any’ is much more restrictive than ‘all’.
The thresh parameter lets you keep rows with a minimum amount of data.
The subset parameter allows targeted dropping based on specific columns.
Dropping is often appropriate when missing values are rare, or when you need a complete dataset for specific analyses.
Be aware of potential bias introduced by dropping - if missing data correlates with certain characteristics, dropping can skew results.
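One way to check for that bias is to compare summary statistics before and after dropping; if they shift noticeably, the values were probably not missing at random. An illustrative sketch with made-up numbers, where income is missing only for older rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35, 60, 65, 70],
    'income': [40, 45, 50, np.nan, np.nan, np.nan],
})

# Dropping rows with missing income silently removes all older rows,
# shifting the mean age from 47.5 down to 30.0
print("mean age before:", df['age'].mean())
print("mean age after :", df.dropna()['age'].mean())
```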
Additional reading - Python Data Science Handbook
Key Takeaways
Keypoints
Always detect first: Understand the pattern and extent of missing values before taking action
Choose the right strategy:
Fill when you can reasonably infer values or need to preserve data
Drop when missing data is minimal or would introduce bias if filled
Document your approach: Your handling choices affect analysis results
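The takeaways above can be combined into a short end-to-end sketch (column names and fill choices here are illustrative, not prescriptive): detect first, then decide per column whether to fill or drop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': ['x', 'y', None, 'z'],
})

# 1. Detect: quantify missingness before acting
print(df.isna().sum())

# 2. Decide per column: fill numeric A with its median,
#    then drop rows still missing the categorical B
df['A'] = df['A'].fillna(df['A'].median())
df = df.dropna(subset=['B'])
print(df)
```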