Bonus Lesson 1: Handling Missing Data in Pandas

Objectives

  • Identify missing values in DataFrames using isna() and methods.

  • Strategize the filling of missing values with fillna() and interpolation methods tailored to scenario needs.

  • Demonstrate how to drop missing data with dropna() and discuss various strategies for handling missing data.

Instructor note

  • This lesson is not intended to cover in live session

  • Lesson material is available only for the course participants to follow on their own time

  • Missing data is ubiquitous in real-world datasets

  • Pandas represents missing values as NaN (Not a Number)

  • Proper handling of missing values is critical for accurate analysis

  • We’ll focus on the three most essential operations: detecting, filling, and dropping missing data

Detecting Missing Values

Key Methods for Finding Missing Data

Method

Returns

Purpose

isna() & isnull()

Boolean mask (True where missing)

Identify missing values

notna() & notnull()

Boolean mask (True where not missing)

Identify non-missing values

isna().sum()

Count of missing values per column

Quantify missing data

isna().any()

Whether any value is missing per column

Quick presence check

Detecting Missing Values:

Demo

  • Detect missing values with isna()

    • consistent interface for detecting missing values with isna() and notna()

  • Count missing values per column

  • Count total missing values

import pandas as pd
import numpy as np

# Create a DataFrame with different types of missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)

# Detect missing values with isna()
print("\n1. Boolean mask of missing values:")
print(df.isna())  # True where values are missing

# Count missing values per column
print("\n2. Count of missing values per column:")
print(df.isna().sum(axis=0))

# Count total missing values
total_missing = df.isna().sum().sum()
print(f"\n3. Total missing values in DataFrame: {total_missing}")

# Filter rows with any missing value
rows_with_missing = df.isna().any(axis=1)
rows_with_any_missing = df.loc[rows_with_missing]
print("\n4. Rows with any missing value:")
print(rows_with_any_missing)

# Filter rows with no missing values
## Boolean masks generated by `isna()` or `notna()`  can be used directly for filtering
row_without_missing = df.notna().all(axis=1)
complete_rows = df.loc[row_without_missing]
print("\n5. Rows with no missing values:")
print(complete_rows)

Filling Missing Values

Most Common Fill Strategies:

Method

Purpose

Key Parameters

fillna(value)

Replace with specific value

value, inplace

ffill()

Forward fill (use previous value)

bfill()

Backward fill (use next value)

Basic Fill Methods

Demo

  • Fill all missing values with a single value fillna(value)

  • Fill with different values for each column fillna(dict)


data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Fill all missing values with a single value
df_filled = df.fillna(0)
print("\n1. Fill all missing values with 0:")
print(df_filled)

# 2. Fill numeric columns with their means
df_mean = df.copy()
df_mean.fillna(df_mean.select_dtypes(include=['number']).mean(), inplace=True)
print("\n2. Fill numeric columns with their means:")
print(df_mean)

## The select_dtypes() method is specifically designed to filter columns based on their data types
## The parameter include=['number'] tells pandas to keep only columns whose data type falls under the "number" category,
##   which encompasses both integers and floating-point numbers

## The fillna() method replaces all NaN/missing values in the dataframe
## When you pass a Series as its argument (which is what we're doing with the means),
##   pandas does something clever: it matches column names between the dataframe and the Series,
##   and only fills values in columns that exist in both.

## String columns remain untouched since no means will be calculated for them
## df_mean["E"] = df_mean["E"].ffill() # Fills string column

More info

Additional info

ffill() and bfill():

  • Forward fill (propagate last valid observation forward)

  • Fill numeric columns with their means

# 2. Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': 0, 'E': 'missing'}
df_custom_fill = df.fillna(fill_values)
print("\n1. Fill with custom values per column:")
print(df_custom_fill)

# 3. Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\n2. Forward fill (ffill):")
print(df_ffill)

Output

1. Fill with custom values per column:
     A     B      C     D        E
0  1.0   6.0   10.0   0.0        a
1  2.0  -1.0   11.0   0.0        b
2  0.0   8.0   12.0   0.0  missing
3  4.0   9.0  999.0   0.0        d
4  5.0  -1.0   14.0   0.0        e
5  8.0  15.0   20.0  15.0        h

2. Forward fill (ffill):
     A     B     C     D  E
0  1.0   6.0  10.0   NaN  a
1  2.0   6.0  11.0   NaN  b
2  2.0   8.0  12.0   NaN  b
3  4.0   9.0  12.0   NaN  d
4  5.0   9.0  14.0   NaN  e
5  8.0  15.0  20.0  15.0  h

Note

  • The choice of fill method should be based on the nature of your data and the analysis you’re conducting.

  • Forward fill and backward fill are especially useful for time series when you want to carry values forward or backward.

  • Statistical measures like mean, median, or mode are common choices for filling numeric data.

  • Remember that filling is modifying your data - it’s important to document your approach and consider its impact on analysis.

  • For categorical data, consider whether a missing value has meaning before deciding how to fill it.

Dropping Missing Data

Key Parameters for dropna():

Parameter

Purpose

Values

axis

Specify rows or columns

0 for rows, 1 for columns

how

Condition for dropping

‘any’ (default) or ‘all’

thresh

Minimum non-NaN values to keep

integer

subset

Columns to consider

list of column names

Demo

Dropping Rows and Columns:

  • Drop rows with any missing values (any missing values)

  • Drop columns with any missing values

  • Drop columns where all values are missing

  • Keep rows with at least n non-NaN values

  • Drop rows with missing values in specific columns

data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, np.nan],
    'C': [10, 11, 12, np.nan, 14],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'E': ['a', 'b', None, 'd', 'e',]  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n1. After dropping rows with any missing values:")
print(df_drop_rows)

# 2. Drop columns with any missing values
## In filtering/dropping, axis=0 means "operate on the 0-axis (rows)"
df_drop_cols = df.dropna(axis=1)
print("\n2. After dropping columns with any missing values:")
print(df_drop_cols)
## In aggregations: axis specifies the axis to collapse/operate along

# 3. Drop columns where all values are missing
df_drop_all_na_cols = df.dropna(axis=1, how='all')
print("\n3. After dropping columns where all values are missing:")
print(df_drop_all_na_cols)

# 4. Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\n4. Keep rows with at least 3 non-NaN values:")
print(df_thresh)

# 5. Drop rows with missing values in specific columns
df_subset = df.dropna(subset=['A', 'C'])
print("\n5. Drop rows with NaN in columns A or C:")
print(df_subset)

More info

Additional info
  • Dropping is the simplest strategy, but it can lead to data loss - use cautiously.

  • The how parameter is critical - ‘any’ is much more restrictive than ‘all’.

  • The thresh parameter lets you keep rows with a minimum amount of data.

  • The subset parameter allows targeted dropping based on specific columns.

  • Dropping is often appropriate when missing values are rare, or when you need a complete dataset for specific analyses.

  • Be aware of potential bias introduced by dropping - if missing data correlates with certain characteristics, dropping can skew results.

Additional reading - Python Data Science Handbook

Key Takeaways

Keypoints

  1. Always detect first: Understand the pattern and extent of missing values before taking action

  2. Choose the right strategy:

    • Fill when you can reasonably infer values or need to preserve data

    • Drop when missing data is minimal or would introduce bias if filled

  3. Document your approach: Your handling choices affect analysis results