Handling Missing Data in Pandas

Objectives

  • Identify missing values in DataFrames using isna() and related methods.

  • Fill missing values with fillna(), ffill(), and bfill(), choosing a strategy suited to the scenario.

  • Drop missing data with dropna() and weigh the trade-offs of different handling strategies.

Time

8 Minutes

  • Missing data is ubiquitous in real-world datasets

  • Pandas typically represents missing values as NaN (Not a Number)

  • Proper handling of missing values is critical for accurate analysis

  • We’ll focus on the three most essential operations: detecting, filling, and dropping missing data

Detecting Missing Values

Key Methods for Finding Missing Data

| Method              | Returns                                  | Purpose                      |
|---------------------|------------------------------------------|------------------------------|
| isna() / isnull()   | Boolean mask (True where missing)        | Identify missing values      |
| notna() / notnull() | Boolean mask (True where not missing)    | Identify non-missing values  |
| isna().sum()        | Count of missing values per column       | Quantify missing data        |
| isna().any()        | Whether any value is missing per column  | Quick presence check         |

Detecting Missing Values:

Demo

  • Detect missing values with isna()

  • Count missing values per column

  • Count total missing values

import pandas as pd
import numpy as np

# Create a DataFrame with different types of missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)

# Detect missing values with isna()
print("\n1. Boolean mask of missing values:")
print(df.isna())  # True where values are missing

# Count missing values per column
print("\n2. Count of missing values per column:")
print(df.isna().sum(axis=0))

# Count total missing values
total_missing = df.isna().sum().sum()
print(f"\n3. Total missing values in DataFrame: {total_missing}")

# Filter rows with any missing value
rows_with_missing = df.isna().any(axis=1)
rows_with_any_missing = df.loc[rows_with_missing]
print("\n4. Rows with any missing value:")
print(rows_with_any_missing)

# Filter rows with no missing values
rows_without_missing = df.notna().all(axis=1)
complete_rows = df.loc[rows_without_missing]
print("\n5. Rows with no missing values:")
print(complete_rows)

Output

1. Boolean mask of missing values:
       A      B      C      D      E
0  False  False  False   True  False
1  False   True  False   True  False
2   True  False  False   True   True
3  False  False   True   True  False
4  False   True  False   True  False
5  False  False  False  False  False

2. Count of missing values per column:
A    1
B    2
C    1
D    5
E    1
dtype: int64

3. Total missing values in DataFrame: 10

4. Rows with any missing value:
     A    B     C   D     E
0  1.0  6.0  10.0 NaN     a
1  2.0  NaN  11.0 NaN     b
2  NaN  8.0  12.0 NaN  None
3  4.0  9.0   NaN NaN     d
4  5.0  NaN  14.0 NaN     e

5. Rows with no missing values:
     A     B     C     D  E
5  8.0  15.0  20.0  15.0  h

Exercise

Check whether isna() and isnull() return the same results.

Discussion

  • Pandas provides a consistent interface for detecting missing values with isna() and notna().

  • Notice that both np.nan and None are treated as missing values in Pandas.

    • When stored in a numeric column, None is converted to np.nan; in an object column it is kept as None but is still detected as missing (see the sketch after this list)

  • Boolean masks generated by isna() can be used directly for filtering.

  • The axis parameter controls whether operations are applied to rows or columns.
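
A quick sketch (with made-up values) illustrating the points above: None and np.nan are both detected as missing, and the resulting boolean mask can be used directly for filtering.

import pandas as pd
import numpy as np

s_numeric = pd.Series([1.0, None, np.nan])  # None is coerced to NaN in a float column
s_object = pd.Series(['a', None, np.nan])   # object column keeps None but still flags it as missing

print(s_numeric.isna())              # [False, True, True]
print(s_object.isna())               # [False, True, True] - same pattern
print(s_numeric[s_numeric.notna()])  # boolean mask used for filtering: keeps only 1.0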

Filling Missing Values

Most Common Fill Strategies:

| Method        | Purpose                            | Key Parameters |
|---------------|------------------------------------|----------------|
| fillna(value) | Replace with a specific value      | value, inplace |
| ffill()       | Forward fill (use previous value)  | limit          |
| bfill()       | Backward fill (use next value)     | limit          |
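
The demo below only shows forward fill, so here is a small hedged sketch (made-up Series) of bfill() and the limit parameter, which caps how many consecutive missing values get filled.

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0])

print(s.ffill())         # [NaN, 2.0, 2.0, 2.0, 5.0] - nothing before index 0 to copy
print(s.bfill())         # [2.0, 2.0, 5.0, 5.0, 5.0] - each gap takes the next valid value
print(s.ffill(limit=1))  # [NaN, 2.0, 2.0, NaN, 5.0] - only one consecutive gap is filled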

Basic Fill Methods

Demo

  • Fill all missing values with a single value using fillna(value)

  • Fill with different values for each column using fillna(dict)

  • Forward fill (propagate last valid observation forward)

  • Fill numeric columns with their means


import pandas as pd
import numpy as np

# Recreate the example DataFrame
data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Fill all missing values with a single value
df_filled = df.fillna(0)
print("\n1. Fill all missing values with 0:")
print(df_filled)

# 2. Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': 0, 'E': 'missing'}
df_custom_fill = df.fillna(fill_values)
print("\n2. Fill with custom values per column:")
print(df_custom_fill)

# 3. Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\n3. Forward fill (ffill):")
print(df_ffill)

# 4. Fill numeric columns with their means
df_mean = df.copy()
for col in ['A', 'B', 'C', 'D']:
    df_mean[col] = df_mean[col].fillna(df_mean[col].mean())  # assign back instead of using inplace=True on a column
print("\n4. Fill numeric columns with their means:")
print(df_mean)

Output

Original DataFrame:
     A     B     C     D     E
0  1.0   6.0  10.0   NaN     a
1  2.0   NaN  11.0   NaN     b
2  NaN   8.0  12.0   NaN  None
3  4.0   9.0   NaN   NaN     d
4  5.0   NaN  14.0   NaN     e
5  8.0  15.0  20.0  15.0     h

1. Fill all missing values with 0:
     A     B     C     D  E
0  1.0   6.0  10.0   0.0  a
1  2.0   0.0  11.0   0.0  b
2  0.0   8.0  12.0   0.0  0
3  4.0   9.0   0.0   0.0  d
4  5.0   0.0  14.0   0.0  e
5  8.0  15.0  20.0  15.0  h

2. Fill with custom values per column:
     A     B      C     D        E
0  1.0   6.0   10.0   0.0        a
1  2.0  -1.0   11.0   0.0        b
2  0.0   8.0   12.0   0.0  missing
3  4.0   9.0  999.0   0.0        d
4  5.0  -1.0   14.0   0.0        e
5  8.0  15.0   20.0  15.0        h

3. Forward fill (ffill):
     A     B     C     D  E
0  1.0   6.0  10.0   NaN  a
1  2.0   6.0  11.0   NaN  b
2  2.0   8.0  12.0   NaN  b
3  4.0   9.0  12.0   NaN  d
4  5.0   9.0  14.0   NaN  e
5  8.0  15.0  20.0  15.0  h

4. Fill numeric columns with their means:
     A     B     C     D     E
0  1.0   6.0  10.0  15.0     a
1  2.0   9.5  11.0  15.0     b
2  4.0   8.0  12.0  15.0  None
3  4.0   9.0  13.4  15.0     d
4  5.0   9.5  14.0  15.0     e
5  8.0  15.0  20.0  15.0     h

Discussion

  • The choice of fill method should be based on the nature of your data and the analysis you’re conducting.

  • Forward fill and backward fill are especially useful for time series when you want to carry values forward or backward.

  • Statistical measures like mean, median, or mode are common choices for filling numeric data.

  • Remember that filling modifies your data - document your approach and consider its impact on the analysis.

  • For categorical data, consider whether a missing value has meaning before deciding how to fill it (see the sketch below).
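
A short sketch (small made-up DataFrame, not the lesson's example) of two strategies mentioned above: filling a skewed numeric column with its median and a categorical column with an explicit 'missing' label.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [10.0, np.nan, 12.0, 100.0],   # skewed, so median is safer than mean
    'color': ['red', None, 'blue', 'red'],  # categorical with a missing entry
})

df['price'] = df['price'].fillna(df['price'].median())  # median of [10, 12, 100] = 12.0
df['color'] = df['color'].fillna('missing')             # keep missingness visible as its own category

print(df)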

Dropping Missing Data

Key Parameters for dropna():

| Parameter | Purpose                                    | Values                     |
|-----------|--------------------------------------------|----------------------------|
| axis      | Drop rows or columns                       | 0 for rows, 1 for columns  |
| how       | Condition for dropping                     | ‘any’ (default) or ‘all’   |
| thresh    | Minimum non-NaN values required to keep    | integer                    |
| subset    | Columns to consider when checking for NaN  | list of column names       |

Demo

Dropping Rows and Columns:

  • Drop rows with any missing values

  • Drop columns with any missing values

  • Drop columns where all values are missing

  • Keep rows with at least n non-NaN values

  • Drop rows with missing values in specific columns

import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, np.nan],
    'C': [10, 11, 12, np.nan, 14],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'E': ['a', 'b', None, 'd', 'e']  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n1. After dropping rows with any missing values:")
print(df_drop_rows)

# 2. Drop columns with any missing values
df_drop_cols = df.dropna(axis=1)
print("\n2. After dropping columns with any missing values:")
print(df_drop_cols)

# 3. Drop columns where all values are missing
df_drop_all_na_cols = df.dropna(axis=1, how='all')
print("\n3. After dropping columns where all values are missing:")
print(df_drop_all_na_cols)

# 4. Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\n4. Keep rows with at least 3 non-NaN values:")
print(df_thresh)

# 5. Drop rows with missing values in specific columns
df_subset = df.dropna(subset=['A', 'C'])
print("\n5. Drop rows with NaN in columns A or C:")
print(df_subset)

Output

Original DataFrame:
     A    B     C   D     E
0  1.0  6.0  10.0 NaN     a
1  2.0  NaN  11.0 NaN     b
2  NaN  8.0  12.0 NaN  None
3  4.0  9.0   NaN NaN     d
4  5.0  NaN  14.0 NaN     e

1. After dropping rows with any missing values:
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []

2. After dropping columns with any missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

3. After dropping columns where all values are missing:
     A    B     C     E
0  1.0  6.0  10.0     a
1  2.0  NaN  11.0     b
2  NaN  8.0  12.0  None
3  4.0  9.0   NaN     d
4  5.0  NaN  14.0     e

4. Keep rows with at least 3 non-NaN values:
     A    B     C   D  E
0  1.0  6.0  10.0 NaN  a
1  2.0  NaN  11.0 NaN  b
3  4.0  9.0   NaN NaN  d
4  5.0  NaN  14.0 NaN  e

5. Drop rows with NaN in columns A or C:
     A    B     C   D  E
0  1.0  6.0  10.0 NaN  a
1  2.0  NaN  11.0 NaN  b
4  5.0  NaN  14.0 NaN  e

Discussion

  • Dropping is the simplest strategy, but it can lead to data loss - use it cautiously (see the sketch after this list).

  • The how parameter is critical - ‘any’ drops far more aggressively than ‘all’ (a single missing value is enough to remove a row or column).

  • The thresh parameter lets you keep rows with a minimum amount of data.

  • The subset parameter allows targeted dropping based on specific columns.

  • Dropping is often appropriate when missing values are rare, or when you need a complete dataset for specific analyses.

  • Be aware of potential bias introduced by dropping - if missing data correlates with certain characteristics, dropping can skew results.
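
Before dropping, it helps to quantify how much data you would lose; a minimal sketch, assuming a small made-up DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, np.nan],
})

n_total = len(df)
n_complete = len(df.dropna())  # rows that would survive dropna()
pct_lost = 100 * (n_total - n_complete) / n_total

print(f"dropna() would remove {pct_lost:.0f}% of rows ({n_total - n_complete} of {n_total})")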

Key Takeaways

Keypoints

  1. Always detect first: Understand the pattern and extent of missing values before taking action (a closing sketch follows these takeaways)

  2. Choose the right strategy:

    • Fill when you can reasonably infer values or need to preserve data

    • Drop when missing data is minimal or would introduce bias if filled

  3. Document your approach: Your handling choices affect analysis results
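
As a closing sketch (again with made-up data), one possible detect-then-decide workflow that combines the three operations from this lesson:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 31, 40, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF'],
    'score': [np.nan] * 5,  # entirely missing column
})

# 1. Detect: percentage of missing values per column
print((df.isna().mean() * 100).round(1))

# 2. Decide: drop the column that is entirely missing, fill the rest
df = df.dropna(axis=1, how='all')
df['age'] = df['age'].fillna(df['age'].mean())  # mean of [25, 31, 40] = 32.0
df['city'] = df['city'].fillna('unknown')

print(df)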