Handling Missing Data in Pandas

Objectives

  • Identify missing values in DataFrames using isna() and related methods.

  • Fill missing values with fillna(), ffill(), and bfill(), choosing a strategy suited to the scenario.

  • Drop missing data with dropna() and weigh the trade-offs of different handling strategies.

Time

8 Minutes

  • Missing data is ubiquitous in real-world datasets

  • Pandas typically represents missing values as NaN (Not a Number)

  • Proper handling of missing values is critical for accurate analysis

  • We’ll focus on the three most essential operations: detecting, filling, and dropping missing data

Detecting Missing Values

Key Methods for Finding Missing Data

| Method              | Returns                                  | Purpose                      |
|---------------------|------------------------------------------|------------------------------|
| isna() / isnull()   | Boolean mask (True where missing)        | Identify missing values      |
| notna() / notnull() | Boolean mask (True where not missing)    | Identify non-missing values  |
| isna().sum()        | Count of missing values per column       | Quantify missing data        |
| isna().any()        | Whether any value is missing per column  | Quick presence check         |

Detecting Missing Values:

Demo

  • Detect missing values with isna()

  • Count missing values per column

  • Count total missing values

import pandas as pd
import numpy as np

# Create a DataFrame with different types of missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)

# Detect missing values with isna()
print("\n1. Boolean mask of missing values:")
print(df.isna())  # True where values are missing

# Count missing values per column
print("\n2. Count of missing values per column:")
print(df.isna().sum(axis=0))

# Count total missing values
total_missing = df.isna().sum().sum()
print(f"\n3. Total missing values in DataFrame: {total_missing}")

# Filter rows with any missing value
rows_with_missing = df.isna().any(axis=1)
rows_with_any_missing = df.loc[rows_with_missing]
print("\n4. Rows with any missing value:")
print(rows_with_any_missing)

# Filter rows with no missing values
rows_without_missing = df.notna().all(axis=1)
complete_rows = df.loc[rows_without_missing]
print("\n5. Rows with no missing values:")
print(complete_rows)

Output

1. Boolean mask of missing values:
       A      B      C      D      E
0  False  False  False   True  False
1  False   True  False   True  False
2   True  False  False   True   True
3  False  False   True   True  False
4  False   True  False   True  False
5  False  False  False  False  False

2. Count of missing values per column:
A    1
B    2
C    1
D    5
E    1
dtype: int64

3. Total missing values in DataFrame: 10

4. Rows with any missing value:
     A    B     C   D     E
0  1.0  6.0  10.0 NaN     a
1  2.0  NaN  11.0 NaN     b
2  NaN  8.0  12.0 NaN  None
3  4.0  9.0   NaN NaN     d
4  5.0  NaN  14.0 NaN     e

5. Rows with no missing values:
     A     B     C     D  E
5  8.0  15.0  20.0  15.0  h

Exercise

Check whether isna() and isnull() return the same results.

Discussion

  • Pandas provides a consistent interface for detecting missing values with isna() and notna().

  • Notice that both np.nan and None are treated as missing values in Pandas.

    • When stored in a numeric column, None is converted to np.nan; in an object column it is kept as None but is still detected as missing (see the sketch after this list)

  • Boolean masks generated by isna() can be used directly for filtering.

  • The axis parameter controls whether operations are applied to rows or columns.
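
A quick sketch (with made-up values) illustrating the points above: None and np.nan are both detected as missing, and the resulting boolean mask can be used directly for filtering.

import pandas as pd
import numpy as np

s_numeric = pd.Series([1.0, None, np.nan])  # None is coerced to NaN in a float column
s_object = pd.Series(['a', None, np.nan])   # object column keeps None but still flags it as missing

print(s_numeric.isna())              # [False, True, True]
print(s_object.isna())               # [False, True, True] - same pattern
print(s_numeric[s_numeric.notna()])  # boolean mask used for filtering: keeps only 1.0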

Filling Missing Values

Most Common Fill Strategies:

| Method        | Purpose                            | Key Parameters |
|---------------|------------------------------------|----------------|
| fillna(value) | Replace with a specific value      | value, inplace |
| ffill()       | Forward fill (use previous value)  | limit          |
| bfill()       | Backward fill (use next value)     | limit          |
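
The demo below only shows forward fill, so here is a small hedged sketch (made-up Series) of bfill() and the limit parameter, which caps how many consecutive missing values get filled.

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0])

print(s.ffill())         # [NaN, 2.0, 2.0, 2.0, 5.0] - nothing before index 0 to copy
print(s.bfill())         # [2.0, 2.0, 5.0, 5.0, 5.0] - each gap takes the next valid value
print(s.ffill(limit=1))  # [NaN, 2.0, 2.0, NaN, 5.0] - only one consecutive gap is filled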

Basic Fill Methods

Demo

  • Fill all missing values with a single value using fillna(value)

  • Fill with different values for each column using fillna(dict)

  • Forward fill (propagate last valid observation forward)

  • Fill numeric columns with their means


import pandas as pd
import numpy as np

# Recreate the example DataFrame
data = {
    'A': [1, 2, np.nan, 4, 5, 8],
    'B': [6, np.nan, 8, 9, np.nan, 15],
    'C': [10, 11, 12, np.nan, 14, 20],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, 15],
    'E': ['a', 'b', None, 'd', 'e', 'h']  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Fill all missing values with a single value
df_filled = df.fillna(0)
print("\n1. Fill all missing values with 0:")
print(df_filled)

# 2. Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': 0, 'E': 'missing'}
df_custom_fill = df.fillna(fill_values)
print("\n2. Fill with custom values per column:")
print(df_custom_fill)

# 3. Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\n3. Forward fill (ffill):")
print(df_ffill)

# 4. Fill numeric columns with their means
df_mean = df.copy()
for col in ['A', 'B', 'C', 'D']:
    df_mean[col] = df_mean[col].fillna(df_mean[col].mean())  # assign back instead of using inplace=True on a column
print("\n4. Fill numeric columns with their means:")
print(df_mean)

Output

Original DataFrame:
     A     B     C     D     E
0  1.0   6.0  10.0   NaN     a
1  2.0   NaN  11.0   NaN     b
2  NaN   8.0  12.0   NaN  None
3  4.0   9.0   NaN   NaN     d
4  5.0   NaN  14.0   NaN     e
5  8.0  15.0  20.0  15.0     h

1. Fill all missing values with 0:
     A     B     C     D  E
0  1.0   6.0  10.0   0.0  a
1  2.0   0.0  11.0   0.0  b
2  0.0   8.0  12.0   0.0  0
3  4.0   9.0   0.0   0.0  d
4  5.0   0.0  14.0   0.0  e
5  8.0  15.0  20.0  15.0  h

2. Fill with custom values per column:
     A     B      C     D        E
0  1.0   6.0   10.0   0.0        a
1  2.0  -1.0   11.0   0.0        b
2  0.0   8.0   12.0   0.0  missing
3  4.0   9.0  999.0   0.0        d
4  5.0  -1.0   14.0   0.0        e
5  8.0  15.0   20.0  15.0        h

3. Forward fill (ffill):
     A     B     C     D  E
0  1.0   6.0  10.0   NaN  a
1  2.0   6.0  11.0   NaN  b
2  2.0   8.0  12.0   NaN  b
3  4.0   9.0  12.0   NaN  d
4  5.0   9.0  14.0   NaN  e
5  8.0  15.0  20.0  15.0  h

4. Fill numeric columns with their means:
     A     B     C     D     E
0  1.0   6.0  10.0  15.0     a
1  2.0   9.5  11.0  15.0     b
2  4.0   8.0  12.0  15.0  None
3  4.0   9.0  13.4  15.0     d
4  5.0   9.5  14.0  15.0     e
5  8.0  15.0  20.0  15.0     h

Discussion

  • The choice of fill method should be based on the nature of your data and the analysis you’re conducting.

  • Forward fill and backward fill are especially useful for time series when you want to carry values forward or backward.

  • Statistical measures like mean, median, or mode are common choices for filling numeric data.

  • Remember that filling modifies your data - document your approach and consider its impact on the analysis.

  • For categorical data, consider whether a missing value has meaning before deciding how to fill it (see the sketch below).
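
A short sketch (small made-up DataFrame, not the lesson's example) of two strategies mentioned above: filling a skewed numeric column with its median and a categorical column with an explicit 'missing' label.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [10.0, np.nan, 12.0, 100.0],   # skewed, so median is safer than mean
    'color': ['red', None, 'blue', 'red'],  # categorical with a missing entry
})

df['price'] = df['price'].fillna(df['price'].median())  # median of [10, 12, 100] = 12.0
df['color'] = df['color'].fillna('missing')             # keep missingness visible as its own category

print(df)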

Dropping Missing Data

Key Parameters for dropna():

| Parameter | Purpose                                    | Values                     |
|-----------|--------------------------------------------|----------------------------|
| axis      | Drop rows or columns                       | 0 for rows, 1 for columns  |
| how       | Condition for dropping                     | ‘any’ (default) or ‘all’   |
| thresh    | Minimum non-NaN values required to keep    | integer                    |
| subset    | Columns to consider when checking for NaN  | list of column names       |

Demo

Dropping Rows and Columns:

  • Drop rows with any missing values

  • Drop columns with any missing values

  • Drop columns where all values are missing

  • Keep rows with at least n non-NaN values

  • Drop rows with missing values in specific columns

import pandas as pd
import numpy as np

data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, np.nan],
    'C': [10, 11, 12, np.nan, 14],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'E': ['a', 'b', None, 'd', 'e']  # None also represents missing value
}
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
print(df)

# 1. Drop rows with any missing values
df_drop_rows = df.dropna()
print("\n1. After dropping rows with any missing values:")
print(df_drop_rows)

# 2. Drop columns with any missing values
df_drop_cols = df.dropna(axis=1)
print("\n2. After dropping columns with any missing values:")
print(df_drop_cols)

# 3. Drop columns where all values are missing
df_drop_all_na_cols = df.dropna(axis=1, how='all')
print("\n3. After dropping columns where all values are missing:")
print(df_drop_all_na_cols)

# 4. Keep rows with at least 3 non-NaN values
df_thresh = df.dropna(thresh=3)
print("\n4. Keep rows with at least 3 non-NaN values:")
print(df_thresh)

# 5. Drop rows with missing values in specific columns
df_subset = df.dropna(subset=['A', 'C'])
print("\n5. Drop rows with NaN in columns A or C:")
print(df_subset)

Output

Original DataFrame:
     A    B     C   D     E
0  1.0  6.0  10.0 NaN     a
1  2.0  NaN  11.0 NaN     b
2  NaN  8.0  12.0 NaN  None
3  4.0  9.0   NaN NaN     d
4  5.0  NaN  14.0 NaN     e

1. After dropping rows with any missing values:
Empty DataFrame
Columns: [A, B, C, D, E]
Index: []

2. After dropping columns with any missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

3. After dropping columns where all values are missing:
     A    B     C     E
0  1.0  6.0  10.0     a
1  2.0  NaN  11.0     b
2  NaN  8.0  12.0  None
3  4.0  9.0   NaN     d
4  5.0  NaN  14.0     e

4. Keep rows with at least 3 non-NaN values:
     A    B     C   D  E
0  1.0  6.0  10.0 NaN  a
1  2.0  NaN  11.0 NaN  b
3  4.0  9.0   NaN NaN  d
4  5.0  NaN  14.0 NaN  e

5. Drop rows with NaN in columns A or C:
     A    B     C   D  E
0  1.0  6.0  10.0 NaN  a
1  2.0  NaN  11.0 NaN  b
4  5.0  NaN  14.0 NaN  e

Discussion

  • Dropping is the simplest strategy, but it can lead to data loss - use it cautiously (see the sketch after this list).

  • The how parameter is critical - ‘any’ drops far more aggressively than ‘all’ (a single missing value is enough to remove a row or column).

  • The thresh parameter lets you keep rows with a minimum amount of data.

  • The subset parameter allows targeted dropping based on specific columns.

  • Dropping is often appropriate when missing values are rare, or when you need a complete dataset for specific analyses.

  • Be aware of potential bias introduced by dropping - if missing data correlates with certain characteristics, dropping can skew results.
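
Before dropping, it helps to quantify how much data you would lose; a minimal sketch, assuming a small made-up DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, np.nan],
})

n_total = len(df)
n_complete = len(df.dropna())  # rows that would survive dropna()
pct_lost = 100 * (n_total - n_complete) / n_total

print(f"dropna() would remove {pct_lost:.0f}% of rows ({n_total - n_complete} of {n_total})")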

Key Takeaways

Keypoints

  1. Always detect first: Understand the pattern and extent of missing values before taking action (a closing sketch follows these takeaways)

  2. Choose the right strategy:

    • Fill when you can reasonably infer values or need to preserve data

    • Drop when missing data is minimal or would introduce bias if filled

  3. Document your approach: Your handling choices affect analysis results
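
As a closing sketch (again with made-up data), one possible detect-then-decide workflow that combines the three operations from this lesson:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 31, 40, np.nan],
    'city': ['NY', 'LA', None, 'NY', 'SF'],
    'score': [np.nan] * 5,  # entirely missing column
})

# 1. Detect: percentage of missing values per column
print((df.isna().mean() * 100).round(1))

# 2. Decide: drop the column that is entirely missing, fill the rest
df = df.dropna(axis=1, how='all')
df['age'] = df['age'].fillna(df['age'].mean())  # mean of [25, 31, 40] = 32.0
df['city'] = df['city'].fillna('unknown')

print(df)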