NumPy Boolean Masking and Filtering

Objectives

Create boolean masks for array filtering using comparison operators
Apply boolean masks to select specific elements from arrays
Combine multiple conditions using logical operators (&, |, ~)
Use np.where() to find indices where conditions are met
Apply np.where() for conditional value assignment
Implement np.isin() to check array membership
Apply these techniques to solve common data analysis problems

Instructor note

Teaching : 10 min
Demo: 5 min

Introduction to Advanced Indexing

When working with data, we often need to focus on specific elements that meet certain criteria. NumPy provides elegant and efficient ways to accomplish this through:

Boolean masking
The np.where() function
The np.isin() function

Let’s explore each technique in detail.

Boolean Masking: The Concept

Boolean masking is a fundamental technique in NumPy that allows us to filter arrays based on conditions. The process happens in two steps:

Step 1: Create a Boolean Mask:

We apply a condition to an array
This produces a new array of the same shape filled with True and False values
Elements that satisfy our condition are marked as True
Elements that don’t satisfy our condition are marked as False

Step 2: Apply the Mask:

We use this boolean array to index into our original array
Only elements corresponding to True values are selected

Let’s see this in action:

Demo

import numpy as np

# Create a sample array
data = np.array([1, 4, 2, 5, 3])
print("Original array:", data)

# Create a boolean mask for elements greater than 3
mask = data > 3
print("Boolean mask (data > 3):", mask)
# This produces: [False, True, False, True, False]

# Apply the mask to select elements
selected_data = data[mask]
print("Selected elements:", selected_data)
# This produces: [4, 5]

## Elegant approach -  mask array has the exact same shape as data array
## Each position containing information about whether that element meets our criteria

Output

Original array: [1 4 2 5 3]
Boolean mask (data > 3): [False  True False  True False]
Selected elements: [4 5]

Combining Multiple Conditions

We can combine multiple conditions using logical operators:

& for bitwise AND
| for bitwise OR
~ for bitwise NOT

Additional info

Additional notes: Combining Multiple Conditions

# Creating a 2D array for demonstration
arr = np.array([[5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]])

# Elements greater than 20 AND less than 40
mask = (arr > 20) & (arr < 40)
print("Elements between 20 and 40:", arr[mask])
# This produces: [25 30 35]

# Elements less than 15 OR greater than 40
mask = (arr < 15) | (arr > 40)
print("Elements less than 15 or greater than 40:", arr[mask])
# This produces: [ 5 10 45]

Important: When combining conditions, always use parentheses around each individual condition to ensure proper precedence.

Using `np.where()`: Finding Positions

The np.where() function gives us even more capabilities. In its simplest form, it returns the indices where a condition is True:

Demo

# Create an array with a sequence
data = np.arange(0, 20, 3)  # [0, 3, 6, 9, 12, 15, 18]
print("Original array:", data)

# Find indices where elements are even
indices = np.where(data % 2 == 0)
print("Indices of even elements:", indices[0])
# This produces: [0, 2, 4, 6]

# Use these indices to get the actual values
even_elements = data[indices]
print("Even elements:", even_elements)
# This produces: [ 0,  6, 12, 18]

Output

Original array: [ 0  3  6  9 12 15 18]
Indices of even elements: [0 2 4 6]
Even elements: [ 0  6 12 18]

The result of np.where() is a tuple of arrays, one for each dimension of the input array. Since we’re working with a 1D array here, we access the first (and only) element of this tuple with indices[0].

Using np.where(): Conditional Assignment

The real power of np.where() comes from its three-argument form:

np.where(condition, x, y)

This works like a vectorized if-else statement:

Where the condition is True, take values from x
Where the condition is False, take values from y

Additional info

Additional notes: `np.where(cond, x, y)`

Demo

# Original array: [0, 3, 6, 9, 12, 15, 18]

# Replace odd numbers with zeros
result = np.where(data % 2 == 0, data, 0)
print("Even numbers preserved, odd numbers replaced with 0:", result)
# This produces: [ 0,  0,  6,  0, 12,  0, 18]

# Another example: create an array that shows whether each element is even or odd
labels = np.where(data % 2 == 0, "even", "odd")
print("Labels for each element:", labels)
# This produces: ['even' 'odd' 'even' 'odd' 'even' 'odd' 'even']

This is much more concise and efficient than using loops or other conditional constructs.

`np.isin()` Function

np.isin() useful when we have a specific set of values we’re interested in.

Additional info

Additional notes: The `np.isin()` Function

np.isin():

The np.isin() function checks whether elements in one array are present in another array. It creates a boolean mask that we can use for filtering:

# Original array: [0, 3, 6, 9, 12, 15, 18]

# Check which elements are in a set of values
test_values = [0, 6, 15]
mask = np.isin(data, test_values)
print("Elements that are in test_values:", data[mask])
# This produces: [ 0,  6, 15]

Additional notes: Practical Applications

Practical Applications:

These techniques are foundational for data analysis tasks:

Data Cleaning: Filter out missing or invalid values

clean_data = data[~np.isnan(data)]  # Remove NaN values

Feature Selection: Extract data points that meet specific criteria
```
high_importance = data[data > threshold]
```

Conditional Transformations: Apply different operations to different elements

normalized = np.where(data > 0, data/data.max(), data/abs(data.min()))

Additional notes: Exercise 2 & 3

Exercise

Exercise 1: NumPy Boolean Masking and Advanced Filtering:

Create a NumPy array of 20 random integers between 0 and 100. Then:

np.random.seed(42)  # for reproducibility
numbers = np.random.randint(0, 101, 20)

Create a boolean mask to identify all numbers divisible by 7
Use the mask to extract these numbers
Count how many numbers are divisible by 7

Solution

import numpy as np

# Create an array of 20 random integers between 0 and 100
np.random.seed(42)  # for reproducibility
numbers = np.random.randint(0, 101, 20)
print("Original array:", numbers)

# Create a boolean mask for numbers divisible by 7
mask = numbers % 7 == 0
print("Boolean mask:", mask)

# Extract numbers divisible by 7
divisible_by_7 = numbers[mask]
print("Numbers divisible by 7:", divisible_by_7)

# Count how many numbers are divisible by 7
count = np.sum(mask)  # True values are treated as 1, False as 0
print(f"Count of numbers divisible by 7: {count}")

Output

Original array: [51 92 14 71 60 20 82 86 74 74 87 99 23  2 21 52  1 87 29 37]
Boolean mask: [False False  True False False False False False False False False False
 False False  True False False False False False]
Numbers divisible by 7: [14 21]
Count of numbers divisible by 7: 2

Exercise 2 - np.where() for Conditional Assignment:

Create a 4x4 matrix of random integers between 1 and 20. Then:

np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))

Use np.where() to replace all odd numbers with -1 while keeping even numbers unchanged

Exercise 2 - Solution:

# Create a 4x4 matrix of random integers between 1 and 20
np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))
print("Original matrix:")
print(matrix)

# Replace odd numbers with -1, keep even numbers
odd_replaced = np.where(matrix % 2 == 0, matrix, -1)
print("\nMatrix with odd numbers replaced by -1:")
print(odd_replaced)

Output

Original matrix:
[[ 7 20 15 11]
 [ 8  7 19 11]
 [11  4  8  3]
 [ 2 12  6  2]]

Matrix with odd numbers replaced by -1:
[[-1 20 -1 -1]
 [ 8 -1 -1 -1]
 [-1  4  8 -1]
 [ 2 12  6  2]]

Exercise 3 - DNA Sequence Analysis:

You are given a DNA sequence as a NumPy array of characters (A, T, G, C).

Create a random DNA sequence of length 50 using np.random.choice(['A', 'T', 'G', 'C'], 50)
Use boolean masking to Count the number of each nucleotide (A, T, G, C)

Exercise 3 - Solution:

import numpy as np

# Create a random DNA sequence
np.random.seed(42)  # for reproducibility
dna_sequence = np.random.choice(['A', 'T', 'G', 'C'], 50)
print("DNA sequence:", ''.join(dna_sequence))

# Count the number of each nucleotide
a_count = np.sum(dna_sequence == 'A')
t_count = np.sum(dna_sequence == 'T')
g_count = np.sum(dna_sequence == 'G')
c_count = np.sum(dna_sequence == 'C')

print(f"A: {a_count}, T: {t_count}, G: {g_count}, C: {c_count}")

Output

DNA sequence: GCAGGCAAGTGGGGCACCCGTATCCTTTCCAACTTACAAGGGTCCCCGTT
A: 10, T: 11, G: 13, C: 16

Key Takeaways

Keypoints

Boolean masking and np.where() operations are highly optimized in NumPy. They:
- Avoid explicit loops in Python
- Execute at C-speed under the hood
- Allow vectorized operations on large datasets
For large datasets, these techniques are drastically faster than traditional iteration.
Boolean masking provides an intuitive way to filter arrays based on conditions
np.where() in its single-argument form finds indices where conditions are true
np.where(condition, x, y) acts as a vectorized if-else statement
np.isin() lets us filter based on membership in a set of values

NumPy Boolean Masking and Filtering

Introduction to Advanced Indexing

Boolean Masking: The Concept

Using np.where(): Finding Positions

np.isin() Function

Key Takeaways

Using `np.where()`: Finding Positions

`np.isin()` Function