NumPy Boolean Masking and Filtering

Objectives

  1. Create boolean masks for array filtering using comparison operators

  2. Apply boolean masks to select specific elements from arrays

  3. Combine multiple conditions using logical operators (&, |, ~)

  4. Use np.where() to find indices where conditions are met

  5. Apply np.where() for conditional value assignment

  6. Implement np.isin() to check array membership

  7. Apply these techniques to solve common data analysis problems

Instructor note

  • Teaching : 10 min

  • Demo: 5 min

Introduction to Advanced Indexing

When working with data, we often need to focus on specific elements that meet certain criteria. NumPy provides elegant and efficient ways to accomplish this through:

  1. Boolean masking

  2. The np.where() function

  3. The np.isin() function

Let’s explore each technique in detail.

Boolean Masking: The Concept

Boolean masking is a fundamental technique in NumPy that allows us to filter arrays based on conditions. The process happens in two steps:

Step 1: Create a Boolean Mask:

  • We apply a condition to an array

  • This produces a new array of the same shape filled with True and False values

  • Elements that satisfy our condition are marked as True

  • Elements that don’t satisfy our condition are marked as False

Step 2: Apply the Mask:

  • We use this boolean array to index into our original array

  • Only elements corresponding to True values are selected

Let’s see this in action:

Demo

import numpy as np

# Create a sample array
data = np.array([1, 4, 2, 5, 3])
print("Original array:", data)

# Create a boolean mask for elements greater than 3
mask = data > 3
print("Boolean mask (data > 3):", mask)
# This produces: [False, True, False, True, False]

# Apply the mask to select elements
selected_data = data[mask]
print("Selected elements:", selected_data)
# This produces: [4, 5]

## Elegant approach -  mask array has the exact same shape as data array
## Each position containing information about whether that element meets our criteria

Combining Multiple Conditions

We can combine multiple conditions using logical operators:

  • & for logical AND

  • | for logical OR

  • ~ for logical NOT

Additional info

Additional notes: Combining Multiple Conditions
# Creating a 2D array for demonstration
arr = np.array([[5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]])

# Elements greater than 20 AND less than 40
mask = (arr > 20) & (arr < 40)
print("Elements between 20 and 40:", arr[mask])
# This produces: [25 30 35]

# Elements less than 15 OR greater than 40
mask = (arr < 15) | (arr > 40)
print("Elements less than 15 or greater than 40:", arr[mask])
# This produces: [ 5 10 45]

Important: When combining conditions, always use parentheses around each individual condition to ensure proper precedence.

Using np.where(): Finding Positions

The np.where() function gives us even more capabilities. In its simplest form, it returns the indices where a condition is True:

Demo

# Create an array with a sequence
data = np.arange(0, 20, 3)  # [0, 3, 6, 9, 12, 15, 18]
print("Original array:", data)

# Find indices where elements are even
indices = np.where(data % 2 == 0)
print("Indices of even elements:", indices[0])
# This produces: [0, 2, 4, 6]

# Use these indices to get the actual values
even_elements = data[indices]
print("Even elements:", even_elements)
# This produces: [ 0,  6, 12, 18]

Using np.where(): Conditional Assignment

The real power of np.where() comes from its three-argument form:

np.where(condition, x, y)

This works like a vectorized if-else statement:

  • Where the condition is True, take values from x

  • Where the condition is False, take values from y

Additional info

Additional notes: `np.where(cond, x, y)`

Demo

# Original array: [0, 3, 6, 9, 12, 15, 18]

# Replace odd numbers with zeros
result = np.where(data % 2 == 0, data, 0)
print("Even numbers preserved, odd numbers replaced with 0:", result)
# This produces: [ 0,  0,  6,  0, 12,  0, 18]

# Another example: create an array that shows whether each element is even or odd
labels = np.where(data % 2 == 0, "even", "odd")
print("Labels for each element:", labels)
# This produces: ['even' 'odd' 'even' 'odd' 'even' 'odd' 'even']

This is much more concise and efficient than using loops or other conditional constructs.

np.isin() Function

  • np.isin() useful when we have a specific set of values we’re interested in.

Additional info

Additional notes: The `np.isin()` Function

np.isin():

The np.isin() function checks whether elements in one array are present in another array. It creates a boolean mask that we can use for filtering:

# Original array: [0, 3, 6, 9, 12, 15, 18]

# Check which elements are in a set of values
test_values = [0, 6, 15]
mask = np.isin(data, test_values)
print("Elements that are in test_values:", data[mask])
# This produces: [ 0,  6, 15]
Additional notes: Practical Applications

Practical Applications:

These techniques are foundational for data analysis tasks:

  1. Data Cleaning: Filter out missing or invalid values

    clean_data = data[~np.isnan(data)]  # Remove NaN values
    
  2. Feature Selection: Extract data points that meet specific criteria

    high_importance = data[data > threshold]
    
  3. Conditional Transformations: Apply different operations to different elements

    normalized = np.where(data > 0, data/data.max(), data/abs(data.min()))
    
Additional notes: Exercise 2 & 3

Exercise

Exercise 1: NumPy Boolean Masking and Advanced Filtering:

Create a NumPy array of 20 random integers between 0 and 100. Then:

np.random.seed(42)  # for reproducibility
numbers = np.random.randint(0, 101, 20)
  • Create a boolean mask to identify all numbers divisible by 7

  • Use the mask to extract these numbers

  • Count how many numbers are divisible by 7

Solution

import numpy as np

# Create an array of 20 random integers between 0 and 100
np.random.seed(42)  # for reproducibility
numbers = np.random.randint(0, 101, 20)
print("Original array:", numbers)

# Create a boolean mask for numbers divisible by 7
mask = numbers % 7 == 0
print("Boolean mask:", mask)

# Extract numbers divisible by 7
divisible_by_7 = numbers[mask]
print("Numbers divisible by 7:", divisible_by_7)

# Count how many numbers are divisible by 7
count = np.sum(mask)  # True values are treated as 1, False as 0
print(f"Count of numbers divisible by 7: {count}")

Output

Original array: [51 92 14 71 60 20 82 86 74 74 87 99 23  2 21 52  1 87 29 37]
Boolean mask: [False False  True False False False False False False False False False
 False False  True False False False False False]
Numbers divisible by 7: [14 21]
Count of numbers divisible by 7: 2

Exercise 2 - np.where() for Conditional Assignment:

Create a 4x4 matrix of random integers between 1 and 20. Then:

np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))
  • Use np.where() to replace all odd numbers with -1 while keeping even numbers unchanged

Exercise 2 - Solution:

# Create a 4x4 matrix of random integers between 1 and 20
np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))
print("Original matrix:")
print(matrix)

# Replace odd numbers with -1, keep even numbers
odd_replaced = np.where(matrix % 2 == 0, matrix, -1)
print("\nMatrix with odd numbers replaced by -1:")
print(odd_replaced)

Output

Original matrix:
[[ 7 20 15 11]
 [ 8  7 19 11]
 [11  4  8  3]
 [ 2 12  6  2]]

Matrix with odd numbers replaced by -1:
[[-1 20 -1 -1]
 [ 8 -1 -1 -1]
 [-1  4  8 -1]
 [ 2 12  6  2]]

Exercise 3 - DNA Sequence Analysis:

You are given a DNA sequence as a NumPy array of characters (A, T, G, C).

  • Create a random DNA sequence of length 50 using np.random.choice(['A', 'T', 'G', 'C'], 50)

  • Use boolean masking to Count the number of each nucleotide (A, T, G, C)

Exercise 3 - Solution:

import numpy as np

# Create a random DNA sequence
np.random.seed(42)  # for reproducibility
dna_sequence = np.random.choice(['A', 'T', 'G', 'C'], 50)
print("DNA sequence:", ''.join(dna_sequence))

# Count the number of each nucleotide
a_count = np.sum(dna_sequence == 'A')
t_count = np.sum(dna_sequence == 'T')
g_count = np.sum(dna_sequence == 'G')
c_count = np.sum(dna_sequence == 'C')

print(f"A: {a_count}, T: {t_count}, G: {g_count}, C: {c_count}")

Output

DNA sequence: GCAGGCAAGTGGGGCACCCGTATCCTTTCCAACTTACAAGGGTCCCCGTT
A: 10, T: 11, G: 13, C: 16

Key Takeaways

Keypoints

  • Boolean masking and np.where() operations are highly optimized in NumPy. They:

    • Avoid explicit loops in Python

    • Execute at C-speed under the hood

    • Allow vectorized operations on large datasets

  • For large datasets, these techniques are drastically faster than traditional iteration.

  • Boolean masking provides an intuitive way to filter arrays based on conditions

  • np.where() in its single-argument form finds indices where conditions are true

  • np.where(condition, x, y) acts as a vectorized if-else statement

  • np.isin() lets us filter based on membership in a set of values