NumPy Boolean Masking and Filtering
Objectives
Create boolean masks for array filtering using comparison operators
Apply boolean masks to select specific elements from arrays
Combine multiple conditions using logical operators (&, |, ~)
Use np.where() to find indices where conditions are met
Apply np.where() for conditional value assignment
Use np.isin() to check array membership
Apply these techniques to solve common data analysis problems
Instructor note
Teaching: 10 min
Demo: 5 min
Introduction to Advanced Indexing
When working with data, we often need to focus on specific elements that meet certain criteria. NumPy provides elegant and efficient ways to accomplish this through:
Boolean masking
The np.where() function
The np.isin() function
Let’s explore each technique in detail.
Boolean Masking: The Concept
Boolean masking is a fundamental technique in NumPy that allows us to filter arrays based on conditions. The process happens in two steps:
Step 1: Create a Boolean Mask:
We apply a condition to an array
This produces a new array of the same shape filled with True and False values
Elements that satisfy our condition are marked as True
Elements that don’t satisfy our condition are marked as False
Step 2: Apply the Mask:
We use this boolean array to index into our original array
Only elements corresponding to True values are selected
Let’s see this in action:
Demo
import numpy as np
# Create a sample array
data = np.array([1, 4, 2, 5, 3])
print("Original array:", data)
# Create a boolean mask for elements greater than 3
mask = data > 3
print("Boolean mask (data > 3):", mask)
# This produces: [False, True, False, True, False]
# Apply the mask to select elements
selected_data = data[mask]
print("Selected elements:", selected_data)
# This produces: [4, 5]
# The mask has exactly the same shape as the data array;
# each position records whether that element meets our condition
Output
Original array: [1 4 2 5 3]
Boolean mask (data > 3): [False True False True False]
Selected elements: [4 5]
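The same two steps carry over to arrays with more than one dimension. The following is a minimal sketch; the array name grid and its values are made up for illustration. It shows that the mask keeps the shape of the original array, while the selection comes back as a flat 1D array.

```python
import numpy as np

# Hypothetical 2D array used only for illustration
grid = np.array([[1, 4, 2],
                 [5, 3, 6]])

# Step 1: the mask has the same shape as the array it was built from
mask = grid > 3
print("Mask shape:", mask.shape)        # (2, 3)

# Step 2: indexing with the mask returns the matching elements as a 1D array,
# in row-major order
selected = grid[mask]
print("Selected elements:", selected)   # [4 5 6]
```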
Combining Multiple Conditions
We can combine multiple conditions using logical operators:
& for logical AND
| for logical OR
~ for logical NOT
Additional info
Additional notes: Combining Multiple Conditions
# Creating a 2D array for demonstration
arr = np.array([[5, 10, 15],
                [20, 25, 30],
                [35, 40, 45]])
# Elements greater than 20 AND less than 40
mask = (arr > 20) & (arr < 40)
print("Elements between 20 and 40:", arr[mask])
# This produces: [25 30 35]
# Elements less than 15 OR greater than 40
mask = (arr < 15) | (arr > 40)
print("Elements less than 15 or greater than 40:", arr[mask])
# This produces: [ 5 10 45]
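Important: When combining conditions, always put parentheses around each individual condition. The & and | operators bind more tightly than comparisons such as > and <, so an expression like arr > 20 & arr < 40 is not evaluated the way it reads.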
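The examples above cover & and | but not ~. As a small sketch reusing the same 3x3 array, the block below inverts a mask with ~ and then shows what happens when the parentheses are left out. The variable names between, outside, and error_raised are chosen for this illustration only.

```python
import numpy as np

arr = np.array([[5, 10, 15],
                [20, 25, 30],
                [35, 40, 45]])

# ~ inverts a mask: keep everything that is NOT strictly between 20 and 40
between = (arr > 20) & (arr < 40)
outside = arr[~between]
print("Elements outside (20, 40):", outside)  # [ 5 10 15 20 40 45]

# Without parentheses, & is applied first, so Python reads the expression as
# arr > (20 & arr) < 40. That chained comparison needs a single True/False
# value from a whole array, which NumPy refuses to provide.
error_raised = False
try:
    arr > 20 & arr < 40
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```

The Python keywords and, or, and not fail on arrays for the same reason, which is why the element-wise operators are used instead.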
Using np.where(): Finding Positions
The np.where() function gives us even more capabilities. In its simplest form, it returns the indices where a condition is True:
Demo
# Create an array with a sequence
data = np.arange(0, 20, 3) # [0, 3, 6, 9, 12, 15, 18]
print("Original array:", data)
# Find indices where elements are even
indices = np.where(data % 2 == 0)
print("Indices of even elements:", indices[0])
# This produces: [0, 2, 4, 6]
# Use these indices to get the actual values
even_elements = data[indices]
print("Even elements:", even_elements)
# This produces: [ 0, 6, 12, 18]
Output
Original array: [ 0 3 6 9 12 15 18]
Indices of even elements: [0 2 4 6]
Even elements: [ 0 6 12 18]
The result of np.where() is a tuple of arrays, one for each dimension of the input array. Since we’re working with a 1D array here, we access the first (and only) element of this tuple with indices[0].
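To make the "one array per dimension" point concrete, here is a short sketch on a 2D array. The array values and the names rows, cols, and coords are assumptions for illustration.

```python
import numpy as np

arr = np.array([[5, 10, 15],
                [20, 25, 30],
                [35, 40, 45]])

# For a 2D input, np.where returns two index arrays: one for rows, one for columns
rows, cols = np.where(arr > 25)
print("Row indices:", rows)       # [1 2 2 2]
print("Column indices:", cols)    # [2 0 1 2]

# Pair them up to get (row, column) coordinates of each match
coords = list(zip(rows.tolist(), cols.tolist()))
print("Coordinates:", coords)     # [(1, 2), (2, 0), (2, 1), (2, 2)]

# Indexing with both arrays recovers the matching values
values = arr[rows, cols]
print("Values:", values)          # [30 35 40 45]
```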
Using np.where(): Conditional Assignment
The real power of np.where() comes from its three-argument form:
np.where(condition, x, y)
This works like a vectorized if-else statement:
Where the condition is True, take values from x
Where the condition is False, take values from y
Additional info
Additional notes: `np.where(cond, x, y)`
Demo
# Original array: [0, 3, 6, 9, 12, 15, 18]
# Replace odd numbers with zeros
result = np.where(data % 2 == 0, data, 0)
print("Even numbers preserved, odd numbers replaced with 0:", result)
# This produces: [ 0, 0, 6, 0, 12, 0, 18]
# Another example: create an array that shows whether each element is even or odd
labels = np.where(data % 2 == 0, "even", "odd")
print("Labels for each element:", labels)
# This produces: ['even' 'odd' 'even' 'odd' 'even' 'odd' 'even']
This is much more concise than writing an explicit loop with if/else, and because the work happens inside NumPy it is usually much faster on large arrays.
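For comparison, the sketch below writes the "replace odd numbers with zero" step three ways: as an explicit Python loop, with np.where(), and by assigning through a boolean mask. The names looped, vectorized, and masked are chosen for this illustration; all three are expected to give the same values.

```python
import numpy as np

data = np.arange(0, 20, 3)  # [0, 3, 6, 9, 12, 15, 18]

# Explicit Python loop: keep even values, replace odd values with 0
looped = []
for value in data:
    if value % 2 == 0:
        looped.append(int(value))
    else:
        looped.append(0)

# Vectorized equivalent with the three-argument form
vectorized = np.where(data % 2 == 0, data, 0)

# Alternative: copy the array and overwrite the odd positions through a mask
masked = data.copy()
masked[data % 2 != 0] = 0

print(looped)               # [0, 0, 6, 0, 12, 0, 18]
print(vectorized.tolist())  # [0, 0, 6, 0, 12, 0, 18]
print(masked.tolist())      # [0, 0, 6, 0, 12, 0, 18]
```

np.where() returns a new array and leaves data untouched, while mask assignment modifies whichever array it is applied to, which is why the sketch works on a copy.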
np.isin() Function
np.isin() is useful when we have a specific set of values we’re interested in.
Additional info
Additional notes: The `np.isin()` Function
np.isin():
The np.isin() function checks, element by element, whether the values in one array are present in another array. It creates a boolean mask that we can use for filtering:
# Original array: [0, 3, 6, 9, 12, 15, 18]
# Check which elements are in a set of values
test_values = [0, 6, 15]
mask = np.isin(data, test_values)
print("Elements that are in test_values:", data[mask])
# This produces: [ 0, 6, 15]
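The opposite question, which elements are not in the set, comes up just as often. A minimal sketch using the invert parameter of np.isin(), with the same data and test_values as above (not_in and same are names made up here):

```python
import numpy as np

data = np.arange(0, 20, 3)  # [0, 3, 6, 9, 12, 15, 18]
test_values = [0, 6, 15]

# invert=True flips the result: True where the element is NOT in test_values
not_in = np.isin(data, test_values, invert=True)
print("Elements not in test_values:", data[not_in])  # [ 3  9 12 18]

# Negating the regular mask with ~ gives the same selection
same = data[~np.isin(data, test_values)]
```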
Additional notes: Practical Applications
Practical Applications:
These techniques are foundational for data analysis tasks:
Data Cleaning: Filter out missing or invalid values
clean_data = data[~np.isnan(data)] # Remove NaN values
Feature Selection: Extract data points that meet specific criteria
high_importance = data[data > threshold] # threshold is a cutoff value you choose
Conditional Transformations: Apply different operations to different elements
normalized = np.where(data > 0, data/data.max(), data/abs(data.min()))
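The one-liners above assume that data and threshold already exist. The following is a self-contained sketch of the same three patterns; the readings array and the threshold value are made-up example inputs, not part of the lesson's data.

```python
import numpy as np

# Hypothetical measurements; np.nan marks a missing reading
readings = np.array([2.0, np.nan, -4.0, 8.0, np.nan, -1.0])
threshold = 1.5  # assumed cutoff, chosen only for this example

# Data cleaning: drop NaN values
clean = readings[~np.isnan(readings)]
print("Clean:", clean.tolist())              # [2.0, -4.0, 8.0, -1.0]

# Selection: keep values above the cutoff
high = clean[clean > threshold]
print("Above threshold:", high.tolist())     # [2.0, 8.0]

# Conditional transformation: scale positives by the max, the rest by |min|
normalized = np.where(clean > 0, clean / clean.max(), clean / abs(clean.min()))
print("Normalized:", normalized.tolist())    # [0.25, -1.0, 1.0, -0.25]
```

Note that both branch expressions passed to np.where() are computed for every element before the selection happens, so a zero maximum or minimum would still trigger a division warning even for elements that end up using the other branch.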
Additional notes: Exercise 2 & 3
Exercise
Exercise 1: NumPy Boolean Masking and Advanced Filtering:
Create a NumPy array of 20 random integers between 0 and 100. Then:
np.random.seed(42) # for reproducibility
numbers = np.random.randint(0, 101, 20)
Create a boolean mask to identify all numbers divisible by 7
Use the mask to extract these numbers
Count how many numbers are divisible by 7
Solution
import numpy as np
# Create an array of 20 random integers between 0 and 100
np.random.seed(42) # for reproducibility
numbers = np.random.randint(0, 101, 20)
print("Original array:", numbers)
# Create a boolean mask for numbers divisible by 7
mask = numbers % 7 == 0
print("Boolean mask:", mask)
# Extract numbers divisible by 7
divisible_by_7 = numbers[mask]
print("Numbers divisible by 7:", divisible_by_7)
# Count how many numbers are divisible by 7
count = np.sum(mask) # True values are treated as 1, False as 0
print(f"Count of numbers divisible by 7: {count}")
Output
Original array: [51 92 14 71 60 20 82 86 74 74 87 99 23 2 21 52 1 87 29 37]
Boolean mask: [False False True False False False False False False False False False
False False True False False False False False]
Numbers divisible by 7: [14 21]
Count of numbers divisible by 7: 2
Exercise 2 - np.where() for Conditional Assignment:
Create a 4x4 matrix of random integers between 1 and 20. Then:
np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))
Use np.where() to replace all odd numbers with -1 while keeping even numbers unchanged
Exercise 2 - Solution:
# Create a 4x4 matrix of random integers between 1 and 20
np.random.seed(42)
matrix = np.random.randint(1, 21, (4, 4))
print("Original matrix:")
print(matrix)
# Replace odd numbers with -1, keep even numbers
odd_replaced = np.where(matrix % 2 == 0, matrix, -1)
print("\nMatrix with odd numbers replaced by -1:")
print(odd_replaced)
Output
Original matrix:
[[ 7 20 15 11]
 [ 8  7 19 11]
 [11  4  8  3]
 [ 2 12  6  2]]
Matrix with odd numbers replaced by -1:
[[-1 20 -1 -1]
 [ 8 -1 -1 -1]
 [-1  4  8 -1]
 [ 2 12  6  2]]
Exercise 3 - DNA Sequence Analysis:
You are given a DNA sequence as a NumPy array of characters (A, T, G, C).
Create a random DNA sequence of length 50 using np.random.choice(['A', 'T', 'G', 'C'], 50)
Use boolean masking to count the number of each nucleotide (A, T, G, C)
Exercise 3 - Solution:
import numpy as np
# Create a random DNA sequence
np.random.seed(42) # for reproducibility
dna_sequence = np.random.choice(['A', 'T', 'G', 'C'], 50)
print("DNA sequence:", ''.join(dna_sequence))
# Count the number of each nucleotide
a_count = np.sum(dna_sequence == 'A')
t_count = np.sum(dna_sequence == 'T')
g_count = np.sum(dna_sequence == 'G')
c_count = np.sum(dna_sequence == 'C')
print(f"A: {a_count}, T: {t_count}, G: {g_count}, C: {c_count}")
Output
DNA sequence: GCAGGCAAGTGGGGCACCCGTATCCTTTCCAACTTACAAGGGTCCCCGTT
A: 10, T: 11, G: 13, C: 16
Key Takeaways
Keypoints
Boolean masking and np.where() operations are highly optimized in NumPy. They:
Avoid explicit loops in Python
Execute at C-speed under the hood
Allow vectorized operations on large datasets
For large datasets, these techniques are drastically faster than traditional iteration.
Boolean masking provides an intuitive way to filter arrays based on conditions
np.where() in its single-argument form finds indices where conditions are true
np.where(condition, x, y) acts as a vectorized if-else statement
np.isin() lets us filter based on membership in a set of values