Array Indexing and Slicing
Objectives
Define and distinguish between indexing and slicing operations in NumPy arrays
Demonstrate proper syntax for accessing individual elements in 1D and 2D arrays using indexing
Extract ranges of elements using slicing techniques, including with negative indices and step parameters
Recognize and avoid common pitfalls when working with arrays, such as off-by-one errors and unintended modifications to original data
Instructor note
Teaching: 10 min
Demo: 5 min
Introduction
What are Indexing and Slicing?
Indexing is the process of accessing specific individual elements within a data structure
Uses square brackets with a single index value:
array[0]
Python and NumPy, like most programming languages, use zero-based indexing (the first element is at position 0)
Slicing is the process of extracting a subset or range of elements
Uses square brackets with a range specification:
array[start:stop:step]
In NumPy, basic slicing returns a view of the original data rather than a copy (changes made through the slice affect the original array)
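The view behaviour described above can be checked directly. The following is a minimal sketch with made-up example data (the array and variable names are illustrative, not part of the lesson); it uses `np.shares_memory` to confirm whether two arrays share the same underlying data.

```python
import numpy as np

# Hypothetical example data, for illustration only
seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])

window = seq[2:5]        # basic slice: a view onto seq
window[0] = 9            # writes through to the original array
print(seq)               # [0 1 9 3 0 0 1 2]

safe = seq[2:5].copy()   # independent copy of the same region
safe[0] = 7              # seq is not affected
print(seq)               # [0 1 9 3 0 0 1 2]

print(np.shares_memory(seq, window))  # True
print(np.shares_memory(seq, safe))    # False
```

Using `.copy()` costs extra memory, so it is usually worth it only when the slice will be modified and the original must stay intact.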
Why Indexing Matters in Bioinformatics:
Bioinformatics deals with large, complex biological datasets:
DNA/RNA sequences (can be millions of nucleotides long)
Protein sequences
Gene expression matrices (thousands of genes × dozens/hundreds of samples)
Phylogenetic trees
Molecular structures
Efficient data access is crucial for:
Sequence alignment and comparison
Identifying motifs or patterns
Analyzing specific regions of interest (e.g., genes, domains, binding sites)
Processing large-scale genomic or proteomic data
Statistical analysis across experimental conditions
NumPy Arrays in Bioinformatics
Common bioinformatics applications:
Storing sequence data as numeric arrays
Representing position weight matrices
Managing alignment scores
Handling gene expression matrices
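As one possible way to store sequence data as a numeric array, the sketch below converts a DNA string to integers and back. It assumes the A=0, C=1, G=2, T=3 encoding used in this lesson; `encode` and `decode` are hypothetical helper names, not NumPy functions, and the sketch does not handle ambiguous bases such as N.

```python
import numpy as np

# Assumed encoding: A=0, C=1, G=2, T=3
NUC_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}
INT_TO_NUC = np.array(["A", "C", "G", "T"])

def encode(seq):
    """Convert a DNA string to a 1D integer array (hypothetical helper)."""
    return np.array([NUC_TO_INT[base] for base in seq])

def decode(arr):
    """Convert an integer array back to a DNA string (hypothetical helper)."""
    # Indexing INT_TO_NUC with an integer array looks up one letter per code
    return "".join(INT_TO_NUC[arr])

encoded = encode("ACGTAACG")
print(encoded)          # [0 1 2 3 0 0 1 2]
print(decode(encoded))  # ACGTAACG
```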
1D Array Operations
Demo
1D Array Indexing
import numpy as np
# Example: DNA string converted to a numerical representation
# A=0, C=1, G=2, T=3
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2]) # "ACGTAACG"
# Single element access through indexing
print("Original array", dna_seq)
print("First element (0th index)", dna_seq[0]) # First nucleotide (0 = A)
print("Fourth element", dna_seq[3]) # Fourth nucleotide (3 = T)
print("Last element", dna_seq[-1]) # Last nucleotide (2 = G) using negative indexing
Output
Original array [0 1 2 3 0 0 1 2]
First element (0th index) 0
Fourth element 3
Last element 2
1D Array Slicing
print("Original array", dna_seq)
# Slicing: extracting subsequences
print("From second to fourth", dna_seq[1:4]) # From second to fourth nucleotide: array([1, 2, 3]) = "CGT"
print("First three nucleotides", dna_seq[:3]) # First three nucleotides: array([0, 1, 2]) = "ACG"
print("From sixth nucleotide to the end", dna_seq[5:]) # From sixth nucleotide to the end: array([0, 1, 2]) = "ACG"
# Slicing with negative indices
print("Last three nucleotides", dna_seq[-3:]) # Last three nucleotides: array([0, 1, 2]) = "ACG"
# Slicing with step
print("Every second nucleotide", dna_seq[::2]) # Every second nucleotide: array([0, 2, 0, 1]) = "AGAC"
# Reverse array
print("Reversed array", dna_seq[::-1]) # Reversed sequence: array([2, 1, 0, 0, 3, 2, 1, 0]) = "GCAATGCA"
Output
Original array [0 1 2 3 0 0 1 2]
From second to fourth [1 2 3]
First three nucleotides [0 1 2]
From sixth nucleotide to the end [0 1 2]
Last three nucleotides [0 1 2]
Every second nucleotide [0 2 0 1]
Reversed array [2 1 0 0 3 2 1 0]
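One behaviour worth seeing alongside these examples: a slice that runs past the end of the array is silently clamped, whereas an out-of-range index raises an error. The following short sketch reuses the demo array to show the difference.

```python
import numpy as np

dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # 8 elements, valid indices 0..7

print(dna_seq[5:100])   # [0 1 2]: the stop value is clamped to the array length
print(dna_seq[20:30])   # []: a slice entirely out of range is simply empty

try:
    dna_seq[8]          # indexing past the end is an error
except IndexError as err:
    print("IndexError:", err)
```

Because slicing never complains about out-of-range bounds, a wrong coordinate can return a shorter region than expected without any warning, so checking the length of the result is a useful habit.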
Real-world significance in bioinformatics:
Indexing:
Accessing specific nucleotide positions of interest
Retrieving expression values for particular genes
Referencing elements in position-specific scoring matrices
Slicing:
Extracting specific regions like promoters, exons, or binding sites
Identifying sequence motifs (e.g., restriction sites, protein domains)
Analyzing k-mers (subsequences of length k)
Creating sliding windows along DNA/protein sequences
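The k-mer and sliding-window uses listed above can be sketched with plain slicing. This is a minimal illustration, assuming the demo sequence and a hypothetical window length `k = 3`; it is not an optimized implementation.

```python
import numpy as np

dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"
k = 3  # hypothetical k-mer length

# A sequence of length n has n - k + 1 windows; each slice is a view of length k
kmers = [dna_seq[i:i + k] for i in range(len(dna_seq) - k + 1)]

for start, kmer in enumerate(kmers):
    print(start, kmer)
```

For long sequences, recent NumPy versions also provide `np.lib.stride_tricks.sliding_window_view`, which builds all windows as a single 2D view without a Python loop.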
More info
Additional notes: 2D Array Operations
2D Array Operations:
# Example: Gene expression matrix
# Rows = genes, Columns = experimental conditions
gene_expr = np.array([
[12.5, 10.2, 33.4, 7.8], # Gene 1 expression across 4 conditions
[45.1, 43.8, 29.2, 22.1], # Gene 2 expression
[8.7, 9.2, 12.3, 10.5], # Gene 3 expression
[67.2, 70.3, 68.7, 71.9] # Gene 4 expression
])
2D Array Indexing:
# Single element access: specific element at row, column
print(gene_expr[1, 2]) # Expression of Gene 2 in condition 3: 29.2
# Row indexing: accessing a specific row
print(gene_expr[0]) # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])
2D Array Slicing:
# Row slicing: expression profile of one gene across all conditions
print(gene_expr[0, :]) # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])
# Column slicing: expression of all genes in a specific condition
print(gene_expr[:, 1]) # All genes in condition 2: array([10.2, 43.8, 9.2, 70.3])
# Sub-matrix slicing: subset of genes in subset of conditions
print(gene_expr[0:2, 2:4])
# First 2 genes in conditions 3 and 4:
# array([[33.4, 7.8],
# [29.2, 22.1]])
# Strided slicing: every other gene, first two conditions
print(gene_expr[::2, :2])
# Genes 1 & 3, conditions 1 & 2:
# array([[12.5, 10.2],
# [8.7, 9.2]])
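A detail that often surprises newcomers is that an integer index removes an axis, while a slice of length one keeps it. The sketch below uses the same expression matrix to compare the resulting shapes.

```python
import numpy as np

gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],    # Gene 1
    [45.1, 43.8, 29.2, 22.1],   # Gene 2
    [8.7, 9.2, 12.3, 10.5],     # Gene 3
    [67.2, 70.3, 68.7, 71.9],   # Gene 4
])

col_1d = gene_expr[:, 1]     # integer index: the column axis is dropped
col_2d = gene_expr[:, 1:2]   # length-1 slice: the column axis is kept

print(col_1d.shape)  # (4,)
print(col_2d.shape)  # (4, 1)
```

The 2D form is handy when later code expects a genes × conditions matrix even for a single condition.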
Real-world significance in bioinformatics
Indexing:
Retrieving expression value for a specific gene in a specific condition
Accessing specific positions in sequence alignments
Finding interaction pairs in protein-protein interaction matrices
Slicing:
Comparing gene expression profiles across different tissues or time points
Analyzing subsets of genes after clustering
Extracting data for specific experiments or replicates
Processing sections of alignment score matrices
Analyzing specific regions in protein contact maps
Extracting protein domains from structure coordinate arrays
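As a small illustration of comparing expression across groups of conditions, the sketch below slices the matrix into two column blocks and averages each per gene. The grouping is an assumption made purely for the example: columns 0-1 are treated as "control" and columns 2-3 as "treated"; the lesson's data does not specify this.

```python
import numpy as np

gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],    # Gene 1
    [45.1, 43.8, 29.2, 22.1],   # Gene 2
    [8.7, 9.2, 12.3, 10.5],     # Gene 3
    [67.2, 70.3, 68.7, 71.9],   # Gene 4
])

# Assumption for illustration only: columns 0-1 are control, columns 2-3 are treated
control = gene_expr[:, :2]
treated = gene_expr[:, 2:]

# axis=1 averages across columns, giving one value per gene (row)
control_mean = control.mean(axis=1)
treated_mean = treated.mean(axis=1)

print("Control mean per gene:", control_mean)
print("Treated mean per gene:", treated_mean)
print("Difference (treated - control):", treated_mean - control_mean)
```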
Additional exercises
Additional notes: Exercises
Exercise
Exercise 1: DNA Sequence Analysis (2-3 minutes)
Given a DNA sequence represented as an array of numerical values (A=0, C=1, G=2, T=3):
import numpy as np
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 3]) # "ACGTAACGTTGCAAGT"
Tasks:
Extract the first 5 nucleotides
Extract the last 4 nucleotides
Extract every third nucleotide starting from the first position
Extract the subsequence from index 6 to index 10 (zero-based, inclusive)
Exercise 1 - Solution:
# 1. First 5 nucleotides
print("First 5 nucleotides:", dna_seq[:5])
# 2. Last 4 nucleotides
print("Last 4 nucleotides:", dna_seq[-4:])
# 3. Every third nucleotide
print("Every third nucleotide:", dna_seq[::3])
# 4. Subsequence from position 6 to 10
print("Subsequence pos 6-10:", dna_seq[6:11])
# Note: Upper bound is exclusive in slicing, so we use 11 to include position 10
Output
First 5 nucleotides: [0 1 2 3 0]
Last 4 nucleotides: [0 0 2 3]
Every third nucleotide: [0 3 1 3 0 3]
Subsequence pos 6-10: [1 2 3 3 2]
Exercise 2: Gene Expression Analysis (2-3 minutes)
Given a gene expression matrix where rows represent genes and columns represent conditions:
import numpy as np
gene_expr = np.array([
[15.2, 21.5, 18.9, 11.8, 25.3], # Gene 1
[42.3, 38.1, 29.6, 33.2, 19.7], # Gene 2
[8.4, 7.5, 9.2, 8.1, 10.5], # Gene 3
[31.6, 29.8, 27.5, 34.9, 36.2], # Gene 4
[17.3, 19.8, 22.5, 21.3, 18.2] # Gene 5
])
Tasks:
Extract the expression values for Gene 3
Extract the expression values for all genes in the fifth condition (column 5)
Extract a sub-matrix containing Genes 2-4 in conditions 2-3
Find the expression value for Gene 5 in condition 2
Exercise 2 - Solution:
# 1. Expression values for Gene 3
print("Gene 3 expression:", gene_expr[2])
# Alternative: gene_expr[2, :]
# 2. Expression values for all genes in condition 5 (column index 4)
print("Condition 5 expression:", gene_expr[:, 4])
# 3. Sub-matrix of Genes 2-4 in conditions 2-3
print("Sub-matrix (Genes 2-4, conditions 2-3):")
print(gene_expr[1:4, 1:3])
# array([[38.1, 29.6],
# [7.5, 9.2],
# [29.8, 27.5]])
# 4. Expression value for Gene 5 in condition 2
print("Gene 5, condition 2:", gene_expr[4, 1])
Output
Gene 3 expression: [ 8.4  7.5  9.2  8.1 10.5]
Condition 5 expression: [25.3 19.7 10.5 36.2 18.2]
Sub-matrix (Genes 2-4, conditions 2-3):
[[38.1 29.6]
 [ 7.5  9.2]
 [29.8 27.5]]
Gene 5, condition 2: 19.8
Exercise 3: Multi-sequence Alignment Analysis (2-3 minutes)
Consider a simplified alignment scoring matrix where each row represents a sequence, each column represents a position in the alignment, and each entry records a match (1) or mismatch (0):
import numpy as np
alignment_scores = np.array([
[1, 0, 1, 1, 0, 1, 0, 0, 1, 1], # Sequence 1
[1, 1, 0, 1, 0, 0, 1, 1, 0, 1], # Sequence 2
[0, 1, 1, 1, 1, 0, 0, 1, 0, 0], # Sequence 3
[1, 0, 0, 1, 1, 1, 0, 0, 1, 1] # Sequence 4
]) # 1 = match, 0 = mismatch
Tasks:
Find positions where all sequences match (all rows having 1s in a column; use np.all with a mask)
Extract scores for positions 3-7 (zero-based indices, inclusive) for all sequences
Find the matching pattern (positions with value 1) for Sequence 3
Extract a sub-alignment of the first two sequences for the last five positions
Exercise 3 - Solution:
# 1. Positions where all sequences match
all_match = np.all(alignment_scores == 1, axis=0)
print("Positions where all sequences match:", np.where(all_match)[0])
# 2. Scores for positions 3-7 for all sequences
print("Positions 3-7 scores:")
print(alignment_scores[:, 3:8])
# 3. Matching pattern for Sequence 3
seq3_matches = alignment_scores[2] == 1
print("Sequence 3 match positions:", np.where(seq3_matches)[0])
# 4. Sub-alignment of first two sequences for last five positions
print("Sub-alignment (Seq 1-2, last 5 positions):")
print(alignment_scores[0:2, 5:])
Output
Positions where all sequences match: [3]
Positions 3-7 scores:
[[1 0 1 0 0]
[1 0 0 1 1]
[1 1 0 0 1]
[1 1 1 0 0]]
Sequence 3 match positions: [1 2 3 4 7]
Sub-alignment (Seq 1-2, last 5 positions):
[[1 0 0 1 1]
[0 1 1 0 1]]
Key Takeaways
Keypoints
Efficient indexing and slicing are crucial for bioinformatics workflows
Key takeaways:
Indexing for accessing individual elements
Slicing for extracting regions of interest
Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)
Combine with boolean operations for filtering
Remember zero-based indexing
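Combining indexing with boolean operations, as mentioned in the takeaways, might look like the sketch below. It reuses the 4 × 4 expression matrix from the 2D demo; the cutoff value `threshold` is a hypothetical choice for illustration.

```python
import numpy as np

gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],    # Gene 1
    [45.1, 43.8, 29.2, 22.1],   # Gene 2
    [8.7, 9.2, 12.3, 10.5],     # Gene 3
    [67.2, 70.3, 68.7, 71.9],   # Gene 4
])

threshold = 20.0  # hypothetical expression cutoff

# Boolean mask with one entry per gene: is condition 1 above the cutoff?
high_in_cond1 = gene_expr[:, 0] > threshold
print(high_in_cond1)              # [False  True False  True]

# Using the mask as a row index keeps only Gene 2 and Gene 4
print(gene_expr[high_in_cond1])
```

Unlike basic slicing, boolean indexing returns a copy, so modifying the filtered result does not change `gene_expr`.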
Common pitfalls:
Off-by-one errors (especially when converting between biology’s 1-based and programming’s 0-based systems)
Overlooking the exclusive upper bound in slicing (end index is not included)
Forgetting that modifying slices can modify the original array (use .copy() when needed)
Confusing the row and column axes in 2D arrays (array[i, j] selects row i and column j; axis=0 runs down the rows, axis=1 runs across the columns)
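The first two pitfalls can be handled together with a small conversion helper. The sketch below assumes coordinates are given in the 1-based, inclusive convention common in biology; `slice_1based` is a hypothetical function name introduced here for illustration.

```python
import numpy as np

def slice_1based(arr, start, end):
    """Return positions start..end (1-based, inclusive); hypothetical helper."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Shift the start down by one; the exclusive stop already equals the inclusive end
    return arr[start - 1:end]

dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"
print(slice_1based(dna_seq, 2, 4))  # [1 2 3], i.e. "CGT"
```

The result is still a view of `dna_seq`, so the `.copy()` advice above applies here as well.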