Array Indexing and Slicing
Objectives
- Define and distinguish between indexing and slicing operations in NumPy arrays 
- Demonstrate proper syntax for accessing individual elements in 1D and 2D arrays using indexing 
- Extract ranges of elements using slicing techniques, including with negative indices and step parameters 
- Recognize and avoid common pitfalls when working with arrays, such as off-by-one errors and unintended modifications to original data 
Instructor note
- Teaching: 10 min
- Demo: 5 min 
Introduction
What are Indexing and Slicing?
- Indexing is the process of accessing a single element within a data structure
  - Uses square brackets with a single index value: array[0]
  - Python and NumPy, like most programming languages, use zero-based indexing (the first element is at position 0)

- Slicing is the process of extracting a subset or range of elements
  - Uses square brackets with a range specification: array[start:stop:step]
  - In NumPy, basic slicing creates a view of the original data (changes to the slice affect the original array)
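The difference between an index, a slice (view), and an explicit copy can be sketched as follows. This is a minimal illustration on a made-up array; the variable names are assumptions chosen for this example only.

```python
import numpy as np

seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])

# Indexing returns a single element
first = seq[0]

# Basic slicing returns a view that shares memory with `seq`
window = seq[1:4]
window[0] = 3          # also changes seq[1]

# .copy() gives an independent array
safe_window = seq[1:4].copy()
safe_window[0] = 1     # seq is not affected by this assignment

print(seq)                                 # [0 3 2 3 0 0 1 2]
print(np.shares_memory(seq, window))       # True
print(np.shares_memory(seq, safe_window))  # False
```

np.shares_memory is a convenient way to check whether two arrays are backed by the same data when it is unclear if an operation returned a view or a copy.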
 
Why Indexing Matters in Bioinformatics:
- Bioinformatics deals with large, complex biological datasets:
  - DNA/RNA sequences (can be millions of nucleotides long)
  - Protein sequences
  - Gene expression matrices (thousands of genes × dozens/hundreds of samples)
  - Molecular structures

- Efficient data access is crucial for:
  - Sequence alignment and comparison
  - Identifying motifs or patterns
  - Analyzing specific regions of interest (e.g., genes, domains, binding sites)
  - Processing large-scale genomic or proteomic data
  - Statistical analysis across experimental conditions
 
NumPy Arrays in Bioinformatics
- Common bioinformatics applications:
  - Storing sequence data as numeric arrays
  - Representing position weight matrices
  - Managing alignment scores
  - Handling gene expression matrices
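As one possible illustration of storing sequence data as a numeric array, the sketch below converts a DNA string to integer codes and back. The A=0, C=1, G=2, T=3 encoding follows the demos in this lesson; the encode and decode helpers are hypothetical names introduced here, not part of NumPy.

```python
import numpy as np

# Encoding assumed from this lesson's demos: A=0, C=1, G=2, T=3
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = np.array(["A", "C", "G", "T"])

def encode(sequence):
    """Convert a DNA string into a 1D integer array (hypothetical helper)."""
    return np.array([BASE_TO_CODE[base] for base in sequence.upper()])

def decode(codes):
    """Convert an integer array back into a DNA string (hypothetical helper)."""
    return "".join(CODE_TO_BASE[codes])

dna_seq = encode("ACGTAACG")
print(dna_seq)                # [0 1 2 3 0 0 1 2]
print(decode(dna_seq[1:4]))   # CGT
```

This sketch raises a KeyError for any character outside A, C, G, T (for example the ambiguity code N); real data would need a policy for such characters.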
 
1D Array Operations
Demo
1D Array Indexing
import numpy as np
# Example: String sequence converted to numerical representation
# A=0, C=1, G=2, T=3
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"
# Single element access through indexing
print("Original array", dna_seq)
print("First element (0th index)", dna_seq[0])    # First nucleotide (0 = A)
print("Fourth element", dna_seq[3])    # Fourth nucleotide (3 = T)
print("Last element", dna_seq[-1])   # Last nucleotide (2 = G) using negative indexing
Output
Original array [0 1 2 3 0 0 1 2]
First element (0th index) 0
Fourth element 3
Last element 2
1D Array Slicing
print("Original array", dna_seq)
# Slicing: extracting subsequences
print("From second to fourth", dna_seq[1:4])  # From second to fourth nucleotide: array([1, 2, 3]) = "CGT"
print("First three nucleotides", dna_seq[:3])   # First three nucleotides: array([0, 1, 2]) = "ACG"
print("From sixth nucleotide to the end", dna_seq[5:])   # From sixth nucleotide to the end: array([0, 1, 2]) = "ACG"
# Slicing with negative indices
print("Last three nucleotides", dna_seq[-3:])  # Last three nucleotides: array([0, 1, 2]) = "ACG"
# Slicing with step
print("Every second nucleotide", dna_seq[::2])  # Every second nucleotide: array([0, 2, 0, 1]) = "AGAC"
# Reverse array
print("Reversed array", dna_seq[::-1])
Output
Original array [0 1 2 3 0 0 1 2]
From second to fourth [1 2 3]
First three nucleotides [0 1 2]
From sixth nucleotide to the end [0 1 2]
Last three nucleotides [0 1 2]
Every second nucleotide [0 2 0 1]
Reversed array [2 1 0 0 3 2 1 0]
Real-world significance in bioinformatics:
- Indexing:
  - Accessing specific nucleotide positions of interest
  - Retrieving expression values for particular genes
  - Referencing elements in position-specific scoring matrices

- Slicing:
  - Extracting specific regions like promoters, exons, or binding sites
  - Identifying sequence motifs (e.g., restriction sites, protein domains)
  - Analyzing k-mers (subsequences of length k)
  - Creating sliding windows along DNA/protein sequences
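The k-mer and sliding-window ideas above can be sketched with plain slicing. This is a minimal example on the demo sequence, assuming k = 3 and the motif "ACG"; both values are chosen only for illustration.

```python
import numpy as np

dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"
k = 3

# One slice per start position; the last valid start is len(dna_seq) - k
kmers = [dna_seq[i:i + k] for i in range(len(dna_seq) - k + 1)]
print(len(kmers))   # 6
print(kmers[0])     # [0 1 2]
print(kmers[-1])    # [0 1 2]

# Find the start positions of windows equal to the motif "ACG" = [0, 1, 2]
motif = np.array([0, 1, 2])
hits = [i for i, kmer in enumerate(kmers) if np.array_equal(kmer, motif)]
print(hits)         # [0, 5]
```

Each window here is a view into dna_seq, so no sequence data is copied; for long sequences a vectorized approach may be preferable to a Python loop.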
 
More info
Additional notes: 2D Array Operations
2D Array Operations:
# Example: Gene expression matrix
# Rows = genes, Columns = experimental conditions
gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],  # Gene 1 expression across 4 conditions
    [45.1, 43.8, 29.2, 22.1], # Gene 2 expression
    [8.7,  9.2,  12.3, 10.5], # Gene 3 expression
    [67.2, 70.3, 68.7, 71.9]  # Gene 4 expression
])
2D Array Indexing:
# Single element access: specific element at row, column
print(gene_expr[1, 2])    # Expression of Gene 2 in condition 3: 29.2
# Row indexing: accessing a specific row
print(gene_expr[0])       # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])
2D Array Slicing:
# Row slicing: expression profile of one gene across all conditions
print(gene_expr[0, :])    # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])
# Column slicing: expression of all genes in a specific condition
print(gene_expr[:, 1])    # All genes in condition 2: array([10.2, 43.8, 9.2, 70.3])
# Sub-matrix slicing: subset of genes in subset of conditions
print(gene_expr[0:2, 2:4])
# First 2 genes in conditions 3 and 4:
# array([[33.4,  7.8],
#        [29.2, 22.1]])
# Strided slicing: every other gene, first two conditions
print(gene_expr[::2, :2])
# Genes 1 & 3, conditions 1 & 2:
# array([[12.5, 10.2],
#        [8.7,  9.2]])
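One detail worth checking when mixing indexing and slicing in 2D: an integer index removes that axis, while a slice keeps it. The sketch below reuses the demo matrix; col_1d and col_2d are names assumed for this example.

```python
import numpy as np

gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],
    [45.1, 43.8, 29.2, 22.1],
    [8.7,  9.2,  12.3, 10.5],
    [67.2, 70.3, 68.7, 71.9],
])

col_1d = gene_expr[:, 1]     # integer index drops the column axis
col_2d = gene_expr[:, 1:2]   # length-one slice keeps the column axis

print(col_1d.shape)  # (4,)
print(col_2d.shape)  # (4, 1)
```

Keeping the axis with a length-one slice can be useful when the result must stay a matrix, for example when stacking it next to other columns.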
Real-world significance in bioinformatics
- Indexing:
  - Retrieving expression value for a specific gene in a specific condition
  - Accessing specific positions in sequence alignments
  - Finding interaction pairs in protein-protein interaction matrices

- Slicing:
  - Comparing gene expression profiles across different tissues or time points
  - Analyzing subsets of genes after clustering
  - Extracting data for specific experiments or replicates
  - Processing sections of alignment score matrices
  - Analyzing specific regions in protein contact maps
  - Extracting protein domains from structure coordinate arrays
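Selecting a subset of genes and a subset of conditions can be sketched by combining a boolean mask with a slice. The threshold of 20.0 below is an arbitrary value assumed for illustration, not a recommended cutoff.

```python
import numpy as np

gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],
    [45.1, 43.8, 29.2, 22.1],
    [8.7,  9.2,  12.3, 10.5],
    [67.2, 70.3, 68.7, 71.9],
])

# Threshold chosen only for illustration
threshold = 20.0

# Mean expression per gene (one value per row)
gene_means = gene_expr.mean(axis=1)
high = gene_means > threshold

print(np.where(high)[0])   # [1 3]

# Rows selected by the mask, columns selected by a slice (conditions 3 and 4)
subset = gene_expr[high, 2:4]
print(subset.shape)                          # (2, 2)
print(np.shares_memory(gene_expr, subset))   # False
```

Unlike basic slicing, boolean-mask selection returns a copy, so modifying subset does not change gene_expr.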
 
Additional notes: Exercises
Exercise
Exercise 1: DNA Sequence Analysis (2-3 minutes)
Given a DNA sequence represented as an array of numerical values (A=0, C=1, G=2, T=3):
import numpy as np
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 3])  # "ACGTAACGTTGCAAGT"
Tasks:
- Extract the first 5 nucleotides 
- Extract the last 4 nucleotides 
- Extract every third nucleotide starting from the first position 
- Extract the subsequence from index 6 to index 10 (zero-based, inclusive)
Exercise 1 - Solution:
# 1. First 5 nucleotides
print("First 5 nucleotides:", dna_seq[:5])
# 2. Last 4 nucleotides
print("Last 4 nucleotides:", dna_seq[-4:])
# 3. Every third nucleotide
print("Every third nucleotide:", dna_seq[::3])
# 4. Subsequence from position 6 to 10
print("Subsequence pos 6-10:", dna_seq[6:11])
# Note: Upper bound is exclusive in slicing, so we use 11 to include position 10
Output
First 5 nucleotides: [0 1 2 3 0]
Last 4 nucleotides: [0 0 2 3]
Every third nucleotide: [0 3 1 3 0 3]
Subsequence pos 6-10: [1 2 3 3 2]
Exercise 2: Gene Expression Analysis (2-3 minutes)
Given a gene expression matrix where rows represent genes and columns represent conditions:
import numpy as np
gene_expr = np.array([
    [15.2, 21.5, 18.9, 11.8, 25.3],  # Gene 1
    [42.3, 38.1, 29.6, 33.2, 19.7],  # Gene 2
    [8.4,  7.5,  9.2,  8.1,  10.5],  # Gene 3
    [31.6, 29.8, 27.5, 34.9, 36.2],  # Gene 4
    [17.3, 19.8, 22.5, 21.3, 18.2]   # Gene 5
])
Tasks:
- Extract the expression values for Gene 3
- Extract the expression values for all genes in the fifth column (condition 5)
- Extract a sub-matrix containing Genes 2-4 in columns 2-3
- Find the expression value for Gene 5 in column 2
Exercise 2 - Solution:
# 1. Expression values for Gene 3
print("Gene 3 expression:", gene_expr[2])
# Alternative: gene_expr[2, :]
# 2. Expression values for all genes in column 5 (index 4)
print("Condition 5 expression:", gene_expr[:, 4])
# 3. Sub-matrix of Genes 2-4 in columns 2-3
print("Sub-matrix (Genes 2-4, columns 2-3):")
print(gene_expr[1:4, 1:3])
# array([[38.1, 29.6],
#        [7.5,  9.2],
#        [29.8, 27.5]])
# 4. Expression value for Gene 5 in column 2
print("Gene 5, column 2:", gene_expr[4, 1])
Output
Gene 3 expression: [ 8.4  7.5  9.2  8.1 10.5]
Condition 5 expression: [25.3 19.7 10.5 36.2 18.2]
Sub-matrix (Genes 2-4, columns 2-3):
[[38.1 29.6]
 [ 7.5  9.2]
 [29.8 27.5]]
Gene 5, column 2: 19.8
Exercise 3: Multi-sequence Alignment Analysis (2-3 minutes)
Consider a simplified alignment scoring matrix where each row represents a sequence, each column represents a position in the alignment, and each value records a match (1) or a mismatch (0):
import numpy as np
alignment_scores = np.array([
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],  # Sequence 1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],  # Sequence 2
    [0, 1, 1, 1, 1, 0, 0, 1, 0, 0],  # Sequence 3
    [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # Sequence 4
])  # 1 = match, 0 = mismatch
Tasks:
- Find positions where all sequences match (all rows have a 1 in that column; use np.all with a mask)
- Extract scores for positions 3-7 (zero-based indices, inclusive) for all sequences
- Find the matching pattern (positions with value 1) for Sequence 3 
- Extract a sub-alignment of the first two sequences for the last five positions 
Exercise 3 - Solution:
# 1. Positions where all sequences match
all_match = np.all(alignment_scores == 1, axis=0)
print("Positions where all sequences match:", np.where(all_match)[0]) 
# 2. Scores for positions 3-7 for all sequences
print("Positions 3-7 scores:")
print(alignment_scores[:, 3:8])
# 3. Matching pattern for Sequence 3
seq3_matches = alignment_scores[2] == 1
print("Sequence 3 match positions:", np.where(seq3_matches)[0])
# 4. Sub-alignment of first two sequences for last five positions
print("Sub-alignment (Seq 1-2, last 5 positions):")
print(alignment_scores[0:2, -5:])  # -5: selects the last five columns regardless of alignment length
Output
Positions where all sequences match: [3]
Positions 3-7 scores:
[[1 0 1 0 0]
 [1 0 0 1 1]
 [1 1 0 0 1]
 [1 1 1 0 0]]
Sequence 3 match positions: [1 2 3 4 7]
Sub-alignment (Seq 1-2, last 5 positions):
[[1 0 0 1 1]
 [0 1 1 0 1]]
Key Takeaways
Keypoints
- Efficient indexing and slicing are crucial for bioinformatics workflows 
- Key takeaways:
  - Indexing for accessing individual elements
  - Slicing for extracting regions of interest
  - Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)
  - Combine with boolean operations for filtering
  - Remember zero-based indexing

- Common pitfalls:
  - Off-by-one errors (especially when converting between biology’s 1-based and programming’s 0-based systems)
  - Overlooking the exclusive upper bound in slicing (end index is not included)
  - Forgetting that modifying slices can modify the original array (use .copy() when needed)
  - Confusing row-major vs. column-major operations
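The first three pitfalls can be illustrated together. The sketch below assumes coordinates given as 1-based, inclusive start and end (as in many biological annotations); region_to_slice is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def region_to_slice(start_1based, end_1based):
    """Hypothetical helper: 1-based inclusive coordinates -> Python slice."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Subtract 1 from the start; the inclusive 1-based end is already
    # the correct exclusive stop in 0-based slicing
    return slice(start_1based - 1, end_1based)

dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"

# Bases 2 through 4 in 1-based coordinates are "CGT"
region = dna_seq[region_to_slice(2, 4)].copy()  # copy so edits stay local
region[:] = 0

print(region)    # [0 0 0]
print(dna_seq)   # [0 1 2 3 0 0 1 2]
```

Without the .copy() call, the assignment to region would overwrite positions 2 to 4 of dna_seq as well.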