Array Indexing and Slicing

Objectives

Define and distinguish between indexing and slicing operations in NumPy arrays
Demonstrate proper syntax for accessing individual elements in 1D and 2D arrays using indexing
Extract ranges of elements using slicing techniques, including with negative indices and step parameters
Recognize and avoid common pitfalls when working with arrays, such as off-by-one errors and unintended modifications to original data

Instructor note

Teaching : 10 min
Demo: 5 min

Introduction

What is Indexing and Slicing?

Indexing is the process of accessing specific individual elements within a data structure
- Uses square brackets with a single index value: array[0]
- Most programming languages use zero-based indexing (first element is at position 0)
Slicing is the process of extracting a subset or range of elements
- Uses square brackets with a range specification: array[start:stop:step]
- Creates a view of the original data (changes to the slice affect the original array)

Why Indexing Matters in Bioinformatics:

Bioinformatics deals with large, complex biological datasets:
- DNA/RNA sequences (can be millions of nucleotides long)
- Protein sequences
- Gene expression matrices (thousands of genes × dozens/hundreds of samples)
- Molecular structures
Efficient data access is crucial for:
- Sequence alignment and comparison
- Identifying motifs or patterns
- Analyzing specific regions of interest (e.g., genes, domains, binding sites)
- Processing large-scale genomic or proteomic data
- Statistical analysis across experimental conditions

NumPy Arrays in Bioinformatics

Common bioinformatics applications:
- Storing sequence data as numeric arrays
- Representing position weight matrices
- Managing alignment scores
- Handling gene expression matrices

1D Array Operations

Demo

1D Array Indexing

# Example: String sequence converted to numerical representation
# A=0, C=1, G=2, T=3
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"

# Single element access through indexing
print("Original array", dna_seq)
print("First element (0th index)", dna_seq[0])    # First nucleotide (0 = A)
print("Fourth element", dna_seq[3])    # Fourth nucleotide (3 = T)
print("Last element",dna_seq[-1])   # Last nucleotide (2 = G) using negative indexing

Output

Original array [0 1 2 3 0 0 1 2]
First element (0th index) 0
Fourth element 3
Last element 2

1D Array Slicing

print("Original array", dna_seq)

# Slicing * extracting subsequences
print("From second to fourth", dna_seq[1:4])  # From second to fourth nucleotide: array([1, 2, 3]) = "CGT"
print("First three nucleotides", dna_seq[:3])   # First three nucleotides: array([0, 1, 2]) = "ACG"
print("From sixth nucleotide to the end", dna_seq[5:])   # From sixth nucleotide to the end: array([0, 1, 2]) = "ACG"

# Slicing with negative indices
print("Last three nucleotides", dna_seq[-3:])  # Last three nucleotides: array([0, 1, 2]) = "ACG"

# Slicing with step
print("Every second nucleotide", dna_seq[::2])  # Every second nucleotide: array([0, 2, 0, 1]) = "AGAC"

# Reverse array
print("Every second nucleotide", dna_seq[::-1]) 

Solution

Original array [0 1 2 3 0 0 1 2]
From second to fourth [1 2 3]
First three nucleotides [0 1 2]
From sixth nucleotide to the end [0 1 2]
Last three nucleotides [0 1 2]
Every second nucleotide [0 2 0 1]
Every second nucleotide [2 1 0 0 3 2 1 0]

Real-world significance in bioinformatics:

Indexing:
- Accessing specific nucleotide positions of interest
- Retrieving expression values for particular genes
- Referencing elements in position-specific scoring matrices
Slicing:
- Extracting specific regions like promoters, exons, or binding sites
- Identifying sequence motifs (e.g., restriction sites, protein domains)
- Analyzing k-mers (subsequences of length k)
- Creating sliding windows along DNA/protein sequences

More info

Additional notes: 2D Array Operations

2D Array Operations:

# Example: Gene expression matrix
# Rows = genes, Columns = experimental conditions
gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],  # Gene 1 expression across 4 conditions
    [45.1, 43.8, 29.2, 22.1], # Gene 2 expression
    [8.7,  9.2,  12.3, 10.5], # Gene 3 expression
    [67.2, 70.3, 68.7, 71.9]  # Gene 4 expression
])

2D Array Indexing:

# Single element access * specific element at row, column
print(gene_expr[1, 2])    # Expression of Gene 2 in condition 3: 29.2

# Row indexing * accessing specific row
print(gene_expr[0])       # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])

2D Array Slicing:

# Row slicing * expression profile of one gene across all conditions
print(gene_expr[0, :])    # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])

# Column slicing * expression of all genes in a specific condition
print(gene_expr[:, 1])    # All genes in condition 2: array([10.2, 43.8, 9.2, 70.3])

# Sub-matrix slicing * subset of genes in subset of conditions
print(gene_expr[0:2, 2:4])
# First 2 genes in conditions 3 and 4:
# array([[33.4,  7.8],
#        [29.2, 22.1]])

# Strided slicing * every other gene, first two conditions

print(gene_expr[::2, :2])
# Genes 1 & 3, conditions 1 & 2:
# array([[12.5, 10.2],
#        [8.7,  9.2]])

Real-world significance in bioinformatics

Indexing:
- Retrieving expression value for a specific gene in a specific condition
- Accessing specific positions in sequence alignments
- Finding interaction pairs in protein-protein interaction matrices
Slicing:
- Comparing gene expression profiles across different tissues or time points
- Analyzing subsets of genes after clustering
- Extracting data for specific experiments or replicates
- Processing sections of alignment score matrices
- Analyzing specific regions in protein contact maps
- Extracting protein domains from structure coordinate arrays

Additional notes: Exercises

Exercise

Exercise 1: DNA Sequence Analysis (2-3 minutes)

Given a DNA sequence represented as an array of numerical values (A=0, C=1, G=2, T=3):

import numpy as np
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 3])  # "ACGTAACGTTGCAGT"

Tasks:

Extract the first 5 nucleotides
Extract the last 4 nucleotides
Extract every third nucleotide starting from the first position
Extract the subsequence from position 6 to position 10 (inclusive)

# 1. First 5 nucleotides
print("First 5 nucleotides:", dna_seq[:5])

# 2. Last 4 nucleotides
print("Last 4 nucleotides:", dna_seq[-4:])

# 3. Every third nucleotide
print("Every third nucleotide:", dna_seq[::3])

# 4. Subsequence from position 6 to 10
print("Subsequence pos 6-10:", dna_seq[6:11])
# Note: Upper bound is exclusive in slicing, so we use 11 to include position 10

Output

First 5 nucleotides: [0 1 2 3 0]
Last 4 nucleotides: [0 0 2 3]
Every third nucleotide: [0 3 1 3 0 3]
Subsequence pos 6-10: [1 2 3 3 2]

Exercise 2: Gene Expression Analysis (2-3 minutes)

Given a gene expression matrix where rows represent genes and columns represent conditions:

import numpy as np
gene_expr = np.array([
    [15.2, 21.5, 18.9, 11.8, 25.3],  # Gene 1
    [42.3, 38.1, 29.6, 33.2, 19.7],  # Gene 2
    [8.4,  7.5,  9.2,  8.1,  10.5],  # Gene 3
    [31.6, 29.8, 27.5, 34.9, 36.2],  # Gene 4
    [17.3, 19.8, 22.5, 21.3, 18.2]   # Gene 5
])

Tasks:

Extract the expression values for Gene 3
Extract the expression values for all genes under fifth column
Extract a sub-matrix containing Genes 2-4 under columns 2-3
Find the expression value for Gene 5 under columns 2

Exercise 2 -Solution:

# 1. Expression values for Gene 3
print("Gene 3 expression:", gene_expr[2])
# Alternative: gene_expr[2, :]

# 2. Expression values for all genes under column 5
print("Condition 4 expression:", gene_expr[:, 4]) 

# 3. Sub-matrix of Genes 2-4 under columns 2-3
print("Sub-matrix (Genes 2-4, columns 2-3):")
print(gene_expr[1:4, 1:3])
# array([[38.1, 29.6],
#        [7.5,  9.2],
#        [29.8, 27.5]])

# 4. Expression value for Gene 5 under columns 2
print("Gene 5, columns 2:", gene_expr[4, 1]) 

Output

Gene 3 expression: [ 8.4  7.5  9.2  8.1 10.5]
Condition 4 expression: [25.3 19.7 10.5 36.2 18.2]
Sub-matrix (Genes 2-4, columns 2-3):
[[38.1 29.6]
 [ 7.5  9.2]
 [29.8 27.5]]
Gene 5, columns 2: 19.8

Exercise 3: Multi-sequence Alignment Analysis (2-3 minutes)

Consider a simplified alignment scoring matrix where each row represents a match (1) or mismatch (0) and each column represents a position in the alignment:

import numpy as np
alignment_scores = np.array([
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],  # Sequence 1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],  # Sequence 2
    [0, 1, 1, 1, 1, 0, 0, 1, 0, 0],  # Sequence 3
    [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # Sequence 4
])  # 1 = match, 0 = mismatch

Tasks:

Find positions where all sequences match (all rows having 1s in a column - use np.all with a mask)
Extract scores for positions 3-7 for all sequences
Find the matching pattern (positions with value 1) for Sequence 3
Extract a sub-alignment of the first two sequences for the last five positions

Exercise 3 -Solution:

# 1. Positions where all sequences match
all_match = np.all(alignment_scores == 1, axis=0)
print("Positions where all sequences match:", np.where(all_match)[0]) 

# 2. Scores for positions 3-7 for all sequences
print("Positions 3-7 scores:")
print(alignment_scores[:, 3:8])

# 3. Matching pattern for Sequence 3
seq3_matches = alignment_scores[2] == 1
print("Sequence 3 match positions:", np.where(seq3_matches)[0])

# 4. Sub-alignment of first two sequences for last five positions
print("Sub-alignment (Seq 1-2, last 5 positions):")
print(alignment_scores[0:2, 5:])

Output

Positions where all sequences match: [3]
Positions 3-7 scores:
[[1 0 1 0 0]
 [1 0 0 1 1]
 [1 1 0 0 1]
 [1 1 1 0 0]]
Sequence 3 match positions: [1 2 3 4 7]
Sub-alignment (Seq 1-2, last 5 positions):
[[1 0 0 1 1]
 [0 1 1 0 1]]

Key Takeaways

Keypoints

Efficient indexing and slicing are crucial for bioinformatics workflows
Key takeaways:
- Indexing for accessing individual elements
- Slicing for extracting regions of interest
- Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)
- Combine with boolean operations for filtering
- Remember zero-based indexing
Common pitfalls:
- Off-by-one errors (especially when converting between biology’s 1-based and programming’s 0-based systems)
- Overlooking the exclusive upper bound in slicing (end index is not included)
- Forgetting that modifying slices can modify the original array (use .copy() when needed)
- Confusing row-major vs. column-major operations