Array Indexing and Slicing

Objectives

  • Define and distinguish between indexing and slicing operations in NumPy arrays

  • Demonstrate proper syntax for accessing individual elements in 1D and 2D arrays using indexing

  • Extract ranges of elements using slicing techniques, including with negative indices and step parameters

  • Recognize and avoid common pitfalls when working with arrays, such as off-by-one errors and unintended modifications to original data

Instructor note

  • Teaching : 10 min

  • Demo: 5 min

Introduction

What is Indexing and Slicing?

  • Indexing is the process of accessing specific individual elements within a data structure

    • Uses square brackets with a single index value: array[0]

    • Most programming languages use zero-based indexing (first element is at position 0)

  • Slicing is the process of extracting a subset or range of elements

    • Uses square brackets with a range specification: array[start:stop:step]

    • Creates a view of the original data (changes to the slice affect the original array)

Why Indexing Matters in Bioinformatics:

  • Bioinformatics deals with large, complex biological datasets:

    • DNA/RNA sequences (can be millions of nucleotides long)

    • Protein sequences

    • Gene expression matrices (thousands of genes × dozens/hundreds of samples)

    • Phylogenetic trees

    • Molecular structures

  • Efficient data access is crucial for:

    • Sequence alignment and comparison

    • Identifying motifs or patterns

    • Analyzing specific regions of interest (e.g., genes, domains, binding sites)

    • Processing large-scale genomic or proteomic data

    • Statistical analysis across experimental conditions

NumPy Arrays in Bioinformatics

  • Common bioinformatics applications:

    • Storing sequence data as numeric arrays

    • Representing position weight matrices

    • Managing alignment scores

    • Handling gene expression matrices

1D Array Operations

Demo

1D Array Indexing

# Example: String sequence converted to numerical representation
# A=0, C=1, G=2, T=3
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2])  # "ACGTAACG"

# Single element access through indexing
print("Original array", dna_seq)
print("First element (0th index)", dna_seq[0])    # First nucleotide (0 = A)
print("Fourth element", dna_seq[3])    # Fourth nucleotide (3 = T)
print("Last element",dna_seq[-1])   # Last nucleotide (2 = G) using negative indexing

1D Array Slicing

print("Original array", dna_seq)

# Slicing * extracting subsequences
print("From second to fourth", dna_seq[1:4])  # From second to fourth nucleotide: array([1, 2, 3]) = "CGT"
print("First three nucleotides", dna_seq[:3])   # First three nucleotides: array([0, 1, 2]) = "ACG"
print("From sixth nucleotide to the end", dna_seq[5:])   # From sixth nucleotide to the end: array([0, 1, 2]) = "ACG"

# Slicing with negative indices
print("Last three nucleotides", dna_seq[-3:])  # Last three nucleotides: array([0, 1, 2]) = "ACG"

# Slicing with step
print("Every second nucleotide", dna_seq[::2])  # Every second nucleotide: array([0, 2, 0, 1]) = "AGAC"

# Reverse array
print("Every second nucleotide", dna_seq[::-1]) 

Real-world significance in bioinformatics:

  • Indexing:

    • Accessing specific nucleotide positions of interest

    • Retrieving expression values for particular genes

    • Referencing elements in position-specific scoring matrices

  • Slicing:

    • Extracting specific regions like promoters, exons, or binding sites

    • Identifying sequence motifs (e.g., restriction sites, protein domains)

    • Analyzing k-mers (subsequences of length k)

    • Creating sliding windows along DNA/protein sequences

More info

Additional notes: 2D Array Operations

2D Array Operations:

# Example: Gene expression matrix
# Rows = genes, Columns = experimental conditions
gene_expr = np.array([
    [12.5, 10.2, 33.4, 7.8],  # Gene 1 expression across 4 conditions
    [45.1, 43.8, 29.2, 22.1], # Gene 2 expression
    [8.7,  9.2,  12.3, 10.5], # Gene 3 expression
    [67.2, 70.3, 68.7, 71.9]  # Gene 4 expression
])

2D Array Indexing:

# Single element access * specific element at row, column
print(gene_expr[1, 2])    # Expression of Gene 2 in condition 3: 29.2

# Row indexing * accessing specific row
print(gene_expr[0])       # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])

2D Array Slicing:

# Row slicing * expression profile of one gene across all conditions
print(gene_expr[0, :])    # Gene 1 across all conditions: array([12.5, 10.2, 33.4, 7.8])

# Column slicing * expression of all genes in a specific condition
print(gene_expr[:, 1])    # All genes in condition 2: array([10.2, 43.8, 9.2, 70.3])

# Sub-matrix slicing * subset of genes in subset of conditions
print(gene_expr[0:2, 2:4])
# First 2 genes in conditions 3 and 4:
# array([[33.4,  7.8],
#        [29.2, 22.1]])

# Strided slicing * every other gene, first two conditions

print(gene_expr[::2, :2])
# Genes 1 & 3, conditions 1 & 2:
# array([[12.5, 10.2],
#        [8.7,  9.2]])

Real-world significance in bioinformatics

  • Indexing:

    • Retrieving expression value for a specific gene in a specific condition

    • Accessing specific positions in sequence alignments

    • Finding interaction pairs in protein-protein interaction matrices

  • Slicing:

    • Comparing gene expression profiles across different tissues or time points

    • Analyzing subsets of genes after clustering

    • Extracting data for specific experiments or replicates

    • Processing sections of alignment score matrices

    • Analyzing specific regions in protein contact maps

    • Extracting protein domains from structure coordinate arrays

Additional exercises

Additional notes: Exercises

Exercise

Exercise 1: DNA Sequence Analysis (2-3 minutes)

Given a DNA sequence represented as an array of numerical values (A=0, C=1, G=2, T=3):

import numpy as np
dna_seq = np.array([0, 1, 2, 3, 0, 0, 1, 2, 3, 3, 2, 1, 0, 0, 2, 3])  # "ACGTAACGTTGCAGT"

Tasks:

  1. Extract the first 5 nucleotides

  2. Extract the last 4 nucleotides

  3. Extract every third nucleotide starting from the first position

  4. Extract the subsequence from position 6 to position 10 (inclusive)

# 1. First 5 nucleotides
print("First 5 nucleotides:", dna_seq[:5])

# 2. Last 4 nucleotides
print("Last 4 nucleotides:", dna_seq[-4:])

# 3. Every third nucleotide
print("Every third nucleotide:", dna_seq[::3])

# 4. Subsequence from position 6 to 10
print("Subsequence pos 6-10:", dna_seq[6:11])
# Note: Upper bound is exclusive in slicing, so we use 11 to include position 10

Output

First 5 nucleotides: [0 1 2 3 0]
Last 4 nucleotides: [0 0 2 3]
Every third nucleotide: [0 3 1 3 0 3]
Subsequence pos 6-10: [1 2 3 3 2]

Exercise 2: Gene Expression Analysis (2-3 minutes)

Given a gene expression matrix where rows represent genes and columns represent conditions:

import numpy as np
gene_expr = np.array([
    [15.2, 21.5, 18.9, 11.8, 25.3],  # Gene 1
    [42.3, 38.1, 29.6, 33.2, 19.7],  # Gene 2
    [8.4,  7.5,  9.2,  8.1,  10.5],  # Gene 3
    [31.6, 29.8, 27.5, 34.9, 36.2],  # Gene 4
    [17.3, 19.8, 22.5, 21.3, 18.2]   # Gene 5
])

Tasks:

  1. Extract the expression values for Gene 3

  2. Extract the expression values for all genes under fifth column

  3. Extract a sub-matrix containing Genes 2-4 under columns 2-3

  4. Find the expression value for Gene 5 under columns 2

Exercise 2 -Solution:

# 1. Expression values for Gene 3
print("Gene 3 expression:", gene_expr[2])
# Alternative: gene_expr[2, :]

# 2. Expression values for all genes under column 5
print("Condition 4 expression:", gene_expr[:, 4]) 

# 3. Sub-matrix of Genes 2-4 under columns 2-3
print("Sub-matrix (Genes 2-4, columns 2-3):")
print(gene_expr[1:4, 1:3])
# array([[38.1, 29.6],
#        [7.5,  9.2],
#        [29.8, 27.5]])

# 4. Expression value for Gene 5 under columns 2
print("Gene 5, columns 2:", gene_expr[4, 1]) 

Output

Gene 3 expression: [ 8.4  7.5  9.2  8.1 10.5]
Condition 4 expression: [25.3 19.7 10.5 36.2 18.2]
Sub-matrix (Genes 2-4, columns 2-3):
[[38.1 29.6]
 [ 7.5  9.2]
 [29.8 27.5]]
Gene 5, columns 2: 19.8

Exercise 3: Multi-sequence Alignment Analysis (2-3 minutes)

Consider a simplified alignment scoring matrix where each row represents a match (1) or mismatch (0) and each column represents a position in the alignment:

import numpy as np
alignment_scores = np.array([
    [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],  # Sequence 1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],  # Sequence 2
    [0, 1, 1, 1, 1, 0, 0, 1, 0, 0],  # Sequence 3
    [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # Sequence 4
])  # 1 = match, 0 = mismatch

Tasks:

  1. Find positions where all sequences match (all rows having 1s in a column - use np.all with a mask)

  2. Extract scores for positions 3-7 for all sequences

  3. Find the matching pattern (positions with value 1) for Sequence 3

  4. Extract a sub-alignment of the first two sequences for the last five positions

Exercise 3 -Solution:

# 1. Positions where all sequences match
all_match = np.all(alignment_scores == 1, axis=0)
print("Positions where all sequences match:", np.where(all_match)[0]) 

# 2. Scores for positions 3-7 for all sequences
print("Positions 3-7 scores:")
print(alignment_scores[:, 3:8])

# 3. Matching pattern for Sequence 3
seq3_matches = alignment_scores[2] == 1
print("Sequence 3 match positions:", np.where(seq3_matches)[0])

# 4. Sub-alignment of first two sequences for last five positions
print("Sub-alignment (Seq 1-2, last 5 positions):")
print(alignment_scores[0:2, 5:])

Output

Positions where all sequences match: [3]
Positions 3-7 scores:
[[1 0 1 0 0]
 [1 0 0 1 1]
 [1 1 0 0 1]
 [1 1 1 0 0]]
Sequence 3 match positions: [1 2 3 4 7]
Sub-alignment (Seq 1-2, last 5 positions):
[[1 0 0 1 1]
 [0 1 1 0 1]]

Key Takeaways

Keypoints

  • Efficient indexing and slicing are crucial for bioinformatics workflows

  • Key takeaways:

    • Indexing for accessing individual elements

    • Slicing for extracting regions of interest

    • Leverage both for efficient data manipulation in matrices (gene × condition, position × sequence, etc.)

    • Combine with boolean operations for filtering

    • Remember zero-based indexing

  • Common pitfalls:

    • Off-by-one errors (especially when converting between biology’s 1-based and programming’s 0-based systems)

    • Overlooking the exclusive upper bound in slicing (end index is not included)

    • Forgetting that modifying slices can modify the original array (use .copy() when needed)

    • Confusing row-major vs. column-major operations