Hands-on: RNA Expression Analysis - alternative method

Objective

Dataset description

  • test_data/Sample_group_info.csv

    • Patients were divided into istrong (immunity-strong) and iweak (immunity-weak) groups based on their Immunity score

    • The Immunity score was calculated by averaging the gene expression values of the 17 identified immune-genes

  • test_data/count_matrix.csv: Gene expression raw counts

Steps

  1. Data Loading and visualization

    1. Load sample group information (iweak vs istrong)

    2. Load gene expression count matrix

    3. Examine array information

  2. Sample Identification

    1. Filter samples by group (iweak/istrong)

    2. Match count matrix columns with sample IDs

  3. Data Preprocessing

    1. Convert count matrix to numeric values

    2. Apply log2 transformation: log2(counts + 1)

  4. Statistical Analysis

    1. Calculate mean and std for each gene within each group

    2. Compute Z-scores within each sample group

    3. Calculate Z-score differences between groups

    4. Compute standard deviation of all differences

  5. Ranking Genes

    1. Calculate Z-ratio: difference / std_difference

    2. Rank genes by Z-ratio (highest to lowest)

Workflow:

        flowchart TD
    A[Load Sample Group Info] --> B{Filter by Group}
    B -->|iweak| C[Identify iweak samples]
    B -->|istrong| D[Identify istrong samples]
    
    E[Load Count Matrix] --> F[Match columns with samples]
    
    F --> G[Convert to numeric]
    G --> H[Log2 transformation]
    
    C --> F
    D --> F
    
    H --> I1[Calculate iweak mean & std]
    H --> I2[Calculate istrong mean & std]
    
    I1 --> J1[Compute Z-scores for iweak]
    I2 --> J2[Compute Z-scores for istrong]
    
    J1 --> K[Calculate Z-score difference]
    J2 --> K
    
    K --> L[Calculate standard deviation]
    
    L --> M[Compute Z-ratio]
    
    M --> N[Rank genes by Z-ratio]
    
    classDef dataNode fill:#f9f9f9,stroke:#aaa,stroke-width:2px;
    classDef processNode fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef resultNode fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    
    class A,E dataNode;
    class B,C,D,F,G,H,I1,I2,J1,J2,K,L,M processNode;
    class N resultNode;
    

Step 1: Loading and inspecting data

import numpy as np
from urllib.request import urlopen

# Read the CSV file into a numpy array
## CSV file contains sample group information
sample_group_info_url= "https://coderefinery.github.io/intermediate-python-ml/_downloads/b458a48eed87eb03931a8ce6efcdd351/Sample_group_info.csv"

data = np.genfromtxt(urlopen(sample_group_info_url), delimiter=',', dtype='str')
# Print the numpy array information

def print_array_info(array):
    # Get the shape of the array
    shape = array.shape

    # Get the number of dimensions of the array
    ndim = array.ndim

    # Get the data type of the array
    dtype = array.dtype

    # Get the number of elements in the array
    size = array.size

    print(f"Shape: {shape} \nNumber of dimensions: {ndim} \nData type: {dtype} \nSize: {size}")
print_array_info(data)
Shape: (303, 2) 
Number of dimensions: 2 
Data type: <U12 
Size: 606
# Read the CSV file into a numpy array with string dtype
## CSV file contains RNA count matrix

count_matrix_url = "https://coderefinery.github.io/intermediate-python-ml/_downloads/ab7de98031b77441be14a9d7ba21466c/count_matrix.csv"

count_matrix = np.genfromtxt(urlopen(count_matrix_url), delimiter=',',
                     dtype='str')
print_array_info(count_matrix)
Shape: (81, 483) 
Number of dimensions: 2 
Data type: <U18 
Size: 39123
# Remove sample names from the count matrix (cm) - Delete the first row
## Convert the cm to a float32 array
print(count_matrix[0:5, 0:5])
print("___")
cm = np.delete(count_matrix, 0, axis=0).astype("float32")
print(cm[0:5, 0:5])
[['SH_TS_BC_C1' 'SH_TS_BC_C11' 'SH_TS_BC_C15' 'SH_TS_BC_C3' 'SH_TS_BC01']
 ['25' '559' '231' '44' '23']
 ['173' '2475' '886' '320' '6']
 ['114' '8806' '2781' '537' '47']
 ['626' '7492' '2829' '564' '14']]
___
[[2.500e+01 5.590e+02 2.310e+02 4.400e+01 2.300e+01]
 [1.730e+02 2.475e+03 8.860e+02 3.200e+02 6.000e+00]
 [1.140e+02 8.806e+03 2.781e+03 5.370e+02 4.700e+01]
 [6.260e+02 7.492e+03 2.829e+03 5.640e+02 1.400e+01]
 [3.170e+02 5.949e+03 2.357e+03 2.750e+02 2.600e+01]]

Step 2: Sample Identification

  1. Filter samples by group (iweak/istrong)

  2. Match count matrix columns with sample IDs

alt text

# Print the first 5 rows and columns of the data
print(data[:5, :5])
[['SH_TS_BC111' 'iweak']
 ['SH_TS_BC112' 'iweak']
 ['SH_TS_BC113' 'iweak']
 ['SH_TS_BC119' 'istrong']
 ['SH_TS_BC133' 'iweak']]
# Access indices of the array where the second column is 'iweak'
iweak_index = np.where(data[:, 1] == 'iweak')
print(iweak_index)
print_array_info(iweak_index[0])
(array([  0,   1,   2,   4,   5,   6,   7,   8,   9,  12,  14,  15,  16,
        17,  18,  21,  24,  25,  27,  31,  33,  34,  35,  38,  39,  41,
        43,  45,  47,  49,  50,  51,  53,  54,  55,  56,  59,  60,  61,
        63,  64,  65,  67,  69,  71,  72,  73,  74,  75,  78,  80,  81,
        84,  86,  92,  93,  94,  97, 102, 106, 108, 111, 112, 114, 122,
       123, 126, 131, 132, 133, 139, 142, 145, 146, 148, 149, 150, 151,
       160, 161, 163, 164, 166, 168, 170, 171, 173, 176, 177, 180, 186,
       188, 192, 195, 196, 197, 200, 203, 206, 207, 212, 214, 215, 216,
       217, 219, 223, 225, 226, 227, 228, 230, 235, 244, 248, 249, 252,
       256, 258, 260, 262, 263, 265, 266, 269, 270, 271, 275, 276, 278,
       279, 280, 282, 283, 285, 286, 287, 288, 289, 291, 292, 293, 294,
       295, 296, 298, 299, 300, 302]),)
Shape: (149,) 
Number of dimensions: 1 
Data type: int64 
Size: 149
# Access indices of the array where the second column is 'iweak'
iweak_index = np.where(data[:, 1] == 'iweak')[0]
print("Index\n", iweak_index[:5], "\nData\n", data[iweak_index][:5,])
print_array_info(iweak_index)
Index
 [0 1 2 4 5] 
Data
 [['SH_TS_BC111' 'iweak']
 ['SH_TS_BC112' 'iweak']
 ['SH_TS_BC113' 'iweak']
 ['SH_TS_BC133' 'iweak']
 ['SH_TS_BC134' 'iweak']]
Shape: (149,) 
Number of dimensions: 1 
Data type: int64 
Size: 149
# View the first column of the count matrix where the sample group is 'iweak'
print(count_matrix[0:5, 0:5])
[['SH_TS_BC_C1' 'SH_TS_BC_C11' 'SH_TS_BC_C15' 'SH_TS_BC_C3' 'SH_TS_BC01']
 ['25' '559' '231' '44' '23']
 ['173' '2475' '886' '320' '6']
 ['114' '8806' '2781' '537' '47']
 ['626' '7492' '2829' '564' '14']]
# Create a boolean mask to find if the columns in the count matrix where the sample group is 'iweak'
cm_iweak_mask = np.isin(count_matrix[0, :], data[iweak_index, 0])
print(cm_iweak_mask[:30])
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False  True  True  True False False]
# Find the indices of the columns in the count matrix where the sample group is 'iweak'
cm_weak_cols = np.where(cm_iweak_mask)[0]
print(cm_weak_cols)
print_array_info(cm_weak_cols)
[ 25  26  27  36  37  38  40  41  42  46  48  49  51  52  53  56  64  65
  67  71  73  74  75  79  80  83  85  88  91  93  94  95  97  98  99 100
 103 104 105 108 109 110 112 114 116 117 118 119 120 124 126 128 131 133]
Shape: (54,) 
Number of dimensions: 1 
Data type: int64 
Size: 54
# Access indices of the array where the second column is 'istrong'
## Assign the indices to a istrong_index (not the tuple returned by np.where)
istrong_index = np.where(data[:, 1] == 'istrong')[0]
print(istrong_index)
print_array_info(istrong_index)
[  3  10  11  13  19  20  22  23  26  28  29  30  32  36  37  40  42  44
  46  48  52  57  58  62  66  68  70  76  77  79  82  83  85  87  88  89
  90  91  95  96  98  99 100 101 103 104 105 107 109 110 113 115 116 117
 118 119 120 121 124 125 127 128 129 130 134 135 136 137 138 140 141 143
 144 147 152 153 154 155 156 157 158 159 162 165 167 169 172 174 175 178
 179 181 182 183 184 185 187 189 190 191 193 194 198 199 201 202 204 205
 208 209 210 211 213 218 220 221 222 224 229 231 232 233 234 236 237 238
 239 240 241 242 243 245 246 247 250 251 253 254 255 257 259 261 264 267
 268 272 273 274 277 281 284 290 297 301]
Shape: (154,) 
Number of dimensions: 1 
Data type: int64 
Size: 154
# Find the indices of the columns in the count matrix where the sample group is 'istrong'
cm_strong_cols = np.where(np.isin(count_matrix[0, :], data[istrong_index, 0]))[0]
print(cm_strong_cols)
print_array_info(cm_strong_cols)
[ 33  44  45  47  54  55  58  63  66  68  69  70  72  76  78  81  84  86
  89  92  96 101 102 107 111 113 115 121 122 125 129 130 132 134 136 137
 138]
Shape: (37,) 
Number of dimensions: 1 
Data type: int64 
Size: 37
count_matrix.shape
(81, 483)

Step 3: Data Preprocessing

  • Convert count matrix to numeric values

  • Apply log2 transformation: log2(counts + 1)

# Remove sample names from the count matrix (cm) - Delete the first row
print(count_matrix[0:5, 0:5])
print("___")
## Convert the cm to a float32 array

cm = np.delete(count_matrix, 0, axis=0).astype("float32")
print(cm[0:5, 0:5])
[['SH_TS_BC_C1' 'SH_TS_BC_C11' 'SH_TS_BC_C15' 'SH_TS_BC_C3' 'SH_TS_BC01']
 ['25' '559' '231' '44' '23']
 ['173' '2475' '886' '320' '6']
 ['114' '8806' '2781' '537' '47']
 ['626' '7492' '2829' '564' '14']]
___
[[2.500e+01 5.590e+02 2.310e+02 4.400e+01 2.300e+01]
 [1.730e+02 2.475e+03 8.860e+02 3.200e+02 6.000e+00]
 [1.140e+02 8.806e+03 2.781e+03 5.370e+02 4.700e+01]
 [6.260e+02 7.492e+03 2.829e+03 5.640e+02 1.400e+01]
 [3.170e+02 5.949e+03 2.357e+03 2.750e+02 2.600e+01]]
# Convert cm to log scale
cm = np.log2(cm + 1)
print(cm)
print_array_info(cm)
[[ 4.70044    9.129283   7.857981  ...  0.         6.5999126  7.936638 ]
 [ 7.4429436 11.273795   9.79279   ...  6.794416   9.865733  11.2842455]
 [ 6.84549   13.104435  11.441907  ...  9.187352  10.403012  11.279611 ]
 ...
 [10.675957  13.7911625 12.428099  ... 10.456354  11.276706  12.22581  ]
 [ 4.857981   8.169925   7.491853  ...  8.948367   4.5849624  8.204571 ]
 [ 9.432542  12.378024  10.899357  ... 10.82893   13.397273  14.26415  ]]
Shape: (80, 483) 
Number of dimensions: 2 
Data type: float32 
Size: 38640
# Calculate mean and STD of each gene in iweak samples
iweak_mean = cm[:, cm_weak_cols].mean(1)    ## Mean of iweak samples
iweak_std = cm[:, cm_weak_cols].std(1)      ## STD of iweak samples
print("iweak_mean", iweak_mean[:5], iweak_mean.shape)
print("iweak_std", iweak_std[:5], iweak_std.shape)
iweak_mean [7.8603177 8.870119  8.839295  9.873015  8.818066 ] (80,)
iweak_std [1.9773906 1.5399547 2.0553062 1.1807643 2.1794095] (80,)
# Calculate mean and STD of each gene in istrong samples
istrong_mean = cm[:,cm_strong_cols].mean(1) ## Mean of istrong disease samples
istrong_std = cm[:,cm_strong_cols].std(1)   ## STD of istrong samples

print("istrong_mean", istrong_mean[:5], istrong_mean.shape)
print("istrong_std", istrong_std[:5], istrong_std.shape)
istrong_mean [ 6.9949713  6.953521  10.527761   9.192108   9.029262 ] (80,)
istrong_std [2.2878554 2.8049028 1.3030388 2.2123892 1.9921837] (80,)

Step 4: Statistical Analysis

  1. Calculate mean and std for each gene within each group

  2. Compute Z-scores within each sample group

  3. Calculate Z-score differences between groups

  4. Compute standard deviation of all differences

Z-scores:

  • Gene expression measurements (counts) can have vastly different scales across different samples due to technical variations

  • The Z-score transformation standardizes these measurements

$$ Z_{G} = \frac{(Count_G - \mu_{Count_{group}})}{\sigma_{Count_{group}}} $$

$$ Z_{G} : Z-score\ for\ a\ gene\ G$$ $$ Count_G: Log10\ count\ of\ gene\ G\ in\ a\ given\ sample$$ $$ \mu_{Count_{group}}: The\ overall\ average\ across\ all\ samples\ in\ the\ given\ group\ for\ each\ gene$$ $$ \sigma_{Count_{group}}: Standard\ deviation\ all\ samples\ in\ the\ given\ group\ for\ each\ gene$$

alt text

# Calculate Z-scores of each gene in iweak samples (vectorized)
print(cm.shape, iweak_mean.shape, iweak_std.shape)

## use .reshape(-1, 1) to convert the mean and std to column vectors
## This is necessary for vectorized operations to work correctly
cm_iweak_z = (cm[:, cm_weak_cols] - iweak_mean.reshape(-1, 1)) / iweak_std.reshape(-1, 1)
## The reshape is necessary because you want to subtract/divide row-wise, but NumPy's default broadcasting for 1D arrays applies column-wise.
print("cm_iweak_z", cm_iweak_z[:5, :5])
print_array_info(cm_iweak_z)
(80, 483) (80,) (80,)
cm_iweak_z [[-0.8058757  -0.7423927  -0.9294824   0.4624626  -0.95228404]
 [-1.1998737  -2.0764856   0.16326085  0.22980079 -5.1106176 ]
 [-4.3007193   0.31064132 -0.57465094  0.26697835 -1.5138254 ]
 [-1.1198412  -1.3803278   0.17033552  1.1318417  -2.3681269 ]
 [-4.04608    -0.28826228 -0.18751603  0.5879035  -1.0315684 ]]
Shape: (80, 54) 
Number of dimensions: 2 
Data type: float32 
Size: 4320
# Calculate Z-scores of each gene in istrong samples (vectorized)
cm_istrong_z = (cm[:, cm_strong_cols] - istrong_mean.reshape(-1, 1)) / istrong_std.reshape(-1, 1)
print("cm_istrong_z", cm_istrong_z[:5, :5])
print_array_info(cm_istrong_z)
cm_istrong_z [[ 0.2545867   0.7559668   0.47983187  1.1538492   0.3143879 ]
 [ 0.97684896 -0.04753883  0.8586477   1.0635368   0.32466918]
 [ 0.13716765  0.28922623  1.1129626   1.229808    0.77283007]
 [ 0.4442336   0.6031446   0.74163324  0.75332576  0.71388024]
 [ 0.29755822  1.1648388   1.0060126   0.3692481  -0.07359705]]
Shape: (80, 37) 
Number of dimensions: 2 
Data type: float32 
Size: 2960

Z-ratio = Z-score difference (per gene):

  • The Z-ratio provides a standardized measure of the difference between conditions for each gene

  • This accounts for the overall variability in the experiment

  • A gene showing a difference of, say, 0.5 in average Z-score

    • might be highly significant if most genes show very little difference (small Z-score difference - SD),

    • but not significant if many genes show large differences (large Z-score difference - SD)

  • It puts the individual gene’s change in the context of the overall experimental variation

$$ Z.score_{Diff_{gene}} = \bar{Z}{Gene, istring} - \bar{Z}{Gene, iweak} $$

$$ Z_{Ratio, Gene} = \frac{Z.score_{Diff_{gene}}}{SD_{Z.score_{Diff_{gene}}}} $$

alt text

diff_z_scores = cm_istrong_z.mean(1) - cm_iweak_z.mean(1)
std_diff = diff_z_scores.std()

### z-score ratio for each gene
## Divide Z-Ratio differences by the Z-Ratio differences SD
z_score_ratios = diff_z_scores / std_diff
print_array_info(z_score_ratios)
print(z_score_ratios[:10])
Shape: (80,) 
Number of dimensions: 1 
Data type: float32 
Size: 80
[-0.43179178 -1.5044755   1.247218    0.22787355  0.64381164  0.83408666
  0.96931595  2.2225075   2.1141531   0.49432233]

Step5: Rank genes according to the Z score ratio:

  • Sort z_score_ratio in descending order and access indices

  • Rank genes using indices

gene_list = ["ACTR3B", "ANLN", "APOBEC3G", "AURKA", "BAG1", "BCL2", "BIRC5", "BLVRA", "CCL5", "CCNB1", "CCNE1", "CCR2", "CD2", "CD27", "CD3D", "CD52", "CD68", "CDC20", "CDC6", "CDH3", "CENPF", "CEP55", "CORO1A", "CTSL2", "CXCL9", "CXXC5", "EGFR", "ERBB2", "ESR1", "EXO1", "FGFR4", "FOXA1", "FOXC1", "GAPDH", "GPR160", "GRB7", "GSTM1", "GUSB", "GZMA", "GZMK", "HLA-DMA", "IL2RG", "KIF2C", "KRT14", "KRT17", "KRT5", "LCK", "MAPT", "MDM2", "MELK", "MIA", "MKI67", "MLPH", "MMP11", "MRPL19", "MYBL2", "MYC", "NAT1", "NDC80", "NUF2", "ORC6", "PGR", "PHGDH", "PRKCB", "PSMC4", "PTPRC", "PTTG1", "RRM2", "SCUBE2", "SF3A1", "SFRP1", "SH2D1A", "SLC39A6", "TFRC", "TMEM45B", "TP53", "TYMS", "UBE2C", "UBE2T", "VEGFA"]
### `np.argsort()` returns indices of the array that would sort in ascending order
### slicing syntax [start:stop:step] with step -1 returns a reversed array

gene_ranks = np.argsort(z_score_ratios)[::-1]
print("Genes ranked according to Z-score ratio:")
print(np.array(gene_list)[gene_ranks])
Genes ranked according to Z-score ratio:
['PSMC4' 'FOXC1' 'TP53' 'BLVRA' 'CCL5' 'CD68' 'PTPRC' 'GRB7' 'APOBEC3G'
 'MRPL19' 'CENPF' 'BIRC5' 'UBE2C' 'PHGDH' 'ERBB2' 'TFRC' 'BCL2' 'GUSB'
 'MDM2' 'KRT5' 'BAG1' 'FOXA1' 'CCNB1' 'KRT17' 'CEP55' 'KRT14' 'NDC80'
 'AURKA' 'CXCL9' 'TMEM45B' 'SLC39A6' 'GZMK' 'PTTG1' 'GPR160' 'IL2RG'
 'GZMA' 'MYC' 'EXO1' 'MKI67' 'TYMS' 'PGR' 'SH2D1A' 'CD27' 'PRKCB' 'CCR2'
 'NUF2' 'RRM2' 'MIA' 'MELK' 'ESR1' 'SF3A1' 'UBE2T' 'EGFR' 'CCNE1' 'MAPT'
 'KIF2C' 'MMP11' 'HLA-DMA' 'NAT1' 'CTSL2' 'MLPH' 'CXXC5' 'CD2' 'ACTR3B'
 'CORO1A' 'SFRP1' 'FGFR4' 'CD3D' 'ORC6' 'LCK' 'CD52' 'CDH3' 'GSTM1'
 'VEGFA' 'SCUBE2' 'CDC6' 'ANLN' 'MYBL2' 'CDC20' 'GAPDH']