Hands-on: RNA Expression Analysis - alternative method

Objectives

  • Examine differential expression of immune-related genes between patient groups previously classified as immunologically strong (‘istrong’) and immunologically weak (‘iweak’)

  • Apply an alternative analytical approach using Z-ratio methodology to complement standard differential expression tools like DESeq

  • Rank immune-related genes by their relative expression differences between the patient groups

RNA Expression Analysis Steps:

  1. Data Loading and visualization

    • Load sample group information (iweak vs istrong)

    • Load gene expression count matrix

    • View first few rows/columns

    • View basic info

  2. Sample Identification

    • Filter samples by group (iweak/istrong)

    • Match count matrix columns with sample IDs

  3. Data Preprocessing

    • Convert count matrix to numeric values

    • Apply log2 transformation: log2(counts + 1)

  4. Statistical Analysis

    • Calculate mean and std for each gene within each group

    • Compute Z-scores within each sample group

    • Calculate Z-score differences between groups

    • Compute standard deviation of all differences

  5. Ranking Genes

    • Calculate Z-ratio: difference / std_difference

    • Rank genes by Z-ratio (highest to lowest)

This workflow standardizes the comparison between sample groups by accounting for the overall variability in gene expression across the entire experiment.

import pandas as pd
import numpy as np

1. Data Loading and visualization

  • Load sample group information (iweak vs istrong)

  • Load gene expression count matrix

  • View first few rows/columns

  • View basic info

Load sample group information (iweak vs istrong)

sample_info = pd.read_csv(
    "test_data/Sample_group_info.csv", header=None, names=["Sample", "Group"]
)
print("Samples and Groups:\n", sample_info.head())
print("Dataframe info:")
sample_info.info()  # info() prints its report directly and returns None
print("\nNumber of samples in each group:")
print(sample_info.groupby(by="Group").size())
Samples and Groups:
         Sample    Group
0  SH_TS_BC111    iweak
1  SH_TS_BC112    iweak
2  SH_TS_BC113    iweak
3  SH_TS_BC119  istrong
4  SH_TS_BC133    iweak
Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Sample  303 non-null    object
 1   Group   303 non-null    object
dtypes: object(2)
memory usage: 4.9+ KB

Number of samples in each group:
Group
istrong    154
iweak      149
dtype: int64

Load gene expression count matrix

count_matrix = pd.read_csv(
    "test_data/count_matrix_with_row_indices.csv", header=0, index_col=0, sep=";"
)
print("Count matrix:\n", count_matrix.iloc[:, :5].head())
print("Dataframe info:")
count_matrix.info()  # info() prints its report directly and returns None
print(
    "Descriptive statistics (First 5 samples):\n", count_matrix.iloc[:, :5].describe()
)
Count matrix:
           SH_TS_BC_C1  SH_TS_BC_C11  SH_TS_BC_C15  SH_TS_BC_C3  SH_TS_BC01
Gene                                                                      
ACTR3B             25           559           231           44          23
ANLN              173          2475           886          320           6
APOBEC3G          114          8806          2781          537          47
AURKA             626          7492          2829          564          14
BAG1              317          5949          2357          275          26
Dataframe info:
<class 'pandas.core.frame.DataFrame'>
Index: 80 entries, ACTR3B to VEGFA
Columns: 483 entries, SH_TS_BC_C1 to UNC_TGS_BC_Y90_R1
dtypes: int64(483)
memory usage: 302.5+ KB
Descriptive statistics (First 5 samples):
         SH_TS_BC_C1  SH_TS_BC_C11  SH_TS_BC_C15   SH_TS_BC_C3   SH_TS_BC01
count     80.000000      80.00000     80.000000     80.000000    80.000000
mean    1118.700000   20114.17500   6846.137500   1403.150000   126.212500
std     2627.440095   42620.73209  13895.968032   2411.549117   329.326881
min        1.000000      13.00000      6.000000      0.000000     0.000000
25%       58.500000    1758.50000    692.500000    207.000000     3.750000
50%      265.000000    5481.00000   1903.500000    529.000000    22.000000
75%      849.500000   15620.50000   5396.750000   1142.000000   100.250000
max    15912.000000  239031.00000  79955.000000  12397.000000  2352.000000
print("Number of NaN values in each column:", count_matrix.isna().sum(0))
print("Number of NaN values in the dataframe:", count_matrix.isna().sum(0).sum())
Number of NaN values in each column: SH_TS_BC_C1          0
SH_TS_BC_C11         0
SH_TS_BC_C15         0
SH_TS_BC_C3          0
SH_TS_BC01           0
                    ..
UNC_TGS_BC_9m        0
UNC_TGS_BC_Y23       0
UNC_TGS_BC_Y23_R1    0
UNC_TGS_BC_Y90       0
UNC_TGS_BC_Y90_R1    0
Length: 483, dtype: int64
Number of NaN values in the dataframe: 0

2. Sample Identification

  • Filter samples by group (iweak/istrong)

  • Match count matrix columns with sample IDs

Filter samples and match count matrix - iweak

# Display info about iweak samples
iweak_samples = sample_info[sample_info["Group"] == "iweak"]
print("iweak samples:")
print(iweak_samples.head())
print("Number of iweak samples:", len(iweak_samples))
# Display info about iweak samples
print("iweak samples:")
print(iweak_samples.info())
iweak samples:
        Sample  Group
0  SH_TS_BC111  iweak
1  SH_TS_BC112  iweak
2  SH_TS_BC113  iweak
4  SH_TS_BC133  iweak
5  SH_TS_BC134  iweak
Number of iweak samples: 149
iweak samples:
<class 'pandas.core.frame.DataFrame'>
Index: 149 entries, 0 to 302
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Sample  149 non-null    object
 1   Group   149 non-null    object
dtypes: object(2)
memory usage: 3.5+ KB
None
# Identify columns that match iweak sample IDs
print("Samples in count matrix (first 10):\n", count_matrix.columns[:10])
print("Data Type of count_matrix.columns:", type(count_matrix.columns))

## `pandas.core.indexes.base.Index` is not a NumPy ndarray, but it is built on top of NumPy arrays.
## In other words, a pandas Index stores its data in a NumPy-compatible way, but it is a separate
## object that adds functionality specific to indexing and other pandas operations.
Samples in count matrix (first 10):
 Index(['SH_TS_BC_C1', 'SH_TS_BC_C11', 'SH_TS_BC_C15', 'SH_TS_BC_C3',
       'SH_TS_BC01', 'SH_TS_BC010_1', 'SH_TS_BC010_2', 'SH_TS_BC02',
       'SH_TS_BC04', 'SH_TS_BC05'],
      dtype='object')
Data Type of count_matrix.columns: <class 'pandas.core.indexes.base.Index'>
iweak_cols = count_matrix.columns.isin(iweak_samples["Sample"])
print("iweak column mask (first 10):")
print(iweak_cols[:10])
print("Number of iweak columns in iweak column mask:", iweak_cols.sum())
# print("Number of iweak columns in count matrix:", len(iweak_cols[iweak_cols]))
print("\niweak column mask (first 30):", iweak_cols[:30])
print(
    f"First 30 columns of iweak: {count_matrix.columns[iweak_cols][:30]} \
      \n Total number of iweak columns: {len(count_matrix.columns[iweak_cols])}"
)
iweak column mask (first 10):
[False False False False False False False False False False]
Number of iweak columns in iweak column mask: 54

iweak column mask (first 30): [False False False False False False False False False False False False
 False False False False False False False False False False False False
 False  True  True  True False False]
First 30 columns of iweak: Index(['SH_TS_BC111', 'SH_TS_BC112', 'SH_TS_BC113', 'SH_TS_BC133',
       'SH_TS_BC134', 'SH_TS_BC139', 'SH_TS_BC141', 'SH_TS_BC146',
       'SH_TS_BC147', 'SH_TS_BC152', 'SH_TS_BC154', 'SH_TS_BC155',
       'SH_TS_BC160', 'SH_TS_BC161', 'SH_TS_BC163', 'SH_TS_BC169',
       'SH_TS_BC172', 'SH_TS_BC173', 'SH_TS_BC176', 'SH_TS_BC181',
       'SH_TS_BC183', 'SH_TS_BC184', 'SH_TS_BC185', 'SH_TS_BC196',
       'SH_TS_BC198', 'SH_TS_BC200', 'SH_TS_BC203', 'SH_TS_BC207',
       'SH_TS_BC210', 'SH_TS_BC212'],
      dtype='object')       
 Total number of iweak columns: 54

Filter samples and match count matrix - istrong

# Display info about istrong samples
istrong_samples = sample_info[sample_info["Group"] == "istrong"]
print("\nistrong samples:")
print(istrong_samples.head())
print("Number of istrong samples:", len(istrong_samples))
# Display info about istrong samples
print("istrong samples:")
print(istrong_samples.info())
istrong samples:
         Sample    Group
3   SH_TS_BC119  istrong
10  SH_TS_BC150  istrong
11  SH_TS_BC151  istrong
13  SH_TS_BC153  istrong
19  SH_TS_BC165  istrong
Number of istrong samples: 154
istrong samples:
<class 'pandas.core.frame.DataFrame'>
Index: 154 entries, 3 to 301
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Sample  154 non-null    object
 1   Group   154 non-null    object
dtypes: object(2)
memory usage: 3.6+ KB
None
istrong_cols = count_matrix.columns.isin(istrong_samples["Sample"])
print("istrong column mask (first 10):")
print(istrong_cols[:10])
print("Number of istrong columns in istrong column mask:", istrong_cols.sum())

print("\nistrong column mask (first 30):", istrong_cols[:30])
print(
    f"First 30 columns of istrong: {count_matrix.columns[istrong_cols][:30]} \
      \n Total number of istrong columns: {len(count_matrix.columns[istrong_cols])}"
)
istrong column mask (first 10):
[False False False False False False False False False False]
Number of istrong columns in istrong column mask: 37

istrong column mask (first 30): [False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False]
First 30 columns of istrong: Index(['SH_TS_BC119', 'SH_TS_BC150', 'SH_TS_BC151', 'SH_TS_BC153',
       'SH_TS_BC165', 'SH_TS_BC166', 'SH_TS_BC170', 'SH_TS_BC171',
       'SH_TS_BC175', 'SH_TS_BC177', 'SH_TS_BC178', 'SH_TS_BC180',
       'SH_TS_BC182', 'SH_TS_BC188', 'SH_TS_BC193', 'SH_TS_BC199',
       'SH_TS_BC202', 'SH_TS_BC204', 'SH_TS_BC209', 'SH_TS_BC211',
       'SH_TS_BC219', 'SH_TS_BC233', 'SH_TS_BC235', 'SH_TS_BC240',
       'SH_TS_BC249', 'SH_TS_BC252', 'SH_TS_BC255', 'SH_TS_BC265',
       'SH_TS_BC266', 'SH_TS_BC272'],
      dtype='object')       
 Total number of istrong columns: 37
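
Note that only 54 of the 149 iweak IDs and 37 of the 154 istrong IDs appear among the count-matrix columns, so it can be worth listing which sample IDs did not match. The snippet below is a minimal sketch on toy data; `toy_info` and `toy_counts` are hypothetical stand-ins for `sample_info` and `count_matrix`, not the tutorial's files.

```python
import pandas as pd

# Hypothetical stand-ins for sample_info and count_matrix
toy_info = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3", "S4"],
        "Group": ["iweak", "iweak", "istrong", "istrong"],
    }
)
toy_counts = pd.DataFrame(
    {"S1": [5, 0], "S3": [7, 2], "S9": [1, 1]}, index=["GENE_A", "GENE_B"]
)

# Boolean mask over the sample sheet: True where the ID is a count-matrix column
in_matrix = toy_info["Sample"].isin(toy_counts.columns)

# Sample IDs listed in the group table but absent from the count matrix
missing = toy_info.loc[~in_matrix, "Sample"].tolist()

# Number of matched samples per group
matched = toy_info[in_matrix].groupby("Group").size()

print("Missing from count matrix:", missing)
print(matched)
```

Running the same check on the real tables would show whether the unmatched IDs are simply absent from the count matrix or differ in naming.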

3. Data Preprocessing

  • Convert count matrix to numeric values

  • Apply log2 transformation: log2(counts + 1)

Convert the count matrix to log scale:

  • Gene expression count data often contains zeros (genes that weren’t detected)

  • Since log₂(0) is undefined (it diverges to negative infinity), we add 1 to every value before taking the logarithm

  • This is called a “pseudo-count” approach, creating what’s known as “log₂(counts+1)”

  • Differences in log space correspond to fold changes in original space

    • A difference of 1 in log₂ space = a 2-fold change in original counts

    • A difference of 2 in log₂ space = a 4-fold change in original counts
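
The points above can be checked on a few numbers. This is a small sketch with made-up counts; note that the fold change recovered from the log difference refers to `counts + 1`, because of the pseudo-count.

```python
import numpy as np

# Hypothetical raw counts, including a zero
counts = np.array([0.0, 1.0, 3.0, 7.0])

# The pseudo-count of 1 keeps log2 defined for zero counts
log_counts = np.log2(counts + 1)
print(log_counts)  # [0. 1. 2. 3.]

# A difference of 1 in log2 space is a 2-fold change in (counts + 1):
# (3 + 1) / (1 + 1) == 2
fold_change = 2 ** (log_counts[2] - log_counts[1])
print(fold_change)  # 2.0
```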

Convert count matrix to numeric values

# Convert count matrix to numeric values
count_matrix = count_matrix.astype(float, errors="raise")
count_matrix.info()
<class 'pandas.core.frame.DataFrame'>
Index: 80 entries, ACTR3B to VEGFA
Columns: 483 entries, SH_TS_BC_C1 to UNC_TGS_BC_Y90_R1
dtypes: float64(483)
memory usage: 302.5+ KB
print("Counts of zeros in each column:")
print((count_matrix == 0).sum(axis=0))
print("Total zeros in count matrix:", (count_matrix == 0).sum(axis=0).sum())
Counts of zeros in each column:
SH_TS_BC_C1          0
SH_TS_BC_C11         0
SH_TS_BC_C15         0
SH_TS_BC_C3          3
SH_TS_BC01           5
                    ..
UNC_TGS_BC_9m        1
UNC_TGS_BC_Y23       4
UNC_TGS_BC_Y23_R1    9
UNC_TGS_BC_Y90       2
UNC_TGS_BC_Y90_R1    3
Length: 483, dtype: int64
Total zeros in count matrix: 3150
np_matrix = count_matrix.to_numpy()
print("NumPy matrix shape:", np_matrix.shape)
print("Counts of zeros in each column:", (np_matrix == 0).sum(axis=0))
print("Total Counts of zeros in matrix:", (np_matrix == 0).sum(axis=0).sum())
NumPy matrix shape: (80, 483)
Counts of zeros in each column: [ 0  0  0  3  5  4  3  2  6  6  2  4  4  4  1  4  0  0  8  0 16  1  0  1
  3 37  2  4  2  4  4  1  2  2  3  3  2  6 17 10 11  5  8  2  3  0  1  1
  5  3  4  2  2  4  1  1 18  2  4  1  8  0  0  3  1  1  5  6  2  3  6  4
  6 24 15  4  7 36  3 18 10 28  2 20  2  7  3  3  2  9 18  3  3  4  8  1
 11  3  5 13  6 20  6 16 17  3 17  2  6  2  5 18  9  3 14  2 13  2  0  0
  0  1  3  8 11  1  8 17  2  4  6 17 19 12  5 21  3 36  2  1  2  5 32  1
  0  1  0  1  0  7  2  1  1  6 11  1  1  0  3  7  0  2  9 11  6  5  4  1
 16  6  6  1  1  1  2  2  3  1  3  0  2  1  3  1  1  2  3  6  4  2 34  1
  1  2  2 12 24 20  3  2 13  2  2 21 10  1  8  3 18 19  4 23 35  1  2  3
 10  1  0  0  6  8  0  0  1  1  0  1  1  1  1  1  2 15  4  6  2 12  3 11
  3  3 17  8  8  1 14  6 18  7  0  1 16  8 16  3  7  6  6 13 24 15  3  6
 15  1  0  2  1 16  1  9  2  1  1  0  0  2  9  0  0  4  1  0 11  1 10  3
  0  7  0  3  0  2 24  2  8  1  2  0 18  4 17 26 22  9 20 21  1  7 10  2
 13  1 23  5 17  8  7 19 13 28 25 24 21 11  4  4  3 15  9  2 12 11 16  1
  0  8  7  9 11 12  6  5  0  7 14 12  7 22  4  7 13 19 13  6  0  3  2 15
 11  6  1 11  0  8  9  3  9 19  1  5  5  7  1  4  6  5  7  2  3  2  5  7
 13 17  0  1  6  4  0  0  1  2  3  9  2  8 11  2 36  7  1  0  3  8  2  0
  2 24  4  6 10  2  1  3 12  8  7  6 23  1 10  3  0 15  3 12  3  6  2  8
  8 26 21 13  8  5 16 18 19  9  2  3  2  4  3  2  2  9  1  1  0 21  1  1
  1  0  0  1  2  1  1  6  1  1  0  6  5  2  2  8  1  4  2  4  2  4  1  4
  9  2  3]
Total Counts of zeros in matrix: 3150

Apply log2 transformation: log2(counts + 1)

# Convert count_matrix to log2; NumPy ufuncs apply element-wise to a DataFrame
count_matrix_log2 = np.log2(count_matrix + 1)
print(
    "Log2 transformed count matrix (first 5 rows & columns):\n",
    count_matrix_log2.iloc[:5, :5],
)
print("Log2 transformed count matrix info:")
count_matrix_log2.info()  # info() prints its report directly and returns None
print(
    "Log2 transformed count matrix descriptive statistics (first 5 samples):\n",
    count_matrix_log2.iloc[:, :5].describe(),
)
Log2 transformed count matrix (first 5 rows & columns):
           SH_TS_BC_C1  SH_TS_BC_C11  SH_TS_BC_C15  SH_TS_BC_C3  SH_TS_BC01
Gene                                                                      
ACTR3B       4.700440      9.129283      7.857981     5.491853    4.584963
ANLN         7.442943     11.273796      9.792790     8.326429    2.807355
APOBEC3G     6.845490     13.104435     11.441907     9.071462    5.584963
AURKA        9.292322     12.871328     11.466586     9.142107    3.906891
BAG1         8.312883     12.538674     11.203348     8.108524    4.754888
Log2 transformed count matrix info:
<class 'pandas.core.frame.DataFrame'>
Index: 80 entries, ACTR3B to VEGFA
Columns: 483 entries, SH_TS_BC_C1 to UNC_TGS_BC_Y90_R1
dtypes: float64(483)
memory usage: 302.5+ KB
Log2 transformed count matrix descriptive statistics (first 5 samples):
        SH_TS_BC_C1  SH_TS_BC_C11  SH_TS_BC_C15  SH_TS_BC_C3  SH_TS_BC01
count    80.000000     80.000000     80.000000    80.000000   80.000000
mean      7.777790     12.376603     10.938512     8.774008    4.519537
std       2.917182      2.540082      2.458104     2.686284    2.755204
min       1.000000      3.807355      2.807355     0.000000    0.000000
25%       5.894663     10.780910      9.436778     7.698564    2.241446
50%       8.049386     12.419880     10.895191     9.049684    4.523562
75%       9.732161     13.931187     12.398038    10.158550    6.661454
max      13.957918     17.866844     16.286919    13.597820   11.200286

4. Statistical Analysis

  • Calculate mean and std for each gene within each group

  • Compute Z-scores within each sample group

  • Calculate Z-score differences between groups & SD of the Z-score difference

print(
    f"iweak_cols mask: {iweak_cols[:10]} \
      \nistrong_cols mask: {istrong_cols[:10]} \
      \nTotal number of iweak columns: {len(count_matrix_log2.columns[iweak_cols])} \
      \nTotal number of istrong columns: {len(count_matrix_log2.columns[istrong_cols])}"
)
iweak_cols mask: [False False False False False False False False False False]       
istrong_cols mask: [False False False False False False False False False False]       
Total number of iweak columns: 54       
Total number of istrong columns: 37

Calculate mean and std for each gene within each group

# Mean log2 value for iweak and istrong samples
mean_iweak = count_matrix_log2.iloc[:, iweak_cols].mean(axis=1)
mean_istrong = count_matrix_log2.iloc[:, istrong_cols].mean(axis=1)
print("Mean log2 value for iweak samples (first 5 rows):\n", mean_iweak.head())
print("Mean log2 value for istrong samples (first 5 rows):\n", mean_istrong.head())
Mean log2 value for iweak samples (first 5 rows):
 Gene
ACTR3B      7.860318
ANLN        8.870121
APOBEC3G    8.839295
AURKA       9.873015
BAG1        8.818064
dtype: float64
Mean log2 value for istrong samples (first 5 rows):
 Gene
ACTR3B       6.994971
ANLN         6.953521
APOBEC3G    10.527763
AURKA        9.192108
BAG1         9.029261
dtype: float64
# Standard deviation of log2 values for iweak and istrong samples
std_iweak = count_matrix_log2.iloc[:, iweak_cols].std(axis=1)
std_istrong = count_matrix_log2.iloc[:, istrong_cols].std(axis=1)
print(
    "Standard deviation log2 value for iweak samples (first 5 rows):\n",
    std_iweak.head(),
)
print(
    "Standard deviation log2 value for istrong samples (first 5 rows):\n",
    std_istrong.head(),
)
Standard deviation log2 value for iweak samples (first 5 rows):
 Gene
ACTR3B      1.995958
ANLN        1.554415
APOBEC3G    2.074605
AURKA       1.191852
BAG1        2.199874
dtype: float64
Standard deviation log2 value for istrong samples (first 5 rows):
 Gene
ACTR3B      2.319413
ANLN        2.843593
APOBEC3G    1.321013
AURKA       2.242906
BAG1        2.019663
dtype: float64
print(count_matrix_log2.shape, mean_iweak.shape)
(80, 483) (80,)
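
A detail that matters when comparing results across libraries: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population standard deviation (`ddof=0`). A short sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical values; mean is 5 and the sum of squared deviations is 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = values.std()                    # pandas default: ddof=1 -> sqrt(32 / 7)
population_sd = np.std(values.to_numpy())   # NumPy default: ddof=0 -> sqrt(32 / 8) = 2.0

print("pandas .std() :", sample_sd)
print("numpy np.std():", population_sd)
```

Passing `ddof=` explicitly to either function makes the choice visible in the code.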

Compute Z-scores within each sample group

# Calculate Z-scores for iweak samples
## Numpy like operations
z_iweak = (
    count_matrix_log2.iloc[:, iweak_cols] - mean_iweak.values.reshape(-1, 1)
) / std_iweak.values.reshape(-1, 1)
print("Z-scores for iweak samples (first 5 rows):\n")
print(z_iweak.iloc[:5, :5])
Z-scores for iweak samples (first 5 rows):

          SH_TS_BC111  SH_TS_BC112  SH_TS_BC113  SH_TS_BC133  SH_TS_BC134
Gene                                                                     
ACTR3B      -0.798379    -0.735487    -0.920836     0.458160    -0.943426
ANLN        -1.188713    -2.057170     0.161741     0.227662    -5.063077
APOBEC3G    -4.260712     0.307752    -0.569305     0.264495    -1.499743
AURKA       -1.109424    -1.367487     0.168751     1.121313    -2.346097
BAG1        -4.008440    -0.285580    -0.185771     0.582435    -1.021972

Z-score calculation using pandas built-in sub and div functions:

sub() and div():

  • .sub(): pandas’ method for element-wise subtraction

  • .div(): pandas’ method for element-wise division

  • Both accept a scalar, Series, or DataFrame as the other operand

  • axis= specifies the axis whose labels the other operand is matched against (axis=0 matches a Series to the row index)

  • This is the preferred approach here, as it is more readable and less error-prone

Note: sub() and div() align the other operand by label rather than by position. If the labels of the Series or DataFrame do not match the labels on the chosen axis, pandas does not raise an error; the unmatched labels produce NaN in the result, which can go unnoticed.
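
The alignment behaviour can be seen on a tiny example. This is an illustrative sketch; the gene names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [3.0, 6.0]}, index=["GENE_A", "GENE_B"]
)

# Same labels as df.index, but in a different order
means = pd.Series({"GENE_B": 4.0, "GENE_A": 2.0})

# Subtraction is matched by gene label, not by position
centered = df.sub(means, axis=0)
print(centered)

# A Series with a label that does not exist in df.index
mismatched = pd.Series({"GENE_A": 2.0, "GENE_X": 4.0})

# No error is raised; rows without a matching label become NaN
with_nan = df.sub(mismatched, axis=0)
print(with_nan)
```

Checking `result.isna().sum().sum()` after such an operation is a cheap way to catch label mismatches.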

# Calculate Z-scores for iweak samples, using `sub` and `div`

z_iweak = (
    count_matrix_log2.iloc[:, iweak_cols].sub(mean_iweak, axis=0).div(std_iweak, axis=0)
)
print("Z-scores for iweak samples (first 5 rows):\n", z_iweak.iloc[:5, :5])
print(
    "\n Z-scores for iweak samples Rows (Genes) and Columns (samples):", z_iweak.shape
)
Z-scores for iweak samples (first 5 rows):
           SH_TS_BC111  SH_TS_BC112  SH_TS_BC113  SH_TS_BC133  SH_TS_BC134
Gene                                                                     
ACTR3B      -0.798379    -0.735487    -0.920836     0.458160    -0.943426
ANLN        -1.188713    -2.057170     0.161741     0.227662    -5.063077
APOBEC3G    -4.260712     0.307752    -0.569305     0.264495    -1.499743
AURKA       -1.109424    -1.367487     0.168751     1.121313    -2.346097
BAG1        -4.008440    -0.285580    -0.185771     0.582435    -1.021972

 Z-scores for iweak samples Rows (Genes) and Columns (samples): (80, 54)
# Calculate Z-scores for istrong samples, using `sub` and `div`

z_istrong = (
    count_matrix_log2.iloc[:, istrong_cols]
    .sub(mean_istrong, axis=0)
    .div(std_istrong, axis=0)
)
print("Z-scores for istrong samples (first 5 rows):\n", z_istrong.iloc[:5, :5])
print(
    "\n Z-scores for istrong samples Rows (Genes) and Columns (samples):", z_istrong.shape
)
Z-scores for istrong samples (first 5 rows):
           SH_TS_BC119  SH_TS_BC150  SH_TS_BC151  SH_TS_BC153  SH_TS_BC165
Gene                                                                     
ACTR3B       0.251123     0.745681     0.473303     1.138150     0.310111
ANLN         0.963558    -0.046892     0.846965     1.049066     0.320252
APOBEC3G     0.135300     0.285290     1.097819     1.213074     0.762314
AURKA        0.438189     0.594938     0.731543     0.743076     0.704167
BAG1         0.293509     1.148990     0.992325     0.364224    -0.072595

 Z-scores for istrong samples Rows (Genes) and Columns (samples): (80, 37)

Calculate Z-score differences between groups & SD of the Z-score difference

  1. Calculate the mean z-score for each gene in each of the two groups

  2. Calculate the difference between the two group means (istrong - iweak) for each gene

  3. Calculate the SD of these differences across all genes

# Calculate the mean z-score for each gene in the two groups
# Calculate the z-score difference between the groups (istrong - iweak)

z_diff = z_istrong.mean(axis=1) - z_iweak.mean(axis=1)
print("Shape of z_diff:", z_diff.shape)
print("Z-score difference (istrong - iweak) (first 5 rows):\n", z_diff.head())
Shape of z_diff: (80,)
Z-score difference (istrong - iweak) (first 5 rows):
 Gene
ACTR3B      1.345857e-15
ANLN       -1.994845e-17
APOBEC3G   -7.123653e-16
AURKA      -4.685386e-16
BAG1        1.760576e-15
dtype: float64
# SD of z-score difference
z_diff_std = z_diff.std()
print("Type of z_diff_std:", type(z_diff_std))
print("Standard deviation of z-score difference:", z_diff_std)
Type of z_diff_std: <class 'numpy.float64'>
Standard deviation of z-score difference: 1.4956431674223958e-15
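
The magnitudes above (around 1e-15) deserve a closer look. When each gene is standardized with the mean and SD of its own group, the mean z-score of that gene within the group is zero by construction, so the difference between the two group means reflects only floating-point rounding. In the Z-ratio method as it is commonly described, z-scores are instead computed within each sample, across all genes, before the group means are compared. The sketch below contrasts the two on toy data; the values, gene names and group labels are hypothetical, and the per-sample variant is an assumption about the intended method rather than a reproduction of this tutorial's results.

```python
import pandas as pd

# Hypothetical log2 expression values: 4 genes x 6 samples (3 per group)
log_expr = pd.DataFrame(
    [
        [5.0, 6.0, 7.0, 9.0, 10.0, 11.0],      # GENE_0: higher in group B
        [8.0, 9.0, 10.0, 8.0, 9.0, 10.0],      # GENE_1: unchanged
        [10.0, 11.0, 12.0, 10.0, 11.0, 12.0],  # GENE_2: unchanged
        [7.0, 8.0, 9.0, 5.0, 6.0, 7.0],        # GENE_3: lower in group B
    ],
    index=["GENE_0", "GENE_1", "GENE_2", "GENE_3"],
    columns=["A1", "A2", "A3", "B1", "B2", "B3"],
)
group_a = ["A1", "A2", "A3"]  # stand-in for iweak
group_b = ["B1", "B2", "B3"]  # stand-in for istrong

# (1) Per-gene z-scores within one group: each row is centered on its own
#     group mean, so the row means of the z-scores are zero up to rounding
within_b = log_expr[group_b]
z_within_b = within_b.sub(within_b.mean(axis=1), axis=0).div(
    within_b.std(axis=1), axis=0
)
max_abs_mean = z_within_b.mean(axis=1).abs().max()
print("Largest |mean z| within group B:", max_abs_mean)

# (2) Per-sample z-scores: standardize each column across genes,
#     then compare the group means gene by gene
z_sample = log_expr.sub(log_expr.mean(axis=0), axis=1).div(
    log_expr.std(axis=0), axis=1
)
z_diff_toy = z_sample[group_b].mean(axis=1) - z_sample[group_a].mean(axis=1)
z_ratio_toy = z_diff_toy / z_diff_toy.std()
print(z_ratio_toy.sort_values(ascending=False))
```

With the per-sample variant, the gene that was set higher in group B ranks first and the gene that was set lower ranks last, whereas variant (1) carries no group information in its means.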

5. Ranking Genes

  • Calculate Z-ratio: difference / std_difference

  • Rank genes by Z-ratio (highest to lowest)

Calculate Z-ratio: Z-score difference / std_difference

z_score_ratios = z_diff / z_diff_std
print("Shape of z_score_ratios:", z_score_ratios.shape)
print("Z-score ratios (istrong - iweak) (first 5 rows):\n", z_score_ratios.head())
Shape of z_score_ratios: (80,)
Z-score ratios (istrong - iweak) (first 5 rows):
 Gene
ACTR3B      0.899852
ANLN       -0.013338
APOBEC3G   -0.476294
AURKA      -0.313269
BAG1        1.177136
dtype: float64

Rank genes by Z-ratio (highest to lowest)

z_score_ratios.sort_values(ascending=False)
Gene
GAPDH      3.143055
CCL5       2.748459
CD68       2.222521
HLA-DMA    2.007418
MKI67      1.917063
             ...   
MDM2      -1.469024
UBE2C     -1.492727
PSMC4     -1.776938
TYMS      -2.403611
NUF2      -2.414980
Length: 80, dtype: float64
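
Once a ranked Series is available, the extremes can be pulled out directly. The following is a small sketch on a hypothetical Series of ratios; the cutoff of 1.5 is an arbitrary illustration, not a threshold taken from this tutorial.

```python
import pandas as pd

# Hypothetical Z-ratios for five genes
ratios = pd.Series({"G1": 2.1, "G2": -0.4, "G3": 0.9, "G4": -2.6, "G5": 1.7})

top2 = ratios.nlargest(2)      # highest ratios, in descending order
bottom2 = ratios.nsmallest(2)  # lowest ratios, in ascending order

# Genes whose absolute ratio reaches an illustrative cutoff
cutoff = 1.5
flagged = ratios[ratios.abs() >= cutoff]

print(top2)
print(bottom2)
print(flagged)
```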
import matplotlib.pyplot as plt
z_score_ratios.sort_values(ascending=False).plot(
    kind="bar",
    figsize=(20, 5),
    title="Z-score ratios (istrong - iweak)",
    xlabel="Genes",
    ylabel="Z-score ratios",
)
[Figure: bar chart titled "Z-score ratios (istrong - iweak)", genes sorted from highest to lowest ratio; x-axis: Genes, y-axis: Z-score ratios]