Parallelize using Workflow Manager

In this section, we discuss running the Iris data set example using Snakemake, a workflow manager tool.

Motivation

In the previous section Parallelize using scripting, we examined a workflow consisting of two scripts, preprocess.py and train_and_plot.py.

The preprocess.py script created a preprocessed Iris data set and saved it to the disk.
The train_and_plot.py script read (number of neighbors, distance metric) parameter values as command line arguments, loaded the preprocessed data, trained an Iris subspecies classifier, and plotted the classifier’s decision boundaries.

The workflow was then submitted to the Slurm queue using two separate submission scripts, with the following schedule:

preprocess.py was run first.
Multiple jobs of train_and_plot.py using different (number of neighbors, distance metric) values were submitted in parallel.

Submission scripts (and array jobs) work well for these kinds of small workflows and are usually the go-to solution. However, if the workflow is larger and consists of several steps, such as multiple preprocessing and postprocessing scripts, we may instead want to use a dedicated workflow manager tool.

The general idea of a workflow manager is that each computational step in a workflow is represented as a rule that reads its input from files and writes its output to files. The workflow manager then:

Detects in which order the steps need to be run and which steps of the workflow can be run in parallel.
Checks if some of the expected result files already exist on the disk and only runs jobs needed to produce the missing results.
Submits the jobs to the Slurm queue accordingly.
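
The first two steps above can be illustrated with a minimal sketch in plain Python. This is not how Snakemake is implemented; the RULES dictionary, the plan function, and the file names results/a.png and results/b.png are hypothetical and exist only for this example. The sketch assumes each rule produces exactly one output file and that "already exists" is the only up-to-date check.

```python
# Hypothetical toy rules: each output file maps to the input files it needs.
RULES = {
    "data/preprocessed/Iris.pkl": [],
    "results/a.png": ["data/preprocessed/Iris.pkl"],
    "results/b.png": ["data/preprocessed/Iris.pkl"],
}


def plan(targets, rules, exists):
    """Return batches of outputs to build.

    Outputs inside one batch do not depend on each other, so they could be
    run in parallel. `exists` is a callable that says whether a file is
    already on disk.
    """
    needed = {}  # output -> set of missing inputs that must be built first

    def visit(output):
        if output in needed or exists(output):
            return  # already planned, or nothing to do
        needed[output] = set()
        for dep in rules[output]:
            if not exists(dep):
                needed[output].add(dep)
                visit(dep)

    for target in targets:
        visit(target)

    batches, done = [], set()
    while len(done) < len(needed):
        ready = sorted(
            out for out, deps in needed.items()
            if out not in done and deps <= done
        )
        if not ready:
            raise ValueError("cyclic dependency between rules")
        batches.append(ready)
        done.update(ready)
    return batches


targets = ["results/a.png", "results/b.png"]
# Nothing exists yet: preprocess first, then both plots in parallel.
print(plan(targets, RULES, exists=lambda path: False))
# The preprocessed data already exists: only the plots are scheduled.
print(plan(targets, RULES, exists=lambda path: path.endswith(".pkl")))
```

A real workflow manager does considerably more (for example, it can also rerun steps whose inputs have changed), but the core idea is the same: the file dependencies define both the order and the available parallelism.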

While there are multiple workflow managers out there (see an example list), here we will use a particular tool named Snakemake. In Snakemake, the workflow rules are written in a Snakefile using a Python-like scripting language. Snakemake itself is also written in Python. However, the computational steps in the workflow can use any language.

Accessing Snakemake on an HPC cluster

Snakemake can be installed using pip along with its Slurm plugin. However, since not all clusters allow users to install their own software, it is up to the cluster admins to provide users with a recommended way to access Snakemake. For example:

CSC Puhti users can follow their official Snakemake documentation.
Aalto Triton users can load the generic scientific computing python environment module: module load scicomp-python-env

Consult your cluster’s documentation and/or contact your cluster’s administration to find the recommended way of using Snakemake.

Create and Run Snakemake Workflow

In order to run the preprocess.py and train_and_plot.py as a Snakemake workflow, we do the following:

We write a Snakefile which defines the preprocessing and training/plotting steps as rules.
We write a profile file which defines the same requested computational resources as the Slurm batch script in section Create a submission script.

The Snakefile:

# Parameter values
N_NEIGHBORS_LIST = [1, 2, 4, 8, 16, 32, 64]
METRICS = ["cosine", "euclidean", "haversine", "l1", "manhattan"]

# Final output files
# The "all" rule lists all files that should ultimately be produced by the workflow
rule all:
    input:
        expand("results/n_neighbors={n_neighbors}___metric={metric}.png", n_neighbors=N_NEIGHBORS_LIST, metric=METRICS)
        

# Rule to produce an image file corresponding to a (n_neighbors, metric) combination
# This rule
# - takes as input the preprocessed data
# - produces an image file as output
# - uses an Apptainer container to run the script
# - logs the output of the script
rule train_and_plot:
    input:
        "data/preprocessed/Iris.pkl"
    output:
        "results/n_neighbors={n_neighbors}___metric={metric}.png",
    container:
        "docker://harbor.cs.aalto.fi/aaltorse-public/coderefinery/parallel-workflow:latest"
    log: 
        "logs/train_and_plot/n_neighbors={n_neighbors}___metric={metric}.log"
    shell:
        "python train_and_plot.py --n_neighbors {wildcards.n_neighbors} --metric {wildcards.metric} > {log} 2>&1"


# Rule to preprocess and create the data set
rule preprocess:
    output:
        "data/preprocessed/Iris.pkl"
    container:
        "docker://harbor.cs.aalto.fi/aaltorse-public/coderefinery/parallel-workflow:latest"
    log: 
        "logs/preprocess_data/preprocess.log"
    shell:
        "python preprocess.py > {log} 2>&1"
 

A Snakemake profile file:

# Tell snakemake to use Slurm
executor: slurm

# Maximum number of parallel jobs
jobs: 10  

# Set number of threads for rule(s) (in Snakemake 'threads' is equal to cpus-per-task)
set-threads: 
  preprocess: 1
  train_and_plot: 2

# Set other resources (in this case memory and time) for rule(s)
# Formats are described in: 
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#standard-resources
set-resources:
  preprocess:
    mem: 500MB
    runtime: 30m
  train_and_plot:
    mem: 1GB
    runtime: 1h
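
To relate the profile to a hand-written Slurm batch script, the following sketch converts the values set for train_and_plot into roughly equivalent sbatch options. It is an illustration only and not the Slurm plugin's actual code: the helper names parse_quantity and sbatch_options are hypothetical, only the units used in the profile above are handled, and decimal units (1GB = 1000 MB) are an assumption of this sketch.

```python
import re

# Assumption of this sketch: decimal units, so "1GB" is read as 1000 MB.
SIZE_TO_MB = {"MB": 1, "GB": 1000}
TIME_TO_MIN = {"m": 1, "h": 60}


def parse_quantity(text, units):
    """Convert strings such as '500MB' or '30m' to an integer in base units."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text)
    if match is None or match.group(2) not in units:
        raise ValueError(f"unsupported value: {text!r}")
    return int(match.group(1)) * units[match.group(2)]


def sbatch_options(threads, mem, runtime):
    """Build approximate sbatch options for one rule (hypothetical helper)."""
    return [
        f"--cpus-per-task={threads}",
        # sbatch interprets a bare number for --mem as megabytes
        f"--mem={parse_quantity(mem, SIZE_TO_MB)}",
        # sbatch interprets a bare number for --time as minutes
        f"--time={parse_quantity(runtime, TIME_TO_MIN)}",
    ]


print(sbatch_options(threads=2, mem="1GB", runtime="1h"))
# ['--cpus-per-task=2', '--mem=1000', '--time=60']
```

The point of the profile is that these per-rule requests live in one configuration file instead of being repeated in separate submission scripts.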

We run Snakemake with

snakemake --snakefile Snakefile --profile profiles/slurm/ --software-deployment-method apptainer

What the command does:

Snakemake infers from the Snakefile that the required input files specified in rule “all” can be created using the rule “train_and_plot” in an embarrassingly parallel manner, once the rule “preprocess” has produced the preprocessed data. (Note that the input files of the rule “all” are our target image files.)
Snakemake looks for a profile configuration file config.yaml in the given path profiles/slurm/. The profile tells Snakemake to submit the jobs to Slurm and to request specific resources (CPUs, memory, runtime, etc.). The resources are specified for each rule individually.
The option --software-deployment-method apptainer tells Snakemake to run each rule inside the container given in its container directive, using Apptainer.
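
The inference in the first step can be sketched in plain Python. The code below stands in for expand() with itertools.product and then matches every target file name against the output pattern of train_and_plot to recover the wildcard values. The helper pattern_to_regex is hypothetical and much simpler than Snakemake's own matching; it only relies on the documented default that a wildcard matches the regular expression .+.

```python
import re
from itertools import product

N_NEIGHBORS_LIST = [1, 2, 4, 8, 16, 32, 64]
METRICS = ["cosine", "euclidean", "haversine", "l1", "manhattan"]
OUTPUT_PATTERN = "results/n_neighbors={n_neighbors}___metric={metric}.png"

# Plain-Python stand-in for expand(): one target per parameter combination.
targets = [
    OUTPUT_PATTERN.format(n_neighbors=n, metric=m)
    for n, m in product(N_NEIGHBORS_LIST, METRICS)
]


def pattern_to_regex(pattern):
    """Turn an output pattern into a regex with one named group per wildcard."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    return re.compile(regex)


rule_regex = pattern_to_regex(OUTPUT_PATTERN)

# Each target matches the train_and_plot output, which yields one job per file.
jobs = [rule_regex.fullmatch(target).groupdict() for target in targets]

print(len(jobs))  # 35
print(jobs[0])    # {'n_neighbors': '1', 'metric': 'cosine'}
```

Every one of the 35 jobs needs only data/preprocessed/Iris.pkl as input, so none of them depends on another. That is why they can all be submitted in parallel after a single preprocess job.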

Advantages and Disadvantages

Advantages of using a workflow manager to parallelize jobs:

Defining the complete workflow in a workflow manager ensures that the scripts are submitted in the correct order and, where possible, in parallel.
The workflow manager checks if some or all of the expected result files already exist and only runs jobs needed to produce the missing results.
Workflow managers promote reproducibility of experiments by fixing the computational pipeline and by encouraging the use of containers and environments.

Disadvantages:

Not all clusters support the workflow manager(s) of your choice out of the box. In this case, contact the cluster admins and ask what the recommended way to use them is.
Workflow managers are (relatively) complex tools with their own scripting syntaxes, practices, and ecosystems. Learning to use one will take time.