How to parallelize independent tasks using workflows (example: Snakemake)

Objectives

Understand the concept of a workflow management tool.
Instead of thinking in terms of individual step-by-step commands, think in terms of dependencies (rules).
Try to port our computational pipeline to Snakemake.
See how Snakemake can identify independent steps and run them in parallel.
It is not our goal to remember all the details of Snakemake.

The problem

Imagine we want to process a large number of similar input data.

This could be one way to do it:

#!/usr/bin/env bash

num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    python generate_data.py \
            --num-samples 2000 \
            --training-data data/train_${i}.csv \
            --test-data data/test_${i}.csv

    python generate_predictions.py \
            --num-neighbors 7 \
            --training-data data/train_${i}.csv \
            --test-data data/test_${i}.csv \
            --predictions results/predictions_${i}.csv

    python plot_results.py \
            --training-data data/train_${i}.csv \
            --predictions results/predictions_${i}.csv \
            --output-chart results/chart_${i}.png
done

Discuss possible problems with this approach.

Thinking in terms of dependencies

For the following we will assume that we have the input data available:

#!/usr/bin/env bash

num_rounds=10

for i in $(seq -w 1 ${num_rounds}); do
    python generate_data.py \
            --num-samples 2000 \
            --training-data data/train_${i}.csv \
            --test-data data/test_${i}.csv
done

From here on we will focus on the processing part.

The central file in Snakemake is the snakefile. This is how we can express the pipeline in Snakemake (below we will explain it):

# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# rule that collects the target files
rule all:
    input:
        expand("results/chart_{number}.svg", number=numbers)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}.svg"
    log:
        "logs/chart_{number}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}.csv"
    log:
        "logs/predictions_{number}.txt"
    shell:
        """
        python {input.script} --num-neighbors 7 --training-data {input.training} --test-data {input.test} --predictions {output}
        """

Explanation:

The snakefile contains 3 rules and it will run the “all” rule by default unless we ask it to produce a different “target”:
- “all”
- “chart”
- “predictions”
Rules “predictions” and “chart” depend on input and produce output.
Note how “all” depends on the output of the “chart” rule and how “chart” depends on the output of the “predictions” rule.
The shell part of the rule shows how to produce the output from the input.
We ask Snakemake to collect all files that match "data/train_{number}.csv" and from this to infer the wildcards {number}.
Later we can refer to {number} throughout the snakefile.

This part defines what we want to have at the end:

rule all:
    input:
        expand("results/chart_{number}.svg", number=numbers)

Rules correspond to steps and parameter scanning can be done with wildcards.

Exercise

Exercise: Practicing with Snakemake

Create a snakefile (above) and run it with Snakemake (adjust number of cores):
```
$ snakemake --cores 8
```
Check the output. Did it use all available cores? How did it know which steps it can start in parallel?
Run Snakemake again. Now it should finish almost immediately because all the results are already there. Aha! Snakemake does not repeat steps that are already done. You can force it to re-run all with snakemake --delete-all-output.
Remove few files from results/ and run it again. Snakemake should now only re-run the steps that are necessary to get the deleted files.
Modify generate_predictions.py which is used in the rule “predictions”. Now Snakemake will only re-run the steps that depend on this script.
It is possible to only process one file with (useful for testing):
```
$ snakemake results/predictions_09.csv
```
Add few more data files to the input directory data/ (for instance by copying some existing ones) and observe how Snakemake will pick them up next time you run the workflow, without changing the snakefile.

What else is possible?

With the option --keep-going we can tell Snakemake to not give up on first failure.
The option --restart-times 3 would tell Snakemake to try to restart a rule up to 3 times if it fails.
It is possible to tell Snakemake to use a locally mounted file system instead of the default network file system for certain rules (documentation).
Sometimes you need to run different rules inside different software environments (e.g. Conda environments) and this is possible.
A lot more is possible:
- Snakemake documentation
- Snakemake tutorial

More elaborate example

In this example below we also scan over a range of numbers for the --num-neighbors parameter:

# the comma is there because glob_wildcards returns a named tuple
numbers, = glob_wildcards("data/train_{number}.csv")


# define the parameter scan for num-neighbors
neighbor_values = [1, 3, 5, 7, 9, 11]


# rule that collects all target files
rule all:
    input:
        expand("results/chart_{number}_{num_neighbors}.svg", number=numbers, num_neighbors=neighbor_values)


rule chart:
    input:
        script="plot_results.py",
        predictions="results/predictions_{number}_{num_neighbors}.csv",
        training="data/train_{number}.csv"
    output:
        "results/chart_{number}_{num_neighbors}.svg"
    log:
        "logs/chart_{number}_{num_neighbors}.txt"
    shell:
        """
        python {input.script} --training-data {input.training} --predictions {input.predictions} --output-chart {output}
        """


rule predictions:
    input:
        script="generate_predictions.py",
        training="data/train_{number}.csv",
        test="data/test_{number}.csv"
    output:
        "results/predictions_{number}_{num_neighbors}.csv"
    log:
        "logs/predictions_{number}_{num_neighbors}.txt"
    params:
        num_neighbors=lambda wildcards: wildcards.num_neighbors
    shell:
        """
        python {input.script} --num-neighbors {params.num_neighbors} --training-data {input.training} --test-data {input.test} --predictions {output}
        """

Other workflow management tools

Another very popular one is Nextflow.
Many workflow management tools exist (overview).