Automation and reproducible workflows

Objectives

  • Understand the difference between a script and a workflow.

  • Understand the pros and cons of “simple” scripts.

What if we need to run many similar calculations?

It all started relatively simply:

python generate-data.py --num-planets 100 --output-file initial.csv

python simulate.py --num-steps 50 \
                   --input-file initial.csv \
                   --output-file final.csv \
                   --trajectories-file trajectories.npz

python animate.py --initial-file initial.csv \
                  --trajectories-file trajectories.npz \
                  --output-file animation.mp4

But now we want to run this for different numbers of planets: 10, 20, 30, 40, …

One possible solution:

#!/usr/bin/env bash

# Run the full pipeline once for each number of planets.
for num_planets in 10 20 30 40 50; do
    # Step 1: generate the initial conditions.
    python generate-data.py --num-planets "${num_planets}" \
                            --output-file initial.csv

    # Step 2: run the simulation.
    python simulate.py --num-steps 50 \
                       --input-file initial.csv \
                       --output-file final.csv \
                       --trajectories-file trajectories.npz

    # Step 3: render the animation (the only output named per run).
    python animate.py --initial-file initial.csv \
                      --trajectories-file trajectories.npz \
                      --output-file "animation-${num_planets}.mp4"
done

Discussion

How would you solve this problem?

Can you list some alternatives to the solution presented above (for-loop inside a shell script)?

What are the pros and cons of the solution presented above?

  • Consider the case where a step can take hours.

  • Imagine needing to run hundreds of calculations.

  • Consider the case where a step/calculation can fail.

  • Consider the case where you might find a mistake in one of the Python scripts.
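To make these considerations concrete, here is a minimal Python sketch of one possible alternative to the shell loop. It is not the lesson's recommended solution, only an illustration under stated assumptions. The per-run file names (`initial-10.csv` and so on), the helper names `build_steps`, `is_up_to_date`, and `run_steps`, and the timestamp rule are all hypothetical. The sketch reruns a step only when an output is missing or older than one of its inputs, and it stops at the first failing step.

```python
import subprocess
import sys
from pathlib import Path


def build_steps(num_planets):
    """Describe the three pipeline steps for one value of num_planets."""
    # Assumed naming scheme: every file carries the parameter value,
    # so runs for different values do not overwrite each other.
    initial = f"initial-{num_planets}.csv"
    final = f"final-{num_planets}.csv"
    trajectories = f"trajectories-{num_planets}.npz"
    animation = f"animation-{num_planets}.mp4"
    python = sys.executable  # the interpreter running this script
    return [
        {
            "command": [python, "generate-data.py",
                        "--num-planets", str(num_planets),
                        "--output-file", initial],
            # Listing the script as an input means a fixed script triggers a rerun.
            "inputs": ["generate-data.py"],
            "outputs": [initial],
        },
        {
            "command": [python, "simulate.py",
                        "--num-steps", "50",
                        "--input-file", initial,
                        "--output-file", final,
                        "--trajectories-file", trajectories],
            "inputs": ["simulate.py", initial],
            "outputs": [final, trajectories],
        },
        {
            "command": [python, "animate.py",
                        "--initial-file", initial,
                        "--trajectories-file", trajectories,
                        "--output-file", animation],
            "inputs": ["animate.py", initial, trajectories],
            "outputs": [animation],
        },
    ]


def is_up_to_date(step):
    """True if all outputs exist and none is older than any existing input."""
    outputs = [Path(p) for p in step["outputs"]]
    inputs = [Path(p) for p in step["inputs"]]
    if not all(p.exists() for p in outputs):
        return False
    oldest_output = min(p.stat().st_mtime for p in outputs)
    newest_input = max((p.stat().st_mtime for p in inputs if p.exists()),
                       default=0.0)
    return oldest_output >= newest_input


def run_steps(steps, run=subprocess.run):
    """Run outdated steps in order and return the commands that were executed."""
    executed = []
    for step in steps:
        if is_up_to_date(step):
            continue  # result is already present and current
        # check=True raises CalledProcessError if the command fails,
        # so later steps never run on broken input.
        run(step["command"], check=True)
        executed.append(step["command"])
    return executed
```

Calling `run_steps(build_steps(n))` for each `n` in 10, 20, 30, 40, 50 would then reproduce the loop above. Even this small sketch has to handle file naming, dependencies, and failures by hand, and it still ignores partially written outputs and parallel execution. Dedicated workflow tools are designed to take care of such bookkeeping.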

Where to explore more

  • Snakemake

  • Nextflow

  • Many more workflow and pipeline tools and frameworks exist. Do not invent your own!