Recording computational steps

Questions

  • You have some steps that need to be run to do your work. How do you actually run them? Do they rely on your own memory and manual work, or are they reproducible? How do you communicate the steps to future you and to others?

  • How can we create a reproducible workflow?

  • When should we use scientific workflow management systems?

Instructor note

  • 5 min teaching

  • 15 min demo

Several steps from input data to result

The following material is partly derived from an HPC Carpentry lesson.

In this episode, we will use an example project which finds the most frequent words in books and plots the results of those statistics. In this example we wish to:

  1. Analyze word frequencies using code/count.py for 4 books (they are all in the data directory).

  2. Plot a histogram using code/plot.py.

(Figure: From book to word counts to plot)

Example (for one book only):

$ python code/count.py data/isles.txt > statistics/isles.data
$ python code/plot.py --data-file statistics/isles.data --plot-file plot/isles.png

Another way to analyze the data would be via a graphical user interface (GUI), where you can for example drag and drop files and click buttons to do the different processing steps.

Both of the above (single-line commands and simple graphical interfaces) are tricky in terms of reproducibility. We currently have two steps and 4 books, but imagine having 4 steps and 500 books. How could we deal with this?

As a first idea, we could express the workflow in a script. The repository includes such a script, called run_all.sh.

We can run it with:

$ bash run_all.sh

This is imperative style: we tell the script to run these steps in precisely this order, as we would run them manually, one after another.
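
Such a script might look roughly like the sketch below (a minimal sketch only; the actual run_all.sh in the repository may differ in its details):

#!/usr/bin/env bash

# imperative style: loop over all books and run both steps for each, one after another
for book in data/*.txt; do
    name=$(basename "${book}" .txt)
    python code/count.py "${book}" > "statistics/${name}.data"
    python code/plot.py --data-file "statistics/${name}.data" --plot-file "plot/${name}.png"
done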

Discussion

  • What are the advantages of this solution compared to processing all one by one?

  • Is the scripted solution reproducible?

  • Imagine adding more steps to the analysis and imagine the steps being time consuming. What problems do you anticipate with a scripted solution?


Workflow tools

Sometimes it may be helpful to go from imperative to declarative style. Rather than saying “do this and then that”, we describe dependencies and let the tool figure out the series of steps needed to produce the results.

Example workflow tool: Snakemake

Snakemake (inspired by GNU Make) is one of many tools to create reproducible and scalable data analysis workflows. Workflows are described via a human-readable, Python-based language. Snakemake workflows scale seamlessly from laptop to cluster or cloud, without the need to modify the workflow definition.


A demo

Preparation

The exercise (below) and the pre-exercise discussion use the word-count repository (https://github.com/coderefinery/word-count), which we need to clone to work on it.

If you want to do this exercise on your own, you can do so either on your own computer (follow the instructions in the bottom-right panel on the CodeRefinery installation instructions page) or on the Binder cloud service:

On your own computer:

  • Install the necessary tools

  • Activate the coderefinery conda environment with conda activate coderefinery.

  • Clone the word-count repository:

    $ git clone https://github.com/coderefinery/word-count.git
    

On Binder: We can also use the cloud service Binder to make sure we all have the same computing environment. This is interesting from a reproducible-research point of view, and the Jupyter lesson explains how this is even possible.

Workflow-1: Workflow solution using Snakemake

How Snakemake works

Somebody wrote a Snakemake solution in the Snakefile:

# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='code/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    input:
        script='code/plot.py',
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    shell: 'python {input.script} --data-file {input.book} --plot-file {output}'

We can see that Snakemake uses declarative style: Snakefiles contain rules that relate targets (output) to dependencies (input) and commands (shell).
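
Because the rules declare how each output relates to its inputs, we can also ask Snakemake for one specific result and let it work out which steps are needed. For example (assuming isles.txt is one of the books, as above):

$ snakemake -j 1 plot/isles.png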

Steps:

  1. Clone the example to your computer: $ git clone https://github.com/coderefinery/word-count.git

  2. Study the Snakefile. How does it know what to do first and what to do then?

  3. Try to run it. Since version 5.11 one needs to specify the number of cores (or jobs) using -j, --jobs or --cores:

    $ snakemake --delete-all-output -j 1
    $ snakemake -j 1
    

    The --delete-all-output part makes sure that we remove all generated files before we start.

  4. Try running snakemake again, observe that it refuses to rerun all steps, and discuss why:

    $ snakemake -j 1
    
    Building DAG of jobs...
    Nothing to be done (all requested files are present and up to date).
    
  5. Make a tiny modification to the plot.py script, run $ snakemake -j 1 again, and observe how only the plot steps are re-run.

  6. Make a tiny modification to one of the books, run $ snakemake -j 1 again, and observe how only the files for that book are regenerated.

  7. Discuss possible advantages compared to a scripted solution.

  8. Question for R developers: Imagine you want to rewrite the two Python scripts and use R instead. Which lines in the Snakefile would you have to modify so that it uses your R code? (One possible answer is sketched after this list.)

  9. If you make changes to the Snakefile, validate it using $ snakemake --lint.
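
One possible answer to the R question above, as a sketch only (the script names code/count.R and code/plot.R are hypothetical): only the script= inputs and the shell: commands of the two rules would need to change, e.g. for the counting rule:

# count words in one of our books, now calling a (hypothetical) R script
rule count_words:
    input:
        script='code/count.R',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    shell: 'Rscript {input.script} {input.book} > {output}'

The make_plot rule would be changed in the same way.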

Visualizing the workflow

We can visualize the directed acyclic graph (DAG) of our current Snakefile using the --dag option, which will output the DAG in dot language.

Note: This requires the Graphviz software, which can be installed by conda install graphviz.

$ snakemake -j 1 --dag | dot -Tpng > dag.png

Rules that have yet to be completed are indicated with solid outlines, while already completed rules are indicated with dashed outlines.

(Figure: Snakemake DAG)

Why Snakemake?

  • Gentle learning curve.

  • Free, open-source, and installs easily via conda or pip.

  • Cross-platform (Windows, macOS, Linux) and compatible with all High Performance Computing (HPC) schedulers: the same workflow works without modification, and scales appropriately, whether on a laptop or a cluster.

  • If several workflow steps are independent of each other, and you have multiple cores available, Snakemake can run them in parallel (e.g. snakemake -j 4 runs up to four jobs at a time).

  • It is possible to define isolated software environments per rule, e.g. by adding conda: 'environment.yml' to a rule.

  • It is also possible to run workflow steps in Docker or Apptainer containers, e.g. by adding container: 'docker://some-org/some-tool#2.3.1' to a rule (see the sketch after this list).

  • Heavily used in bioinformatics, but is completely general.

  • Nice functionality for archiving the workflow; see the official documentation.
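
As a sketch of the two points about software environments and containers (the container image name is a placeholder, not part of the word-count example), a rule can carry its own software environment:

# count words in one of our books, with an isolated software environment for this rule
rule count_words:
    input:
        script='code/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'                          # per-rule conda environment
    container: 'docker://some-org/some-tool:2.3.1'    # hypothetical container image
    shell: 'python {input.script} {input.book} > {output}'

These directives take effect when Snakemake is invoked with --use-conda or --use-singularity, respectively.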

Tools like Snakemake help us with reproducibility by supporting automation, scalability, and portability of our workflows.

Similar tools

Keypoints

  • Computational steps can be recorded in many ways.

  • Workflow tools can help if there are many steps to be executed.