Reproducible dependencies, environments, and workflows

Objectives

  • Learn how to document dependencies in Python.

  • Discuss how to document dependencies in other languages.

  • Each (non-trivial) Python project from now on should have a requirements.txt or environment.yml file.

Have you ever experienced the following?

  1. You try to run a code or notebook and it cannot find a module?

Traceback (most recent call last):
  File "/home/user/imgfilters/example.py", line 1, in <module>
    from imgfilters.filters import (
  File "/home/user/imgfilters/imgfilters/filters.py", line 1, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
  1. You have installed the module but the code still does not work since it needs a different version of the module?

  2. “It works on my machine 🤷” but not on your machine?

Discussion

  • How do you install and use dependencies in your projects?

  • How do you document them?

Tracking dependencies in Python

There are two standard files to document dependencies in Python. Either use requirements.txt (together with pip) or environment.yml (Conda).

Let us study the requirements.txt from our project.

  • How would we “pin” the versions?

  • When should we “pin” the versions?

How would the corresponding environment.yml look like?

name: imgfilters
channels:
  - conda-forge
dependencies:
  - python=3.12
  - numpy
  - scikit-image
  - scikit-learn
  - pywavelets
  - matplotlib
  - jupyterlab

What if you don’t remember which dependencies you have installed?

Conda:

# list all dependencies and write them to file
$ conda env export > environment.yml

# omit detailed build information
$ conda env export --no-builds > environment.yml

# only list dependencies that were explicitly installed, not their dependencies
$ conda env export --from-history > environment.yml

pip:

$ python -m pip freeze > requirements.txt

My recommendation to install dependencies and document them at the same time: Document them first and then install from file:

$ conda env create -f environment.yml
$ pip install -r requirements.txt

What if I need to record the entire operating system?

One solution can be: Containers (Apptainer/Singularity, Docker, …) but the details of this are outside the scope of this short course.

Recording computational steps

We need a way to record and communicate computational steps:

  • README (steps written out “in words”)

  • Scripts (typically shell scripts)

  • Notebooks (Jupyter or R Markdown)

  • Workflows (Snakemake, doit, …)

Busy kitchen

[Midjourney, CC-BY-NC 4.0]

More reading