Reproducible dependencies, environments, and workflows
Objectives
Learn how to document dependencies in Python.
Discuss how to document dependencies in other languages.
Each (non-trivial) Python project from now on should have a
requirements.txt
orenvironment.yml
file.
Have you ever experienced the following?
You try to run a code or notebook and it cannot find a module?
Traceback (most recent call last):
File "/home/user/imgfilters/example.py", line 1, in <module>
from imgfilters.filters import (
File "/home/user/imgfilters/imgfilters/filters.py", line 1, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
You have installed the module but the code still does not work since it needs a different version of the module?
“It works on my machine 🤷” but not on your machine?
Discussion
How do you install and use dependencies in your projects?
How do you document them?
Tracking dependencies in Python
There are two standard files to document dependencies in Python.
Either use requirements.txt
(together with pip) or environment.yml
(Conda).
Let us study the requirements.txt from our project.
How would we “pin” the versions?
When should we “pin” the versions?
How would the corresponding environment.yml
look like?
name: imgfilters
channels:
- conda-forge
dependencies:
- python=3.12
- numpy
- scikit-image
- scikit-learn
- pywavelets
- matplotlib
- jupyterlab
What if you don’t remember which dependencies you have installed?
Conda:
# list all dependencies and write them to file
$ conda env export > environment.yml
# omit detailed build information
$ conda env export --no-builds > environment.yml
# only list dependencies that were explicitly installed, not their dependencies
$ conda env export --from-history > environment.yml
pip:
$ python -m pip freeze > requirements.txt
My recommendation to install dependencies and document them at the same time: Document them first and then install from file:
$ conda env create -f environment.yml
$ pip install -r requirements.txt
What if I need to record the entire operating system?
One solution can be: Containers (Apptainer/Singularity, Docker, …) but the details of this are outside the scope of this short course.
Recording computational steps
We need a way to record and communicate computational steps:
README (steps written out “in words”)
Scripts (typically shell scripts)
Notebooks (Jupyter or R Markdown)
Workflows (Snakemake, doit, …)
[Midjourney, CC-BY-NC 4.0]