Reproducible environments and dependencies
Objectives
Few programs have no dependencies at all. How should we deal with dependencies?
We will focus on installing and managing dependencies in Python when using packages from PyPI and Conda.
We will not discuss how to distribute your code as a package.
[This episode borrows from https://coderefinery.github.io/reproducible-python/reusable/ and https://aaltoscicomp.github.io/python-for-scicomp/dependencies/]
How to avoid: “It works on my machine 🤷”
Use a standard way to list dependencies in your project:
- Python: requirements.txt or environment.yml
- R: DESCRIPTION or renv.lock
- Rust: Cargo.lock
- Julia: Project.toml
- C/C++/Fortran: CMakeLists.txt or Makefile or spack.yaml or the module system on clusters or containers
- Other languages: …
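As a concrete example, a minimal requirements.txt for a small Python project might look like this (the package names and version numbers are purely illustrative):

```
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.4
```

Each line names one dependency; the `==` pins it to an exact version, which we discuss further below.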
Two ecosystems: PyPI (The Python Package Index) and Conda
PyPI
Installation tool: pip, uv, or similar.
Traditionally used for Python-only packages or for Python interfaces to external libraries. There are also packages that bundle external libraries (such as numpy).
Pros:
Easy to use
Package creation is easy
Cons:
Installing packages that need external libraries can be complicated
Conda
Installation tool: conda, mamba, or similar.
Aims to be a more general package distribution tool: it tries to provide not only Python packages, but also the libraries and tools needed by those packages.
Pros:
Quite easy to use
Easier to manage packages that need external libraries
Not only for Python
Cons:
Package creation is harder
Conda ecosystem explained
Anaconda is a distribution of conda packages made by Anaconda Inc. When using Anaconda, remember to check that your situation complies with their licensing terms (see below).
Anaconda has recently changed its licensing terms, which affects its use in professional settings. This caused an uproar in academia, and Anaconda clarified their position in this article.
Main points of the article are:
conda (installation tool) and community channels (e.g. conda-forge) are free to use.
Anaconda repository and Anaconda’s channels in the community repository are free for universities and companies with fewer than 200 employees. Non-university research institutions and national laboratories need licenses.
Miniconda is free as long as it is not used to download Anaconda’s packages.
Miniforge is not related to Anaconda, so it is free.
To make sharing environment files easy, we recommend using Miniforge to create the environments and conda-forge as the main channel that provides software.
Major repositories/channels:
- The Anaconda Repository houses Anaconda’s own proprietary software channels.
- Anaconda’s proprietary channels: main, r, msys2, and anaconda. These are sometimes called defaults.
- conda-forge is the largest open-source community channel. It has over 28k packages, including open-source versions of packages in Anaconda’s channels.
Tools and distributions for dependency management in Python
Poetry: Dependency management and packaging.
Pipenv: Dependency management, alternative to Poetry.
pyenv: If you need different Python versions for different projects.
virtualenv: Tool to create isolated Python environments for PyPI packages.
micropipenv: Lightweight tool to “rule them all”.
Conda: Package manager for Python and other languages maintained by Anaconda Inc.
Miniconda: A “miniature” version of conda, maintained by Anaconda Inc. By default uses Anaconda’s channels. Check licensing terms when using these packages.
Mamba: A drop-in replacement for conda. It used to be much faster than conda due to a better dependency solver, but nowadays conda uses the same solver. It still has some UI improvements.
Micromamba: Tiny version of the Mamba package manager.
Miniforge: Open-source Miniconda alternative with conda-forge as the default channel and optionally mamba as the default installer.
Pixi: Modern, super fast tool which can manage conda environments.
uv: Modern, super fast replacement for pip, poetry, pyenv, and virtualenv. You can also switch between Python versions.
Best practice: Install dependencies into isolated environments
For each project, create a separate environment.
Don’t install dependencies globally for all projects. Sooner or later, different projects will have conflicting dependencies.
Install dependencies from a file which documents them at the same time: first record them in a requirements.txt or an environment.yml file, then install from that file. This way you have a trace of your dependencies (we will practice this later below).
Keypoints
If somebody asks you what dependencies you have in your project, you should be able to answer this question with a file.
In Python, the two most common ways to do this are:
requirements.txt (for pip and virtual environments)
environment.yml (for conda and similar)
You can export (“freeze”) the dependencies from your current environment into these files:
# inside a conda environment
$ conda env export --from-history > environment.yml
# inside a virtual environment
$ pip freeze > requirements.txt
How to communicate the dependencies as part of a report/thesis/publication
Each notebook, script, or project which depends on libraries should come with either a requirements.txt or an environment.yml, unless you are creating and distributing the project as a Python package.
- Attach a requirements.txt or an environment.yml to your thesis.
- Even better: Put the requirements.txt or environment.yml in your Git repository along with your code.
- Even better: Also binderize your analysis pipeline.
Containers
A container is like an operating system inside a file.
“Building a container”: Container definition file (recipe) -> Container image
This can be used with Apptainer/SingularityCE.
Containers offer the following advantages:
Reproducibility: The same software environment can be recreated on different computers. They force you to know and document all your dependencies.
Portability: The same software environment can be run on different computers.
Isolation: The software environment is isolated from the host system.
“Time travel”:
You can run old/unmaintained software on new systems.
Code that needs newer dependencies which are not available on an old system can still be run on that system inside a container.
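To make this concrete, here is a minimal, illustrative Apptainer definition file (recipe) that builds an image with dependencies installed from a requirements.txt. The base image and file paths are assumptions for the sketch, not a prescription:

```
Bootstrap: docker
From: python:3.12-slim

%files
    # copy the dependency file into the image at build time
    requirements.txt /opt/requirements.txt

%post
    # install the documented dependencies into the image
    pip install --no-cache-dir -r /opt/requirements.txt

%runscript
    # running the container runs Python with the given arguments
    exec python "$@"
```

Building such a recipe into an image captures both the operating system and the Python dependencies, which is what makes the “time travel” above possible.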
How to install dependencies into environments
Now we understand a bit better why and how we installed dependencies for this course in the Software install instructions.
We have used Miniforge, and the long command we used was:
$ mamba env create -n course -f https://raw.githubusercontent.com/coderefinery/python-progression/main/software/environment.yml
This command did two things:
1. Created a new environment with the name “course” (specified by -n).
2. Installed all dependencies listed in the environment.yml file (specified by -f), which we fetched directly from the web. Here you can browse it.
For your own projects:
Start by writing an environment.yml or a requirements.txt file. They look like this:
name: course
channels:
- conda-forge
dependencies:
- python <= 3.12
- jupyterlab
- altair-all
- vega_datasets
- pandas
- numpy
- pytest
- scalene
- ruff
- icecream
- myst-parser
- sphinx
- sphinx-rtd-theme
- sphinx-autoapi
- sphinx-autobuild
The corresponding requirements.txt contains:
jupyterlab
altair
vega_datasets
pandas
numpy
pytest
scalene
ruff
icecream
myst-parser
sphinx
sphinx-rtd-theme
sphinx-autoapi
sphinx-autobuild
Then set up an isolated environment and install the dependencies from the file into it:
Create a new environment with the name “myenv” from environment.yml:
$ conda env create -n myenv -f environment.yml
Or equivalently:
$ mamba env create -n myenv -f environment.yml
Activate the environment:
$ conda activate myenv
Run your code inside the activated virtual environment.
$ python example.py
Create pixi.toml from environment.yml:
$ pixi init --import environment.yml
Run your code inside the environment:
$ pixi run python example.py
Create a virtual environment by running (the second argument is the name of the environment and you can change it):
$ python -m venv venv
Activate the virtual environment (how precisely depends on your operating system and shell).
Install the dependencies:
$ python -m pip install -r requirements.txt
Run your code inside the activated virtual environment.
$ python example.py
Create a virtual environment by running (the second argument is the name of the environment and you can change it):
$ uv venv venv
Activate the virtual environment (how precisely depends on your operating system and shell).
Install the dependencies:
$ uv pip sync requirements.txt
Run your code inside the virtual environment.
$ uv run python example.py
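Whichever tool you use, it is easy to accidentally run your code with the wrong interpreter. A small sketch for double-checking from inside Python (the function name `environment_info` is our own):

```python
import sys

def environment_info():
    """Report the interpreter location and whether we are in a venv.

    In a virtual environment, sys.prefix points at the environment
    while sys.base_prefix points at the base interpreter.
    """
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }

if __name__ == "__main__":
    for key, value in environment_info().items():
        print(f"{key}: {value}")
```

Note that this check detects venv/virtualenv-style environments; conda environments ship their own interpreter, so there `sys.prefix` equals `sys.base_prefix` and you should instead look at which `python` executable is being run.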
Updating environments
What if you forgot a dependency? Or during the development of your project you realize that you need a new dependency? Or you don’t need some dependency anymore?
Modify the environment.yml or requirements.txt file.
Then either remove your environment and create a new one, or update the existing one:
Update the environment by running:
$ conda env update --file environment.yml
Or equivalently:
$ mamba env update --file environment.yml
Remove pixi.toml. Then recreate it from the updated environment.yml by running:
$ pixi init --import environment.yml
Activate the virtual environment.
Update the environment by running:
$ pip install -r requirements.txt
Activate the virtual environment.
Update the environment by running:
$ uv pip sync requirements.txt
Pinning package versions
Let us look at the environment.yml which we used to set up the environment for this progression course. Dependencies are listed without version numbers. Should we pin the versions?
Both the pip and conda ecosystems, and all the tools we have mentioned, support pinning versions. It is also possible to define a range of versions instead of precise versions.
While a project is still in progress, it is common to use the latest versions and not pin them.
When publishing the script or notebook, it is a good idea to pin the versions to ensure that the code can be run in the future.
Remember that at some point you will face a situation where newer versions of the dependencies are no longer compatible with your software. At that point you will have to update your software to work with the newer versions, or lock the environment to a point in time.
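For reference, exact pins and version ranges look like this in the two file formats (version numbers are illustrative):

```
# requirements.txt (pip syntax)
numpy==1.26.4        # exact pin
pandas>=2.0,<3.0     # version range

# environment.yml (conda syntax), under dependencies:
#   - numpy=1.26.4       (exact pin uses a single =)
#   - pandas>=2.0,<3.0   (version range)
```

Note the small syntax difference: pip uses `==` for an exact pin while conda uses a single `=`.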
Managing dependencies on a supercomputer
Additional challenges:
Storage quotas: Do not install dependencies in your home directory. A conda environment can easily contain 100k files.
Network file systems struggle with many small files, and conda environments often consist of exactly that.
Possible solutions:
Try Pixi (modern take on managing Conda environments) and uv (modern take on managing virtual environments). Blog post: Using Pixi and uv on a supercomputer
Install your environment on the fly into a scratch directory on local disk (not the network file system).
Install your environment on the fly into a RAM disk/drive.
Containerize your environment into a container image.
Keypoints
Being able to communicate your dependencies is not only nice for others, but also for your future self or the next PhD student or post-doc.
If you ask somebody to help you with your code, they will ask you for the dependencies.