How to make the project more reusable

Objectives

Very little code has no dependencies. How should we deal with them?

How to avoid: “It works on my machine 🤷”

Use a standard way to list dependencies in your project:

  • Python: requirements.txt or environment.yml

  • R: DESCRIPTION or renv.lock

  • Rust: Cargo.toml (with exact versions recorded in Cargo.lock)

  • Julia: Project.toml

  • C/C++/Fortran: CMakeLists.txt or Makefile or spack.yaml or the module system on clusters or containers

  • Other languages: …

Tools and what problems they try to solve

Conda, Anaconda, mamba, pip, virtualenv, Pipenv, pyenv, Poetry, requirements.txt, environment.yml, renv, …: these tools try to solve the following problems:

  • Defining a specific set of dependencies, possibly with well-defined versions

  • Installing those dependencies mostly automatically

  • Recording the versions for all dependencies

  • Isolating environments

    • On your computer, so that different projects can use different software

    • On computers with many users (and allowing self-installations)

  • Using different Python/R versions per project

  • Providing tools and services to share packages

Best practices

Install dependencies into isolated environments:

  • For each project, create a new environment.

  • Don’t install dependencies globally for all projects.

  • Install them from a file which documents them at the same time.
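
The three practices above can be sketched with Python's standard library alone. This is a minimal sketch, assuming a pip-based workflow; the function names `env_python` and `create_project_env` and the `with_pip` switch are illustrative and not part of the example project.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import Optional


def env_python(env_dir: Path) -> Path:
    """Return the path of the Python interpreter inside the environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_project_env(
    env_dir: Path,
    requirements: Optional[Path] = None,
    with_pip: bool = True,
) -> Path:
    """Create a fresh, isolated environment for one project.

    If a requirements file is given, install the dependencies from it,
    so the file that documents them is also the file that installs them.
    Installing requires with_pip=True (assumption: pip-based workflow).
    """
    # clear=True removes an existing environment at this location first
    venv.create(env_dir, with_pip=with_pip, clear=True)
    python = env_python(env_dir)
    if requirements is not None:
        # use the environment's own interpreter, not the global one
        subprocess.run(
            [str(python), "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
    return python
```

A call such as `create_project_env(Path("venv"), Path("requirements.txt"))` would then create the environment next to the project and install into it, leaving the global Python installation untouched.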

Keypoints

If somebody asks you what dependencies you have in your project, you should be able to answer this question with a file.

In Python, the two most common ways to do this are:

  • requirements.txt (for pip and virtual environments)

  • environment.yml (for conda and similar)

You can export the dependencies from your current environment into these files:

# inside a conda environment
$ conda env export --from-history > environment.yml

# inside a virtual environment
$ pip freeze > requirements.txt
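
To see roughly what such an export records, the installed packages can also be listed from within Python. The following is a sketch that only approximates `pip freeze` (it does not handle editable or URL-based installs, for example); the function name `installed_requirements` is an assumption made for illustration.

```python
from importlib.metadata import distributions


def installed_requirements():
    """Return sorted 'name==version' lines for the active environment."""
    lines = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installations with missing metadata
            lines.add(f"{name}=={dist.version}")
    # case-insensitive sort, similar to how freeze output is usually read
    return sorted(lines, key=str.lower)
```

Run inside an activated environment, `print("\n".join(installed_requirements()))` prints one pinned line per installed package.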

Discussion

  • The dependencies in our example project are listed in an environment.yml file.

  • Shouldn’t the dependencies be pinned to specific versions?

  • When is a good time to pin them?
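
As an aid for this discussion, the sketch below flags dependency entries that do not name a version. It assumes simple entries of the form `name`, `name=version` (conda) or `name==version` (pip); the names `PINNED` and `unpinned` are hypothetical. Note that in conda a single `=` matches a version prefix, so it is looser than pip's `==`.

```python
import re

# name, then "=" (conda) or "==" (pip), then a version starting with a digit
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+={1,2}[0-9][A-Za-z0-9_.!+\-]*$")


def unpinned(specs):
    """Return the dependency specs that do not name a specific version."""
    return [spec for spec in specs if not PINNED.match(spec.strip())]
```

For example, `unpinned(["scipy=1.3.1", "numpy", "click==7.0", "python>=3.8"])` returns `["numpy", "python>=3.8"]`, since a bare name and a lower bound both leave the installed version open.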

Exercise

Exercise: Time-capsule of dependencies

Situation: Five students (A, B, C, D, E) each wrote code that depends on a couple of libraries. They uploaded their projects to GitHub. We now travel 3 years into the future, find their GitHub repositories, and try to re-run their code before adapting it.

  • Which version do you expect to be easiest to re-run? Why?

  • What problems do you anticipate in each solution?

    A: You find a couple of library imports across the code but that’s it.

    B: The README file lists which libraries were used but does not mention any versions.

    C: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy
      - numpy
      - sympy
      - click
      - python
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@master
        - git+https://github.com/anotheruser/anotherproject.git@master
    

    D: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@d7b2c7e
        - git+https://github.com/anotheruser/anotherproject.git@sometag
    

    E: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - someproject=1.2.3
      - anotherproject=2.3.4
    

Containers

  • A container is like an operating system inside a file.

  • “Building a container”: Container definition file (recipe) -> Container image

  • Let us explore and discuss the container definition file in our example project.

  • This can be used with Apptainer or SingularityCE.

Containers offer the following advantages:

  • Reproducibility: The same software environment can be recreated on different computers. Containers also force you to know and document all your dependencies.

  • Portability: The same software environment can be run on different computers.

  • Isolation: The software environment is isolated from the host system.

  • “Time travel”:

    • You can run old/unmaintained software on new systems.

    • Code that needs new dependencies which are not available on old systems can still be run on old systems.

Demonstration: Building a container

Demo: Build a container and run it on a cluster

Here we will try to build a container from the definition file of our example project.

Requirements:

  1. Linux (it is possible to build containers on macOS or Windows, but it is more complicated).

  2. An installation of Apptainer (e.g. following the quick installation). Alternatively, SingularityCE should also work.

Now you can build the container image from the container definition file. Depending on the configuration you might need to run the command with sudo or with --fakeroot.

Hopefully one of these four will work:

$ sudo apptainer build container.sif container.def
$ apptainer build --fakeroot container.sif container.def

$ sudo singularity build container.sif container.def
$ singularity build --fakeroot container.sif container.def

Once you have the container.sif, copy it to a cluster and try to run it there.

Here is an example job script:

#!/usr/bin/env bash

# the SBATCH directives and the module load below are only relevant for the
# Dardel cluster and the PDC Summer School; adapt them for your cluster

#SBATCH --account=edu24.summer
#SBATCH --job-name='container'
#SBATCH --time=0-00:05:00

#SBATCH --partition=shared

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16


module load PDC singularity


# catch common shell script errors
set -euf -o pipefail


echo
echo "what is the operating system on the host?"
cat /etc/os-release


echo
echo "what is the operating system in the container?"
singularity exec container.sif cat /etc/os-release


# 1000 planets, 20 steps
time ./container.sif 1000 20 ${SLURM_CPUS_PER_TASK} results

Where to explore more