Recording dependencies

Questions

  • How can we communicate different versions of software dependencies?

Instructor note

  • 10 min teaching

  • 10 min demo

Our codes often depend on other codes that in turn depend on other codes …

  • Reproducibility: We can version-control our code with Git, but how should we version-control dependencies? How can we capture and communicate dependencies?

  • Dependency hell: Different codes in the same environment can have conflicting dependencies.

An image showing blocks (=codes) depending on each other for stability

From xkcd - dependency. Another image that might be familiar to some of you working with Python can be found on xkcd - superfund.

Kitchen analogy

  • Software <-> recipe

  • Data <-> ingredients

  • Libraries <-> cooking books/blogs

Cooking recipe in an unfamiliar language [Midjourney, CC-BY-NC 4.0]

Kitchen with few open cooking books

When we create recipes, we often use existing recipes written by others (libraries) [Midjourney, CC-BY-NC 4.0]


Tools and what problems they try to solve

Tools and files such as Conda, Anaconda, pip, virtualenv, Pipenv, pyenv, Poetry, requirements.txt, environment.yml, renv, … try to solve the following problems:

  • Defining a specific set of dependencies, possibly with well-defined versions

  • Installing those dependencies mostly automatically

  • Recording the versions of all dependencies

  • Isolating environments

    • On your own computer, so that different projects can use different software

    • On computers with many users (and allowing self-installations)

  • Using different Python/R versions per project

  • Providing tools and services to share packages

Isolated environments are also useful because they help you make sure that you know your dependencies!

If things go wrong, you can delete the environment and re-create it, which is often much easier than debugging it. The more often you re-create your environment, the more confident you can be that it is reproducible.
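
To make "recording the versions" concrete, here is a minimal Python sketch that lists the packages visible to the running interpreter together with their versions, in the name==version form that pip understands. The function name installed_versions is our own choice and not part of any tool named above. The sketch sees only Python packages, not other software that a conda environment may contain.

```python
from importlib import metadata


def installed_versions():
    """Return a sorted list of 'name==version' strings for the active environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print one pinned requirement per line
    for line in installed_versions():
        print(line)
```

In practice, commands such as pip freeze or conda env export do this job. The sketch only shows that the information can be read from the environment itself rather than written down from memory.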


Demo

Dependencies-1: Time-capsule of dependencies

Situation: Five students (A, B, C, D, E) each wrote code that depends on a couple of libraries. They uploaded their projects to GitHub. We now travel 3 years into the future, find their GitHub repositories, and try to re-run their code before adapting it.

Answer in the collaborative document:

  • Which version do you expect to be easiest to re-run? Why?

  • What problems do you anticipate in each solution?

    A: You find a couple of library imports across the code but that’s it.

    B: The README file lists which libraries were used but does not mention any versions.

    C: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy
      - numpy
      - sympy
      - click
      - python
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@master
        - git+https://github.com/anotheruser/anotherproject.git@master
    

    D: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@d7b2c7e
        - git+https://github.com/anotheruser/anotherproject.git@sometag
    

    E: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - someproject=1.2.3
      - anotherproject=2.3.4
    

Dependencies-2: Create a time-capsule for the future

Now we will demo creating our own time-capsule and sharing it with the future world. If we asked you now which dependencies your project is using, what would you answer? How would you find out? And how would you communicate this information?
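
One possible way to start answering "how would you find out?" for a Python project is to scan the source files for import statements. The following is a minimal sketch under a few assumptions: the helper names top_level_imports and project_imports are hypothetical, every .py file is assumed to parse without syntax errors, and standard-library modules can only be filtered out on Python 3.10 or newer.

```python
import ast
import sys
from pathlib import Path


def top_level_imports(source):
    """Return the set of top-level module names imported in a source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import (from . import x), not a dependency
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def project_imports(root):
    """Collect imports from every .py file below root, minus the standard library."""
    found = set()
    for path in Path(root).rglob("*.py"):
        found |= top_level_imports(path.read_text(encoding="utf-8"))
    # sys.stdlib_module_names exists from Python 3.10 on; older versions filter nothing
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(found - set(stdlib))
```

Calling project_imports(".") in a project directory returns a list of imported module names. The list is only a starting point: it also contains the project's own modules, import names can differ from package names (for example, sklearn is installed as scikit-learn), and it says nothing about versions.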

We start from an existing conda environment. Try this either with your own project or inside the “coderefinery” conda environment. For demonstration purposes, you can also create an environment with:

$ conda env create -f myenv.yml

Here, the file myenv.yml could list some Python libraries with unspecified versions:

name: myenv
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - numpy
  - pandas
  - seaborn

After creating the environment, we can activate it with:

$ conda activate myenv

Now we can freeze the environment into a new YAML file with:

$ conda env export > environment.yml

Have a look at the generated file and discuss what you see.

In the future, or on a different computer, we can re-create this environment with:

$ conda env create -f environment.yml

What happens instead when you run the following command?

$ conda env export --from-history > environment_fromhistory.yml
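
When comparing the exported files, it can help to take individual entries apart. Entries written by conda env export typically have the form name=version=build, while entries typed by hand (as in myenv.yml) may give only a name or name=version. The sketch below splits such an entry into its parts. The helper names split_spec and drop_build are hypothetical, the build string in the usage example is made up, and the sketch assumes the single "=" separator, so it does not handle specs written with "==" or ">=".

```python
def split_spec(spec):
    """Split 'name=version=build' into (name, version, build); missing parts are None."""
    parts = spec.strip().split("=")
    parts += [None] * (3 - len(parts))
    return tuple(parts[:3])


def drop_build(spec):
    """Keep name and version but discard the platform-specific build string."""
    name, version, _build = split_spec(spec)
    return name if version is None else f"{name}={version}"
```

For example, drop_build("numpy=1.26.4=py310h0000000_0") gives "numpy=1.26.4", while drop_build("pandas") leaves "pandas" unchanged. Build strings are tied to a platform, so an entry without one is more likely to install on a different operating system. conda offers this directly through conda env export --no-builds.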

More information: https://docs.conda.io/en/latest/

See also: https://github.com/mamba-org/mamba

Keypoints

  • Recording dependencies with versions can make it easier for the next person to execute your code

  • There are many tools to record dependencies