Sharing reproducible containers

Objectives

  • Know about good practices for creating reproducible containers

  • Know about some popular services to share container definition files and images

In this lesson we will go through a handful of guidelines and tips that will help you create reproducible containers that can be shared today and, hopefully, years into the future.

We will encourage you to reuse available base images, be specific about software versions, separate concerns, document your image, and use version control together with public registries.

And finally we will give you an exercise to practice your new skills.

Reuse

As we have learned, building a container means that you pack the OS and all the applications you need into a file. We have also learned that typically we don’t do everything from scratch: we build upon base containers.

This means that when building containers we try to:

  • Use available base containers

  • Add customisation on top of that

An example was using an official Python image for our Python container:

Bootstrap: docker
From: python:latest

%files
    summation.py /opt

%runscript
    echo "Got arguments: $*"
    exec python /opt/summation.py "$@"

%post
    pip install numpy

Here we use a Python base image and install additional software on top, namely NumPy (and we copy our custom Python script into the image).
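Turning this definition file into an image and running it takes two commands. A sketch, assuming the definition file is saved as container.def (depending on your setup, apptainer build may require --fakeroot or root privileges):

apptainer build my_container.sif container.def
# Arguments are forwarded to summation.py inside the container
apptainer run my_container.sif 1 2 3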

Building upon base images is used extensively: the python image is not just Python, it is itself based on another image, which in turn is based on yet another image, and so on.

To trace this image dependency you will need to do a bit of detective work: find the image in a registry and inspect its Dockerfile, which hopefully is linked from the registry page. If no Dockerfile is linked from the registry page, you may be out of luck.
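If you have Docker available, you can also get a quick look at how an image was assembled layer by layer, without hunting for Dockerfiles. A sketch, using the pinned tag that appears later in this lesson:

docker pull python:3.12.7-bookworm
# Show the instruction that created each layer, base layers at the bottom
docker history python:3.12.7-bookworm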

Example image dependency

Let’s check the python repository on Docker Hub. Clicking the link for the latest bookworm tag leads us to its Dockerfile on GitHub.

[Screenshot: the python repository page on Docker Hub]

Inspecting this Dockerfile, we see that it is in turn based on another image, namely buildpack-deps:bookworm.

We can repeat the exercise for the image buildpack-deps:bookworm: find the image in a registry like Docker Hub, navigate to the Dockerfile linked from that registry, and so on.

After all that, this is the image dependency tree we find for the original python docker base image:

--> From: python:latest
  --> FROM buildpack-deps:bookworm
    --> FROM buildpack-deps:bookworm-scm
      --> FROM buildpack-deps:bookworm-curl
        --> FROM debian:bookworm
          --> FROM scratch

Take-away message

Check if there is a suitable official base image for the applications you need, and build upon that.

Be specific

One of the main objectives of using images is that the user gets exactly what they expect, and everything should just work. The container is, after all, self-contained!

During development you might want to have the “latest” versions of software. But “latest” is a moving target: “latest” today is not the same as “latest” in two years, and this is where problems start. Maybe the latest version of your base image is no longer compatible with the other software already in the image, or with the software you are installing into it. This can spoil the party massively!

Take-away message

When sharing/publishing a container, try to be as specific as you can! Always specify software versions.

Taking our Python image as an example, a more future-proof definition file would specify the base image version as well as the NumPy version. Compare the earlier definition file with this pinned version:

Bootstrap: docker
From: python:3.12.7-bookworm

%files
    summation.py /opt

%runscript
    echo "Got arguments: $*"
    exec python /opt/summation.py "$@"

%post
    pip install numpy==1.26.0
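If you developed against “latest” and only decide later which versions to pin, you can read the exact versions out of the already-built container before writing them into the definition file. A sketch, assuming the image was built as my_container.sif:

# Report the Python version inside the container
apptainer exec my_container.sif python --version
# List installed Python packages with their exact versions
apptainer exec my_container.sif pip freeze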

Further below we have an exercise where we can practice recognizing future problems in container definition files.

Separate concerns

Purpose

When creating your image definition file, think about what the image should contain given its purpose. Do not be tempted to add software just because it is convenient for general use.

For instance: an image that is used to run a specific scientific analysis on a specific type of input data may not need your favourite text editor inside. Nor that extra Python package installed “just in case”.

Slim the image down to just what it needs for the purpose it fulfills. The benefits are at least three-fold: the image will be lighter, meaning it is quicker to download and has a smaller carbon footprint; there is less software that can run into dependency problems; and it will be clearer to the user what the purpose of the image is and how to use it.
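As a sketch of what slimming looks like in practice on a Debian-based base image, a %post block can skip recommended packages and clean the package index afterwards (the package chosen here is just an illustration):

%post
    apt-get update
    # Install only what the analysis needs, without recommended extras
    apt-get install -y --no-install-recommends libgomp1
    # Remove the package index cache to keep the image small
    apt-get clean && rm -rf /var/lib/apt/lists/*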

Stay to the point

  • Try to make your image as specific as possible

  • Only add software that is needed for the specific purpose of the container

Data

The main purpose of a software image is exactly that: to provide software, not datasets. There are several reasons why it is not a good idea to include (potentially large) datasets; here are a few:

  • The image could become very heavy

  • The data may be better stored in a suitable data registry

  • The data may be different from user to user

  • The data may be sensitive and should only reside in a private and secure computing environment

Instead of shipping the data with the image, let the user bind mount it into the container. Check out the Binding folders into your container lesson for details.

Compare the two Apptainer definition files for the resulting my_container.sif container: the one below copies the data into the image, while the variant sketched after it bind-mounts the data at runtime. The bind-mount variant also mounts a folder for output data, which is useful in order to access the resulting output directly from the host server.

Bootstrap: docker
From: python:3.9-slim

%files
   process_data.py /app/process_data.py
   input_data /app/input_data

%post
   mkdir /app/output_data
   chmod 777 /app/output_data

%runscript
   python /app/process_data.py /app/input_data /app/output_data

%help
   Usage: apptainer run --writable-tmpfs my_container.sif
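For comparison, here is a minimal sketch of the bind-mount variant, assuming the same (hypothetical) process_data.py script but reading input from, and writing output to, directories mounted at runtime:

Bootstrap: docker
From: python:3.9-slim

%files
   process_data.py /app/process_data.py

%runscript
   python /app/process_data.py /data/input_data /data/output_data

%help
   Usage: apptainer run --bind /path/on/host:/data my_container.sif

The user then makes a host folder (containing input_data, plus an empty output_data) visible inside the container, and the results land directly on the host with no --writable-tmpfs needed:

apptainer run --bind /home/myself/data:/data my_container.sif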

That said, there may be reasons why some particular data is better copied into the container: for instance, reference data that stays unchanged and that is needed for all analyses.

Data key practices

  • Avoid copying data into the container unless there are obvious benefits

Document your image

In the example above you can see that some documentation is added in the image itself under the %help block. This is important not only for sharing, but also for yourself, to help you remember how to use the container. See the Adding documentation to your image lesson for more details.

Documentation key practices

Always add documentation to your image.

  • Minimally how to use the container via the %help block

  • In addition author, version, description via the %labels block (see the sketch below)
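As a sketch, these documentation blocks could look as follows (the author, version, and description are of course placeholders):

%labels
    Author Jane Doe
    Version v1.0.0
    Description Sums the numbers passed as arguments

%help
    This container runs summation.py on the arguments you pass:
        apptainer run my_container.sif 1 2 3

After building, this documentation can be read back with apptainer run-help my_container.sif and apptainer inspect my_container.sif.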

Use version control and public registries

Key practices

  • Track the changes to the definition file with version control. In practice: Put the definition file on GitHub or GitLab.

  • Make the container image findable by others. In practice: Put the image on a public registry.

  • Make sure one can find and inspect the definition file from the registry. In practice: Link the repo to the public registry.

In principle a definition file is enough to build a container image and in theory we would not need to share pre-built images. But in practice it is very useful to share the pre-built image as well. This is because:

  • Building a container image can take time and resources.

  • If we were not careful specifying versions, the image might not build again in the same way.

  • Some dependencies might not be available anymore.
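Pulling a pre-built image, by contrast, is quick and needs no build environment. For example, fetching the pinned base image used earlier in this lesson:

apptainer pull python.sif docker://python:3.12.7-bookworm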

There are many popular services to share container images and almost every big-tech company offers one:

  • Docker Hub: Default Docker registry with public/private repositories and CI/CD integration.

  • Google Artifact Registry: GCP service (successor to the deprecated Google Container Registry), tightly integrated with Google Cloud services and Kubernetes.

  • Azure Container Registry (ACR): Fully managed, integrated with Azure services like AKS and DevOps.

  • Quay.io: Red Hat service, security scanning, OpenShift/Kubernetes integration, public/private repositories.

  • JFrog Artifactory: Universal artifact repository supporting Docker and other formats, advanced security features.

  • Harbor: Open-source registry, role-based access control, vulnerability scanning, and image signing.

  • DigitalOcean Container Registry: Integrated with DigitalOcean Kubernetes.

  • GitHub Container Registry (ghcr.io): Built into GitHub, images can be published from GitHub Actions workflows.

  • GitLab Container Registry: Built into GitLab, works seamlessly with GitLab CI/CD pipelines.

What many projects do (however, note the warning below):

  • Track their container definition files in a public repository on GitHub or GitLab.

  • From these repositories, they build the container images and push them to a public registry from the list above (the push step is sketched below).
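As a sketch of the push step: Apptainer can upload a SIF image to any OCI registry that supports the ORAS protocol. The registry address, user name, and tag below are placeholders:

# Log in once (you will be prompted for a token or password)
apptainer remote login --username myuser oras://registry.example.org

# Upload the image under a versioned tag
apptainer push my_container.sif oras://registry.example.org/myuser/my_container:1.0.0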

Warning

A public registry that is free today might not be free tomorrow. Make sure you have a backup plan for your images and make sure the image can still be found 5 years from now if the service provider changes their pricing model.

Recommendation to “guarantee” long-term availability

  • There are no guarantees, however:

  • One of the most stable services is Zenodo, which is an excellent place to publish your container image as supporting material for a publication, and to get a DOI for it. It is unlikely to change its pricing for academic use.

  • Make sure to also publish the definition file with it.

It is possible to host both the definition file and the image on GitHub. This has a number of advantages, listed in Exercise Sharing-2 below, where we can also practice/demonstrate the approach.

Exercises

Exercise Sharing-1: Time-travel with containers

Imagine the following situation: A researcher has written and published their research code, which requires a number of libraries and system dependencies. They ran their code on a Linux computer (Ubuntu). One very nice thing they did was also to publish a container image with all dependencies included, as well as the definition file (below) used to create the container image.

Now we travel 3 years into the future and want to reuse their work and adapt it for our data. The container registry to which they uploaded the container image, however, no longer exists. But luckily (!) we still have the definition file (below). From this we should be able to create a new container image.

  • Can you anticipate problems using the definition file here 3 years after its creation? Which possible problems can you point out?

  • Discuss possible take-aways for creating more reusable containers.

Bootstrap: docker
From: ubuntu:latest

%post
    # Set environment variables
    export VIRTUAL_ENV=/app/venv

    # Install system dependencies and Python 3
    apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc \
        libgomp1 \
        python3 \
        python3-venv \
        python3-distutils \
        python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

    # Set up the virtual environment
    python3 -m venv $VIRTUAL_ENV
    . $VIRTUAL_ENV/bin/activate

    # Install Python libraries
    pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /app/requirements.txt

%files
    # Copy project files
    ./requirements.txt /app/requirements.txt
    ./app.py /app/app.py
    # Copy data
    /home/myself/data /app/data
    # Workaround to fix dependency on fancylib
    /home/myself/fancylib /usr/lib/fancylib

%environment
    # Set the environment variables
    export LANG=C.UTF-8 LC_ALL=C.UTF-8
    export VIRTUAL_ENV=/app/venv

%runscript
    # Activate the virtual environment
    . $VIRTUAL_ENV/bin/activate
    # Run the application
    python /app/app.py

Exercise Sharing-2: Building a container on GitHub

You can build a container on GitHub (using GitHub Actions) or GitLab (using GitLab CI) and host the image on GitHub/GitLab. This has the following advantages:

  • You don’t need to host it yourself.

  • The image stays close to its sources and is not on a different service.

  • Anybody can inspect the recipe and how it was built.

  • Every time you make a change to the recipe, it builds a new image.

If you want to try this out:

  • Take this repository as a starting point and inspiration.

  • We don’t need to focus too much on what this container does, but rather on how it is built.

  • To build a new version, one needs to send a pull request which updates the file VERSION and modifies the definition file.

  • Using this approach, try to build a very simple container definition directly on GitHub where the goal is to have both the definition file and the image file in the same place.