Sharing reproducible containers
Objectives
Know about good practices for creating reproducible containers
Know about some popular services to share container definition files and images
In this lesson we will go through a handful of guidelines and tips that will help you to create reproducible containers which can be shared today and hopefully years into the future.
We will encourage you to reuse existing base images, be specific about software versions, separate concerns, document your image, and use version control and public registries.
And finally we will give you an exercise to practice your new skills.
Reuse
As we have learned, building a container means that you pack the OS and all the applications you need into a file. We have also learned that typically we don’t do everything from scratch; we build upon base containers.
This means that when building containers we try to:
Use available base containers
Add customisation on top of that
An example was using an official python image for our python container:
Bootstrap: docker
From: python:latest
%files
summation.py /opt
%runscript
echo "Got arguments: $*"
exec python /opt/summation.py "$@"
%post
pip install numpy
Here we use a python base image and in addition we install some more software: numpy (and we copy our custom python script into the image).
Building upon base images is used extensively: the python image is not just python, it is itself based on another image, which in turn is based on another image, and so on.
To trace these dependencies you will need to do a bit of detective work: find the image in a registry and inspect its Dockerfile, which hopefully is linked from the registry page. If no Dockerfile is linked from the registry page, you may be out of luck.
Example image dependency
Let’s check the Docker Hub python repository. We can click on the link of the latest bookworm tag (see the DockerHub python tab below), which leads us to its Dockerfile on GitHub (as seen in the Dockerfile python tab).
Inspecting this Dockerfile, we see that it is again based on another image, namely buildpack-deps:bookworm.
We can do the same exercise for the image buildpack-deps:bookworm: find it in a registry like Docker Hub, navigate to the Dockerfile linked from that registry, and so on.
After all that, this is the image dependency tree we find for the original python docker base image (below the tree we also sketch a quicker way to peek at an image’s layers):
--> From: python:latest
    --> FROM buildpack-deps:bookworm
        --> FROM buildpack-deps:bookworm-scm
            --> FROM buildpack-deps:bookworm-curl
                --> FROM debian:bookworm
                    --> FROM scratch
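If you have Docker available, you can also get a rough view of how an image was assembled without hunting down every Dockerfile, by listing the layers it was built from. A minimal sketch (the exact output format differs between Docker versions):

# Download the image and list the build steps behind each layer,
# newest first; this hints at the base images and commands used
docker pull python:latest
docker history --no-trunc python:latest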
Take-away message
Check if there is a suitable official base image for the applications you need, and build upon that.
Popular base images
There probably exists a base image for your needs, almost whatever they are. If you web-search e.g. “best docker containers” you will find useful lists of popular images. Here is a selection of images we find very useful (after the list we sketch how to pull one of them with Apptainer):
Alpine (slim Linux OS)
BusyBox (slim Linux OS with many common Linux utilities)
Nginx (web server)
Ubuntu (Linux OS)
PostgreSQL (database)
Node (web development)
MySQL (database)
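For example, to build on one of these images with Apptainer you can pull it directly from Docker Hub and convert it to a SIF file. A minimal sketch (the image name and tag are just examples):

# Pull the official Alpine image from Docker Hub into a local SIF file
apptainer pull alpine.sif docker://alpine:3.20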
Once you have found a suitable base image, you must think about which version to choose. You will see that each image comes in a selection of different versions, so which should you pick? We will explore this in the next section.
Be specific
One of the main objectives of using images is that users get exactly what they expect, and everything should just work. The container is, after all, self-contained!
During development you might want the “latest” versions of software. But “latest” is a moving target: “latest” today is not the same as “latest” in 2 years, and this can get you into problems. Maybe the latest version of your base image is no longer compatible with the other software included in the image, or with the software you are adding on top. This can spoil the party massively!
Take-away message
When sharing/publishing a container, try to be as specific as you can! Always specify software versions.
Taking our python image as an example, a more future-proof definition file would specify the base image version as well as the numpy version. Compare these two (after them we sketch how to find out which versions you are currently using):
Bootstrap: docker
From: python:3.12.7-bookworm
%files
summation.py /opt
%runscript
echo "Got arguments: $*"
exec python /opt/summation.py "$@"
%post
pip install numpy==1.26.0
Bootstrap: docker
From: python:latest
%files
summation.py /opt
%runscript
echo "Got arguments: $*"
exec python /opt/summation.py "$@"
%post
pip install numpy
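If you developed against “latest” and now want to pin the versions you actually ended up with, you can query the built container before writing the final definition file. A minimal sketch, assuming the image was built as my_container.sif:

# Check which Python and numpy versions the container actually contains,
# so they can be pinned in the definition file
apptainer exec my_container.sif python --version
apptainer exec my_container.sif python -m pip freeze | grep -i numpy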
Further below we have an exercise where we can practice recognizing future problems in container definition files.
Separate concerns
Purpose
When creating your image definition file, think about what the image should contain based on its purpose. Do not be tempted to add software just because it is convenient for general use.
For instance, an image used to run a specific scientific analysis on a specific type of input data probably does not need your favourite text editor inside, or that extra Python package installed “just in case”.
Slim the image down to just what it needs for its purpose. The benefits are at least three-fold: the image will be lighter, meaning it is quicker to download and has a smaller carbon footprint; there is less software that can run into dependency problems; and it is clearer to the user what the image is for and how to use it.
Stay to the point
Try to make your image as specific as possible
Only add software that is needed for the specific purpose of the container
Data
The main purpose of a software image is exactly that: to provide software, not datasets. There are several reasons why it is not a good idea to include (potentially large) datasets; here are a few:
The image could become very heavy
The data may be better stored in a suitable data registry
The data may be different from user to user
The data may be sensitive and should only reside in a private and secure computing environment
Instead of shipping the data with the image, let the user bind mount it into the container. Check out the Binding folders into your container lesson for details.
Compare the two apptainer definition files and how to run the resulting my_container.sif container. The right tab also bind-mounts a folder for output data, which lets you access the resulting output directly on the host server. A concrete run command is sketched after the examples.
Bootstrap: docker
From: python:3.9-slim
%files
process_data.py /app/process_data.py
input_data /app/input_data
%post
mkdir /app/output_data
chmod 777 /app/output_data
%runscript
python /app/process_data.py /app/input_data /app/output_data
%help
Usage: apptainer run --writable-tmpfs this_container.sif
Bootstrap: docker
From: python:3.9-slim
%files
process_data.py /app/process_data.py
%post
mkdir /app/output_data
mkdir /app/input_data
%runscript
python /app/process_data.py /app/input_data /app/output_data
%help
Usage: apptainer run --bind /path/to/host/input:/app/input_data,/path/to/host/output:/app/output_data this_container.sif
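For example, with the bind-mount variant (right tab) a user could run the container against data on the host like this. A minimal sketch where the host paths are placeholders and the image is assumed to be built as my_container.sif:

# Bind host folders onto the input and output locations the container expects
apptainer run \
  --bind /path/on/host/input:/app/input_data,/path/on/host/output:/app/output_data \
  my_container.sif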
That said, there may be reasons why some particular data is better copied into the container. For instance, reference data that stays unchanged and is needed for all analyses.
Data key practices
Avoid copying data into the container unless there are obvious benefits
Document your image
In the example above you can see that some documentation is added in the image itself under the %help block. This is not only important for sharing, but also helps you remember how to use the container yourself. A small sketch of the %help and %labels blocks follows the key practices below. See more details in the Adding documentation to your image lesson.
Documentation key practices
Always add documentation to your image.
Minimally, how to use the container, via the %help block
In addition, author, version, and description, via the %labels block
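As a sketch, such metadata could look like this in the definition file (the author, version, and description are placeholders). After building, the %help text can be read with apptainer run-help my_container.sif and the labels with apptainer inspect my_container.sif:

%labels
    Author Jane Doe (jane.doe@example.org)
    Version v1.0.0
    Description Sums numbers passed on the command line using numpy

%help
    Build the container:  apptainer build my_container.sif container.def
    Run the container:    apptainer run my_container.sif 1 2 3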
Use version control and public registries
Key practices
Track the changes to the definition file with version control. In practice: Put the definition file on GitHub or GitLab.
Make the container image findable by others. In practice: Put the image on a public registry.
Make sure one can find and inspect the definition file from the registry. In practice: Link the repo to the public registry.
In principle a definition file is enough to build a container image and in theory we would not need to share pre-built images. But in practice it is very useful to share the pre-built image as well. This is because:
Building a container image can take time and resources.
If we were not careful specifying versions, the image might not build again in the same way.
Some dependencies might not be available anymore.
There are many popular services to share container images, and almost every big-tech company offers one (after the list we sketch how to push an image to such a registry with Apptainer):
Docker Hub: Default Docker registry with public/private repositories and CI/CD integration.
Google Container Registry (GCR): GCP service, tightly integrated with Google Cloud services and Kubernetes.
Azure Container Registry (ACR): Fully managed, integrated with Azure services like AKS and DevOps.
Quay.io: Red Hat service, security scanning, OpenShift/Kubernetes integration, public/private repositories.
JFrog Artifactory: Universal artifact repository supporting Docker and other formats, advanced security features.
Harbor: Open-source registry, role-based access control, vulnerability scanning, and image signing.
DigitalOcean Container Registry: Integrated with DigitalOcean Kubernetes.
GitLab Container Registry: Built into GitLab, works seamlessly with GitLab CI/CD pipelines.
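With Apptainer, a pre-built SIF image can be pushed to and pulled from many of these registries using the ORAS protocol. A minimal sketch where the registry, user, and image names are placeholders:

# Log in to the registry (you will be prompted for a password or access token)
apptainer remote login --username myuser oras://registry.example.com

# Push the local image; anyone with access can then pull it back
apptainer push my_container.sif oras://registry.example.com/myuser/my_container:1.0.0
apptainer pull oras://registry.example.com/myuser/my_container:1.0.0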
What many projects do (however, note the warning below):
Track their container definition files in a public repository on GitHub or GitLab.
From these repositories, they build the container images and push them to a public registry (above list).
Warning
A public registry that is free today might not be free tomorrow. Make sure you have a backup plan for your images and make sure the image can still be found 5 years from now if the service provider changes their pricing model.
Recommendation to “guarantee” long-term availability
There are no guarantees, however:
One of the most stable services is Zenodo, which is an excellent place to publish your container image as supporting material for a publication and to get a DOI for it. It is unlikely to change its pricing for academic use.
Make sure to also publish the definition file with it.
It is possible to host both the definition file and the image on GitHub:
You don’t need to host it yourself.
The image stays close to its sources and is not on a different service.
Anybody can inspect the recipe and how it was built.
Every time you make a change to the recipe, it builds a new image.
We can practice/demonstrate this in the exercise below.
Exercises
Exercise Sharing-1: Time-travel with containers
Imagine the following situation: a researcher has written and published their research code which requires a number of libraries and system dependencies. They ran their code on a Linux computer (Ubuntu). One very nice thing they did was to also publish a container image with all dependencies included, as well as the definition file (below) used to create the container image.
Now we travel 3 years into the future and want to reuse their work and adapt it for our data. The container registry where they uploaded the container image, however, no longer exists. But luckily (!) we still have the definition file (below). From this we should be able to create a new container image.
Can you anticipate problems using the definition file here 3 years after its creation? Which possible problems can you point out?
Discuss possible take-aways for creating more reusable containers.
 1  Bootstrap: docker
 2  From: ubuntu:latest
 3
 4  %post
 5      # Set environment variables
 6      export VIRTUAL_ENV=/app/venv
 7
 8      # Install system dependencies and Python 3
 9      apt-get update && \
10      apt-get install -y --no-install-recommends \
11          gcc \
12          libgomp1 \
13          python3 \
14          python3-venv \
15          python3-distutils \
16          python3-pip && \
17      apt-get clean && \
18      rm -rf /var/lib/apt/lists/*
19
20      # Set up the virtual environment
21      python3 -m venv $VIRTUAL_ENV
22      . $VIRTUAL_ENV/bin/activate
23
24      # Install Python libraries
25      pip install --no-cache-dir --upgrade pip && \
26      pip install --no-cache-dir -r /app/requirements.txt
27
28  %files
29      # Copy project files
30      ./requirements.txt /app/requirements.txt
31      ./app.py /app/app.py
32      # Copy data
33      /home/myself/data /app/data
34      # Workaround to fix dependency on fancylib
35      /home/myself/fancylib /usr/lib/fancylib
36
37  %environment
38      # Set the environment variables
39      export LANG=C.UTF-8 LC_ALL=C.UTF-8
40      export VIRTUAL_ENV=/app/venv
41
42  %runscript
43      # Activate the virtual environment
44      . $VIRTUAL_ENV/bin/activate
45      # Run the application
46      python /app/app.py
Solution
Line 2: “ubuntu:latest” will mean something different 3 years in the future.
Lines 11-12: The compiler gcc and the library libgomp1 will have evolved.
Line 30: The container uses requirements.txt to build the virtual environment but we don’t see here what libraries the code depends on.
Line 33: Data is copied in from the hard disk of the person who created it. Hopefully we can find the data somewhere.
Line 35: The library fancylib has been built outside the container and copied in but we don’t see here how it was done.
The Python version will be different by then; hopefully the code will still run.
Singularity/Apptainer will also have evolved by then; hopefully this definition file will still work.
No help text.
No contact address to ask more questions about this file.
(Can you find more? Please contribute more points.)
This definition file has potential problems 3 years later; further down on this page we show a better and real version. Here is a second definition file to analyze in the same way:
 1  Bootstrap: docker
 2  From: ubuntu:latest
 3
 4  %post
 5      export DEBIAN_FRONTEND=noninteractive
 6      apt-get update -y
 7
 8      apt install -y git build-essential pkg-config
 9      apt install -y libz-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev libgsl-dev
10
11      git clone --recursive https://github.com/someuser/sometool.git
12      cd sometool
13
14      make
15
16  %files
17      # Workaround to fix dependency on fancylib
18      /home/myself/fancylib /usr/lib/fancylib
19
20  %environment
21      export LC_ALL=C
22
23  %runscript
24      export PATH=/sometool:$PATH
25
26      $@
Solution
Line 2: “ubuntu:latest” will mean something different 3 years in the future.
Line 9: The libraries installed here will have evolved.
Line 11: We clone a Git repository recursively, and that repository might change by the time we build the container image again; it would be safer to check out a specific tag or commit.
Line 18: The library fancylib has been built outside the container and copied in but we don’t see here how it was done.
Singularity/Apptainer will also have evolved by then; hopefully this definition file will still work.
No help text.
No contact address to ask more questions about this file.
(Can you find more? Please contribute more points.)
Exercise Sharing-2: Building a container on GitHub
You can build a container on GitHub (using GitHub Actions) or GitLab (using GitLab CI) and host the image on GitHub/GitLab. This has the following advantages:
You don’t need to host it yourself.
The image stays close to its sources and is not on a different service.
Anybody can inspect the recipe and how it was built.
Every time you make a change to the recipe, it builds a new image.
If you want to try this out:
Take this repository as a starting point and inspiration.
We don’t need to focus too much on what this container does, but rather on how it is built.
To build a new version, one needs to send a pull request which updates the VERSION file and modifies the definition file.
Using this approach, try to build a very simple container definition directly on GitHub, where the goal is to have both the definition file and the image file in the same place.