MPI programs and containers
Objectives
Learn what complications are involved with MPI containers
Learn how to generate an MPI container for your HPC system
What to consider when creating a container for MPI programs?
Message Passing Interface (MPI) is a standardized API and a programming paradigm where programs can use MPI directives to send messages across thousands of processes. It is commonly used in traditional HPC computing.
To handle the scale of MPI programs, MPI installations are typically tied to the high-speed interconnect available in the computational cluster and to the cluster's queue system.
This can create the following problems when an MPI program is containerized:
Launching of the MPI job can fail if the program does not communicate with the queue system.
The MPI communication performance can be poor if the program does not utilize the high-speed interconnects correctly.
The container can have portability issues when taking it to a different cluster with different MPI, queue system or interconnect.
To solve these problems, we first need to know how MPI works.
How MPI works
The launch process for an MPI program works like this:
A reservation is made in the queue system for some number of MPI tasks.
When the reservation gets its resources, individual MPI tasks are launched by the queue system (srun) or by an MPI launcher (mpirun).
The user's MPI program calls the MPI libraries it was built against.
These libraries ask the queue system how many other MPI tasks there are.
Individual MPI tasks start running the program collectively. Communication between tasks is done via fast interconnects.
To make this work with various queue systems and interconnects, MPI installations often use the Process Management Interface (PMI/PMI2/PMIx) to connect to the queue system and Unified Communication X (UCX) to connect to the interconnects.
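The task-discovery step above can be sketched in shell. Under Slurm, srun exports variables such as SLURM_PROCID and SLURM_NTASKS to every task it launches; an MPI library obtains roughly the same information through PMI rather than by reading these variables directly. This is only an illustrative sketch, and the fallback defaults exist solely so it also runs outside a job.

```shell
#!/bin/bash
# Rough sketch: print the task identity that the queue system
# advertises to each launched process. An MPI library retrieves the
# same information through PMI/PMI2/PMIx instead of reading these
# environment variables itself.
mpi_task_identity() {
    # Fallbacks (task 0 of 1) apply when run outside a Slurm job.
    echo "I am task ${SLURM_PROCID:-0} of ${SLURM_NTASKS:-1}"
}

mpi_task_identity
```

Running this under srun with several tasks would print one line per task, each with its own rank.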
How to use MPI with a container
The most common way of running MPI programs in containers is the hybrid model, where the container includes the same MPI version as the host system.
In this model the host's MPI launcher calls the MPI inside the container and uses it to launch the application.
Do note that the MPI inside the container does not necessarily know how to utilize the fast interconnects. We’ll talk about solving this later.
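As a minimal sketch of the hybrid launch described above: the host-side launcher starts one container instance per MPI task, and the program inside initializes the container's MPI. The image and program names here (image.sif, ./mpi_app) are placeholders, not files from the examples below.

```shell
#!/bin/bash
# Sketch of a hybrid-model launch command. The host-side srun starts
# one container instance per MPI task; the application inside then
# initializes the container's MPI, which picks up the task
# information provided by the launcher.
build_hybrid_launch() {
    # $1 = container image, $2 = program (and args) inside the image
    echo "srun apptainer exec $1 $2"
}

# Placeholder names for illustration:
build_hybrid_launch image.sif ./mpi_app
```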
Creating a simple MPI container
Let’s construct an example container that runs a simple MPI benchmark from OSU Micro-Benchmarks. This benchmark suite is useful for testing whether the MPI installation works and whether the MPI can utilize the fast interconnect.
Because different sites have different MPI versions, the definition files differ as well. Pick the definition file for your site.
Bootstrap: docker
From: ubuntu:latest
%arguments
NPROCS=4
OPENMPI_VERSION=4.1.6
OSU_MICRO_BENCHMARKS_VERSION=7.4
%post
### Install OpenMPI dependencies
apt-get update
apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev
### Build OpenMPI
OPENMPI_VERSION_SHORT=$(echo {{ OPENMPI_VERSION }} | cut -f 1-2 -d '.')
cd /opt
mkdir ompi
wget -q https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION_SHORT}/openmpi-{{ OPENMPI_VERSION }}.tar.bz2
tar -xvf openmpi-{{ OPENMPI_VERSION }}.tar.bz2
# Compile and install
cd openmpi-{{ OPENMPI_VERSION }}
./configure --prefix=/opt/ompi --with-ucx=/usr
make -j{{ NPROCS }}
make install
cd ..
rm -rf openmpi-{{ OPENMPI_VERSION }} openmpi-{{ OPENMPI_VERSION }}.tar.bz2
### Build example application
export OMPI_DIR=/opt/ompi
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
# Build osu benchmarks
cd /opt
wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
tar xf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
cd osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}
./configure --prefix=/opt/osu-micro-benchmarks CC=/opt/ompi/bin/mpicc CFLAGS=-O3
make -j{{ NPROCS }}
make install
cd ..
rm -rf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }} osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
%environment
export OMPI_DIR=/opt/ompi
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
export MANPATH="$OMPI_DIR/share/man:$MANPATH"
%runscript
/opt/osu-micro-benchmarks/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
To build:
srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build triton-openmpi.sif triton-openmpi.def
To run (some extra parameters are needed to prevent launch errors):
$ module load openmpi/4.1.6
$ export PMIX_MCA_gds=hash
$ export UCX_POSIX_USE_PROC_LINK=n
$ export OMPI_MCA_orte_top_session_dir=/tmp/$USER/openmpi
$ srun --partition=batch-milan --mem=2G --nodes=2-2 --ntasks-per-node=1 --time=00:10:00 apptainer run triton-openmpi.sif
srun: job 3521915 queued and waiting for resources
srun: job 3521915 has been allocated resources
# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 3.98
2 8.05
4 15.91
8 32.03
16 64.24
32 125.47
64 245.52
128 469.00
256 877.69
512 1671.24
1024 3218.11
2048 5726.91
4096 8096.24
8192 10266.18
16384 11242.78
32768 11298.70
65536 12038.27
131072 12196.28
262144 12202.05
524288 11786.58
1048576 12258.48
2097152 12179.43
4194304 12199.89
Bootstrap: docker
From: rockylinux:{{ OS_VERSION }}
%arguments
NPROCS=4
OPENMPI_VERSION=4.1.4rc1
OSU_MICRO_BENCHMARKS_VERSION=7.4
GCC_VERSION=9
UCX_VERSION=1.13.0
OS_NAME=rhel
OS_VERSION=8.6
OFED_VERSION=5.6-2.0.9.0
%post
### Install OpenMPI dependencies
# Base tools and newer gcc version
dnf install -y dnf-plugins-core epel-release
dnf config-manager --set-enabled powertools
dnf install -y make gdb wget numactl-devel which
dnf -y install gcc-toolset-{{ GCC_VERSION }}
source /opt/rh/gcc-toolset-{{ GCC_VERSION }}/enable
# Enable Mellanox OFED rpm repo
wget https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
rpm --import RPM-GPG-KEY-Mellanox
rm RPM-GPG-KEY-Mellanox
cd /etc/yum.repos.d/
wget https://linux.mellanox.com/public/repo/mlnx_ofed/{{ OFED_VERSION }}/{{ OS_NAME }}{{ OS_VERSION }}/mellanox_mlnx_ofed.repo
cd /
# Install network library components
dnf -y install rdma-core ucx-ib-{{ UCX_VERSION }} ucx-devel-{{ UCX_VERSION }} ucx-knem-{{ UCX_VERSION }} ucx-cma-{{ UCX_VERSION }} ucx-rdmacm-{{ UCX_VERSION }}
### Install OpenMPI
dnf -y install openmpi-{{ OPENMPI_VERSION }}
### Build example application
export OMPI_DIR=/usr/mpi/gcc/openmpi-{{ OPENMPI_VERSION }}
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
# Build osu benchmarks
cd /opt
wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
tar xf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
cd osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}
./configure --prefix=/opt/osu-micro-benchmarks CC=mpicc CFLAGS=-O3
make -j{{ NPROCS }}
make install
cd ..
rm -rf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }} osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
%environment
export OMPI_DIR=/usr/mpi/gcc/openmpi-{{ OPENMPI_VERSION }}
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
export MANPATH="$OMPI_DIR/share/man:$MANPATH"
%runscript
source /opt/rh/gcc-toolset-{{ GCC_VERSION }}/enable
/opt/osu-micro-benchmarks/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
To build:
apptainer build puhti-openmpi.sif puhti-openmpi.def
To run (some extra parameters are needed to prevent error messages):
$ module load openmpi/4.1.4
$ export PMIX_MCA_gds=hash
$ srun --account=project_XXXXXXX --partition=large --mem=2G --nodes=2-2 --ntasks-per-node=1 --time=00:10:00 apptainer run puhti-openmpi.sif
srun: job 23736111 queued and waiting for resources
srun: job 23736111 has been allocated resources
# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 5.17
2 10.47
4 20.89
8 41.63
16 82.00
32 166.40
64 310.73
128 477.56
256 1162.51
512 2250.29
1024 3941.94
2048 6174.39
4096 8029.47
8192 10120.93
16384 10632.41
32768 10892.60
65536 11609.92
131072 11778.05
262144 12015.96
524288 11970.93
1048576 12008.62
2097152 12050.35
4194304 12058.36
Bootstrap: docker
From: ubuntu:latest
%arguments
NPROCS=4
MPICH_VERSION=3.1.4
OSU_MICRO_BENCHMARKS_VERSION=7.4
%post
### Install MPICH dependencies
apt-get update
apt-get install -y file g++ gcc gfortran make gdb strace wget ca-certificates --no-install-recommends
# Build MPICH
wget -q http://www.mpich.org/static/downloads/{{ MPICH_VERSION }}/mpich-{{ MPICH_VERSION }}.tar.gz
tar xf mpich-{{ MPICH_VERSION }}.tar.gz
cd mpich-{{ MPICH_VERSION }}
./configure --disable-fortran --enable-fast=all,O3 --prefix=/usr
make -j{{ NPROCS }}
make install
ldconfig
# Build osu benchmarks
cd /opt
wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
tar xf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
cd osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}
./configure --prefix=/opt/osu-micro-benchmarks CC=mpicc CFLAGS=-O3
make -j{{ NPROCS }}
make install
cd ..
rm -rf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }} osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
%runscript
/opt/osu-micro-benchmarks/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
Building images is not allowed on LUMI, so you need to build this on your own laptop or some other machine:
apptainer build lumi-mpich.sif lumi-mpich.def
Afterwards, copy the image to your work directory on LUMI.
To use the fast interconnect you need to install the singularity-bindings module with EasyBuild:
module load LUMI/23.09 EasyBuild-user
eb singularity-bindings-system-cpeGNU-23.09-noglibc.eb -r
To run the example:
$ module load LUMI/23.09 EasyBuild-user singularity-bindings
$ export SINGULARITY_BIND=$SINGULARITY_BIND,/usr/lib64/libnl-3.so.200
$ srun --account=project_XXXXXXXXX --partition=dev-g --mem=2G --nodes=2-2 --ntasks-per-node=1 --time=00:10:00 singularity run lumi-mpich.sif
srun: job 8108520 queued and waiting for resources
srun: job 8108520 has been allocated resources
# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 2.03
2 4.09
4 8.17
8 16.23
16 32.64
32 65.57
64 130.49
128 260.55
256 492.28
512 983.37
1024 1965.42
2048 3924.00
4096 7823.52
8192 14349.54
16384 17373.03
32768 18896.90
65536 20906.04
131072 21811.68
262144 22228.01
524288 22430.80
1048576 22537.82
2097152 22592.50
4194304 22619.96
Utilizing the fast interconnects
To get the fast interconnects to work with the hybrid model, one can either:
Install the interconnect drivers into the image and build the MPI against them. This is the normal hybrid approach described in Figure 3.
Mount the cluster's MPI and other network libraries into the image and use them instead of the container's MPI when running the program. This is described in Figure 4.
Below are explanations of how the interconnect libraries were provided in each example.
The interconnect support was provided by the libucx-dev package, which provides InfiniBand drivers (triton-openmpi.def, line 15):
apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev
The OpenMPI installation was then configured to use these drivers (triton-openmpi.def, line 26):
./configure --prefix=/opt/ompi --with-ucx=/usr
The interconnect support is provided by installing drivers from Mellanox's InfiniBand driver repository (puhti-openmpi.def, lines 27-38):
# Enable Mellanox OFED rpm repo
wget https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
rpm --import RPM-GPG-KEY-Mellanox
rm RPM-GPG-KEY-Mellanox
cd /etc/yum.repos.d/
wget https://linux.mellanox.com/public/repo/mlnx_ofed/{{ OFED_VERSION }}/{{ OS_NAME }}{{ OS_VERSION }}/mellanox_mlnx_ofed.repo
cd /
# Install network library components
dnf -y install rdma-core ucx-ib-{{ UCX_VERSION }} ucx-devel-{{ UCX_VERSION }} ucx-knem-{{ UCX_VERSION }} ucx-cma-{{ UCX_VERSION }} ucx-rdmacm-{{ UCX_VERSION }}
The singularity-bindings module mounts the system MPI and network drivers into the container:
$ module load LUMI/23.09 EasyBuild-user singularity-bindings
$ export SINGULARITY_BIND=$SINGULARITY_BIND,/usr/lib64/libnl-3.so.200
$ echo $SINGULARITY_BIND
/opt/cray,/var/spool,/etc/host.conf,/etc/hosts,/etc/nsswitch.conf,/etc/resolv.conf,/etc/ssl/openssl.cnf,/run/cxi,/usr/lib64/libbrotlicommon.so.1,/usr/lib64/libbrotlidec.so.1,/usr/lib64/libcrypto.so.1.1,/usr/lib64/libcurl.so.4,/usr/lib64/libcxi.so.1,/usr/lib64/libgssapi_krb5.so.2,/usr/lib64/libidn2.so.0,/usr/lib64/libjansson.so.4,/usr/lib64/libjitterentropy.so.3,/usr/lib64/libjson-c.so.3,/usr/lib64/libk5crypto.so.3,/usr/lib64/libkeyutils.so.1,/usr/lib64/libkrb5.so.3,/usr/lib64/libkrb5support.so.0,/usr/lib64/liblber-2.4.so.2,/usr/lib64/libldap_r-2.4.so.2,/usr/lib64/libnghttp2.so.14,/usr/lib64/libpcre.so.1,/usr/lib64/libpsl.so.5,/usr/lib64/libsasl2.so.3,/usr/lib64/libssh.so.4,/usr/lib64/libssl.so.1.1,/usr/lib64/libunistring.so.2,/usr/lib64/libzstd.so.1,/lib64/libselinux.so.1,,/usr/lib64/libnl-3.so.200
$ echo $SINGULARITYENV_LD_LIBRARY_PATH
/opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/lib64:/opt/cray/libfabric/1.15.2.0/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64:/opt/cray/pe/gcc-libs
Interconnect support is not explicitly installed.
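In shell terms, the bind approach boils down to two environment variables: one that mounts host paths into the container and one that points the container's loader at the host libraries. This is a minimal sketch of what the singularity-bindings module does on LUMI; the library path below is one of those the module sets there.

```shell
#!/bin/bash
# Minimal sketch of the bind approach: expose the host's MPI to the
# container and point the container's dynamic loader at it. The path
# is taken from the singularity-bindings setup on LUMI; other
# clusters use different paths.
HOST_MPI_LIB="/opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib-abi-mpich"
# Append to any existing bind list without introducing a stray comma.
export SINGULARITY_BIND="${SINGULARITY_BIND:+${SINGULARITY_BIND},}/opt/cray"
export SINGULARITYENV_LD_LIBRARY_PATH="${HOST_MPI_LIB}"
echo "LD_LIBRARY_PATH inside container: ${SINGULARITYENV_LD_LIBRARY_PATH}"
```

With these set, singularity run would see the host's MPICH instead of any MPI baked into the image.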
ABI compatibility in MPI
Different MPI installations do not necessarily have application binary interface (ABI) compatibility. This means that software built with one MPI installation does not necessarily run with another MPI installation.
Quite often MPI programs are built with the same version of MPI that will be used to run the program. However, in containerized applications the runtime MPI version might change if an outside MPI is bound into the container.
This can work because there is some ABI compatibility within an MPI family (OpenMPI, MPICH). For more info, see OpenMPI's page on version compatibility and MPICH's ABI Compatibility Initiative.
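A rough heuristic for the rule above can be sketched in shell: two installations are candidates for swapping only when they come from the same family. This is purely an illustration, not an authoritative ABI check; real compatibility also depends on the version ranges each family documents.

```shell
#!/bin/bash
# Heuristic sketch only: compare the MPI family names in two
# "family version" strings. Real ABI compatibility additionally
# depends on the specific version ranges involved.
same_mpi_family() {
    # Strip the version number; compare the remaining family name.
    f1=$(echo "$1" | sed 's/ [0-9].*//')
    f2=$(echo "$2" | sed 's/ [0-9].*//')
    [ "$f1" = "$f2" ]
}

same_mpi_family "Open MPI 4.1.6" "Open MPI 4.1.4" && echo "possibly compatible"
same_mpi_family "Open MPI 4.1.6" "MPICH 3.1.4" || echo "incompatible families"
```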
There are also projects like E4S Container Launcher and WI4MPI (Wrapper Interface for MPI) that aim to bypass this problem by creating a wrapper interface that the program in the container can be built against. This wrapper can then use different MPI implementations at runtime.
Example on portability: LAMMPS
LAMMPS is a classical molecular dynamics simulation code with a focus on materials modeling.
Let’s build a container with LAMMPS in it:
Bootstrap: docker
From: ubuntu:latest
%arguments
NPROCS=4
OPENMPI_VERSION=4.1.6
LAMMPS_VERSION=29Aug2024
%post
### Install OpenMPI dependencies
apt-get update
apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev
### Build OpenMPI
OPENMPI_VERSION_SHORT=$(echo {{ OPENMPI_VERSION }} | cut -f 1-2 -d '.')
cd /opt
mkdir ompi
wget -q https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION_SHORT}/openmpi-{{ OPENMPI_VERSION }}.tar.bz2
tar -xvf openmpi-{{ OPENMPI_VERSION }}.tar.bz2
# Compile and install
cd openmpi-{{ OPENMPI_VERSION }}
./configure --prefix=/opt/ompi --with-ucx=/usr
make -j{{ NPROCS }}
make install
cd ..
rm -rf openmpi-{{ OPENMPI_VERSION }} openmpi-{{ OPENMPI_VERSION }}.tar.bz2
### Build example application
# Install LAMMPS dependencies
apt-get install -y cmake
export OMPI_DIR=/opt/ompi
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
export CMAKE_PREFIX_PATH="$OMPI_DIR:$CMAKE_PREFIX_PATH"
# Build LAMMPS
cd /opt
wget -q https://download.lammps.org/tars/lammps-{{ LAMMPS_VERSION }}.tar.gz
tar xf lammps-{{ LAMMPS_VERSION }}.tar.gz
cd lammps-{{ LAMMPS_VERSION }}
cmake -S cmake -B build \
-DCMAKE_INSTALL_PREFIX=/opt/lammps \
-DBUILD_MPI=yes \
-DBUILD_OMP=yes
cmake --build build --parallel {{ NPROCS }} --target install
cp -r examples /opt/lammps/examples
cd ..
rm -rf lammps-{{ LAMMPS_VERSION }} lammps-{{ LAMMPS_VERSION }}.tar.gz
%environment
export OMPI_DIR=/opt/ompi
export PATH="$OMPI_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
export MANPATH="$OMPI_DIR/share/man:$MANPATH"
export LAMMPS_DIR=/opt/lammps
export PATH="$LAMMPS_DIR/bin:$PATH"
export LD_LIBRARY_PATH="$LAMMPS_DIR/lib:$LD_LIBRARY_PATH"
export MANPATH="$LAMMPS_DIR/share/man:$MANPATH"
%runscript
exec /opt/lammps/bin/lmp "$@"
Let’s also create a submission script that runs a LAMMPS example where an indenter pushes into a material:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --output=lammps_indent.out
# Copy example from image
apptainer exec lammps-openmpi.sif cp -r /opt/lammps/examples/indent .
cd indent
# Load OpenMPI module
module load openmpi
# Run simulation
srun apptainer run ../lammps-openmpi.sif -in in.indent
This exact same container can now be run on both Triton and Puhti, because both clusters have OpenMPI installed, use Slurm, and have InfiniBand interconnects.
To build the image:
$ srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build lammps-openmpi.sif lammps-openmpi.def
To run the example:
$ export PMIX_MCA_gds=hash
$ export UCX_POSIX_USE_PROC_LINK=n
$ export OMPI_MCA_orte_top_session_dir=/tmp/$USER/openmpi
$ sbatch run_lammps_indent.sh
$ tail -n 27 lammps_indent.out
Loop time of 0.752293 on 4 procs for 30000 steps with 420 atoms
Performance: 10336396.152 tau/day, 39878.072 timesteps/s, 16.749 Matom-step/s
99.6% CPU use with 4 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.31927 | 0.37377 | 0.42578 | 7.5 | 49.68
Neigh | 0.016316 | 0.020162 | 0.023961 | 2.3 | 2.68
Comm | 0.19882 | 0.25882 | 0.31814 | 10.2 | 34.40
Output | 0.00033215 | 0.00038609 | 0.00054361 | 0.0 | 0.05
Modify | 0.044981 | 0.049941 | 0.054024 | 1.7 | 6.64
Other | | 0.04921 | | | 6.54
Nlocal: 105 ave 112 max 98 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Nghost: 92.5 ave 96 max 89 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Neighs: 892.25 ave 1003 max 788 min
Histogram: 2 0 0 0 0 0 0 0 1 1
Total # of neighbors = 3569
Ave neighs/atom = 8.497619
Neighbor list builds = 634
Dangerous builds = 0
Total wall time: 0:00:01
To build the image:
$ apptainer build lammps-openmpi.sif lammps-openmpi.def
To run the example:
$ export PMIX_MCA_gds=hash
$ sbatch --account=project_XXXXXXX --partition=large run_lammps_indent.sh
$ tail -n 27 lammps_indent.out
Loop time of 0.527178 on 4 procs for 30000 steps with 420 atoms
Performance: 14750222.558 tau/day, 56906.723 timesteps/s, 23.901 Matom-step/s
99.7% CPU use with 4 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0.22956 | 0.26147 | 0.28984 | 5.5 | 49.60
Neigh | 0.012471 | 0.015646 | 0.018613 | 2.3 | 2.97
Comm | 0.13816 | 0.17729 | 0.2192 | 9.3 | 33.63
Output | 0.00023943 | 0.00024399 | 0.00025267 | 0.0 | 0.05
Modify | 0.03212 | 0.035031 | 0.037378 | 1.2 | 6.65
Other | | 0.0375 | | | 7.11
Nlocal: 105 ave 112 max 98 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Nghost: 92.5 ave 96 max 89 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Neighs: 892.25 ave 1003 max 788 min
Histogram: 2 0 0 0 0 0 0 0 1 1
Total # of neighbors = 3569
Ave neighs/atom = 8.497619
Neighbor list builds = 634
Dangerous builds = 0
Total wall time: 0:00:01
Review of this session
Key points to remember
The MPI version in the container should match the version installed on the cluster
The cluster's MPI module should be loaded for maximum compatibility with job launching
Care must be taken to ensure that the container utilizes the fast interconnects