MPI programs and containers

Objectives

  • Learn what complications are involved with MPI containers

  • Learn how to generate an MPI container for your HPC system

What to consider when creating a container for MPI programs?

Message Passing Interface (MPI) is a standardized API and programming paradigm in which programs use MPI calls to send messages between thousands of processes. It is commonly used in traditional HPC workloads.

To handle the scale of MPI programs, MPI installations are typically tied to the high-speed interconnect available in the compute cluster and to the queue system that the cluster uses.

This can create the following problems when an MPI program is containerized:

  1. Launching the MPI job can fail if the containerized program does not communicate with the queue system.

  2. MPI communication performance can be poor if the program does not utilize the high-speed interconnects correctly.

  3. The container can have portability issues when it is moved to a different cluster with a different MPI, queue system or interconnect.

To solve these problems, we first need to know how MPI works.

How MPI works

The launch process for an MPI program works like this:

  1. A reservation for a number of MPI tasks is made in the queue system.

  2. When the reservation gets its resources, the individual MPI tasks are launched by the queue system (e.g. srun) or by an MPI launcher (e.g. mpirun).

  3. The user’s MPI program calls the MPI libraries it was built against.

  4. These libraries ask the queue system how many other MPI tasks there are.

  5. Individual MPI tasks start running the program collectively. Communication between tasks is done via fast interconnects.
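
On a Slurm cluster, the first two steps could look roughly like the following sketch (the program name and task counts are placeholders, not part of this lesson's examples):

# Reserve resources and launch the tasks directly via the queue system
$ srun --nodes=2 --ntasks-per-node=4 ./my_mpi_program

# Or create the reservation first and use the MPI launcher inside it
$ salloc --nodes=2 --ntasks-per-node=4
$ mpirun -n 8 ./my_mpi_program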

Figure 1: How MPI programs launch

To make this work with different queue systems and different interconnects, MPI installations often utilize the Process Management Interface (PMI/PMI2/PMIx) when they connect to the queue system and Unified Communication X (UCX) when they connect to the interconnects.
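
If you are unsure what your cluster provides, the following commands can help: srun --mpi=list shows which PMI types Slurm supports, and ucx_info (available where the UCX tools are installed) shows which transports and devices UCX can use.

# PMI types that Slurm's srun can offer to MPI libraries
$ srun --mpi=list

# UCX transports and network devices available on this node
$ ucx_info -d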

Figure 2: How MPI installations are usually constructed

How to use MPI with a container

The most common way of running MPI programs in containers is the hybrid model, where the container has the same MPI version installed as the host system.

When using this model, the MPI launcher calls the MPI inside the container and uses it to launch the application.

Figure 3: Hybrid MPI job launch

Do note that the MPI inside the container does not necessarily know how to utilize the fast interconnects. We’ll talk about solving this later.
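
In practice a hybrid launch looks like a normal MPI launch with the application command replaced by a container invocation. A rough sketch (image name is a placeholder):

# The queue system starts one container per MPI task; the MPI inside the
# container then handles the communication
$ module load openmpi
$ srun --nodes=2 --ntasks-per-node=4 apptainer run my_mpi_container.sif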

Creating a simple MPI container

Let’s construct an example container that runs a simple MPI benchmark from OSU Micro-Benchmarks. This benchmark suite is useful for testing whether the MPI installation works and whether the MPI can utilize the fast interconnect.

Because different sites have different MPI versions, the definition files differ as well. Pick the definition file for your site.

triton-openmpi.def:

Bootstrap: docker
From: ubuntu:latest

%arguments

  NPROCS=4
  OPENMPI_VERSION=4.1.6
  OSU_MICRO_BENCHMARKS_VERSION=7.4

%post

  ### Install OpenMPI dependencies

  apt-get update
  apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev

  ### Build OpenMPI

  OPENMPI_VERSION_SHORT=$(echo {{ OPENMPI_VERSION }} | cut -f 1-2 -d '.')
  cd /opt
  mkdir ompi
  wget -q https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION_SHORT}/openmpi-{{ OPENMPI_VERSION }}.tar.bz2
  tar -xvf openmpi-{{ OPENMPI_VERSION }}.tar.bz2
  # Compile and install
  cd openmpi-{{ OPENMPI_VERSION }}
  ./configure --prefix=/opt/ompi --with-ucx=/usr
  make -j{{ NPROCS }}
  make install
  cd ..
  rm -rf openmpi-{{ OPENMPI_VERSION }} openmpi-{{ OPENMPI_VERSION }}.tar.bz2

  ### Build example application
  
  export OMPI_DIR=/opt/ompi
  export PATH="$OMPI_DIR/bin:$PATH"
  export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"

  # Build osu benchmarks
  cd /opt
  wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
  tar xf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz
  cd osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}
  ./configure --prefix=/opt/osu-micro-benchmarks CC=/opt/ompi/bin/mpicc CFLAGS=-O3
  make -j{{ NPROCS }}
  make install
  cd ..
  rm -rf osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }} osu-micro-benchmarks-{{ OSU_MICRO_BENCHMARKS_VERSION }}.tar.gz

%environment
  export OMPI_DIR=/opt/ompi
  export PATH="$OMPI_DIR/bin:$PATH"
  export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
  export MANPATH="$OMPI_DIR/share/man:$MANPATH"

%runscript
  /opt/osu-micro-benchmarks/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

To build:

$ srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build triton-openmpi.sif triton-openmpi.def

To run (some extra environment variables are needed to prevent launch errors):

$ module load openmpi/4.1.6
$ export PMIX_MCA_gds=hash
$ export UCX_POSIX_USE_PROC_LINK=n
$ export OMPI_MCA_orte_top_session_dir=/tmp/$USER/openmpi
$ srun --partition=batch-milan --mem=2G --nodes=2-2 --ntasks-per-node=1 --time=00:10:00 apptainer run triton-openmpi.sif
srun: job 3521915 queued and waiting for resources
srun: job 3521915 has been allocated resources

# OSU MPI Bandwidth Test v7.4
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       3.98
2                       8.05
4                      15.91
8                      32.03
16                     64.24
32                    125.47
64                    245.52
128                   469.00
256                   877.69
512                  1671.24
1024                 3218.11
2048                 5726.91
4096                 8096.24
8192                10266.18
16384               11242.78
32768               11298.70
65536               12038.27
131072              12196.28
262144              12202.05
524288              11786.58
1048576             12258.48
2097152             12179.43
4194304             12199.89
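
To check whether the benchmark really used the fast interconnect rather than TCP, one option is to rerun it with UCX restricted to TCP and shared memory and compare the large-message bandwidth. UCX_TLS selects which transports UCX may use; this sketch assumes UCX is the transport layer, as in the definition file above.

# Restrict UCX to TCP and shared memory, then rerun the benchmark
$ export UCX_TLS=tcp,self,sm
$ srun --partition=batch-milan --mem=2G --nodes=2-2 --ntasks-per-node=1 --time=00:10:00 apptainer run triton-openmpi.sif

If the default run reports much higher large-message bandwidth than the TCP-restricted run, the fast interconnect is most likely being used.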

Utilizing the fast interconnects

To get the fast interconnects to work with the hybrid model, one can either:

  1. Install the interconnect libraries into the image and build the container’s MPI to use them. This is the normal hybrid approach described in Figure 3.

  2. Bind mount the cluster’s MPI and other network libraries into the container and use them instead of the container’s MPI when running the MPI program. This approach is described in Figure 4 and sketched below.

Figure 4: Container with bound system MPI and network libraries
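
How the binding is done is cluster-specific, but the general idea of option 2 is to bind the host’s MPI installation and interconnect libraries into the container and make the container’s dynamic linker prefer them. A rough sketch; all paths below are hypothetical and must be replaced with the actual locations on your cluster:

# Bind the host MPI and system libraries into neutral paths in the container
# (paths are examples only; check where they live on your cluster)
$ export APPTAINER_BIND="/appl/openmpi-4.1.6:/host/ompi,/usr/lib64:/host/lib64"
# Make the container's dynamic linker look there first
$ export APPTAINERENV_LD_LIBRARY_PATH="/host/ompi/lib:/host/lib64"
$ srun apptainer exec my_mpi_container.sif ./my_mpi_program

This only works if the program in the container is ABI compatible with the host MPI, which is discussed below.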

Below is an explanation of how the interconnect libraries were provided in the definition file above.

Interconnect support comes from the libucx-dev package, which provides the UCX communication framework with InfiniBand support.

triton-openmpi.def, line 15:

  apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev

The OpenMPI installation was then configured to use UCX:

triton-openmpi.def, line 26:

  ./configure --prefix=/opt/ompi --with-ucx=/usr
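
To check that the OpenMPI inside the container was really built with UCX support, you can query it with ompi_info (the exact output varies between versions, but UCX-related MCA components, such as the pml ucx component, should be listed):

$ apptainer exec triton-openmpi.sif ompi_info | grep -i ucx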

ABI compatibility in MPI

Different MPI installations do not necessarily have application binary interface (ABI) compatibility. This means that software built with one MPI installation does not necessarily run with another MPI installation.

Quite often MPI programs are built with the same version of MPI that will be used to run the program. However, in containerized applications the runtime MPI version might change if an outside MPI is bound into the container.

This can work as there is some ABI compatibility within an MPI family (OpenMPI, MPICH). For more info, see OpenMPI’s page on version compatibility and MPICH’s ABI Compatibility Initiative.
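
One practical way to see which MPI library a containerized program will actually load at runtime is to inspect its dynamic dependencies. For example, for the benchmark binary built above, the libmpi line should point to /opt/ompi in the plain hybrid setup and to the host’s MPI when host libraries are bound in:

$ apptainer exec triton-openmpi.sif ldd /opt/osu-micro-benchmarks/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw | grep libmpi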

There are also projects like E4S Container Launcher and WI4MPI (Wrapper Interface for MPI) that aim to bypass this problem by creating a wrapper interface that the program in the container can be built against. This wrapper can then use different MPI implementations at runtime.

Example on portability: LAMMPS

LAMMPS is a classical molecular dynamics simulation code with a focus on materials modeling.

Let’s build a container with LAMMPS in it:

lammps-openmpi.def:

Bootstrap: docker
From: ubuntu:latest

%arguments

  NPROCS=4
  OPENMPI_VERSION=4.1.6
  LAMMPS_VERSION=29Aug2024

%post

  ### Install OpenMPI dependencies

  apt-get update
  apt-get install -y wget bash gcc gfortran g++ make file bzip2 ca-certificates libucx-dev

  ### Build OpenMPI

  OPENMPI_VERSION_SHORT=$(echo {{ OPENMPI_VERSION }} | cut -f 1-2 -d '.')
  cd /opt
  mkdir ompi
  wget -q https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VERSION_SHORT}/openmpi-{{ OPENMPI_VERSION }}.tar.bz2
  tar -xvf openmpi-{{ OPENMPI_VERSION }}.tar.bz2
  # Compile and install
  cd openmpi-{{ OPENMPI_VERSION }}
  ./configure --prefix=/opt/ompi --with-ucx=/usr
  make -j{{ NPROCS }}
  make install
  cd ..
  rm -rf openmpi-{{ OPENMPI_VERSION }} openmpi-{{ OPENMPI_VERSION }}.tar.bz2

  ### Build example application

  # Install LAMMPS dependencies
  apt-get install -y cmake

  export OMPI_DIR=/opt/ompi
  export PATH="$OMPI_DIR/bin:$PATH"
  export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
  export CMAKE_PREFIX_PATH="$OMPI_DIR:$CMAKE_PREFIX_PATH"
 
  # Build LAMMPS
  cd /opt
  wget -q https://download.lammps.org/tars/lammps-{{ LAMMPS_VERSION }}.tar.gz
  tar xf lammps-{{ LAMMPS_VERSION }}.tar.gz
  cd lammps-{{ LAMMPS_VERSION }}
  cmake -S cmake -B build \
    -DCMAKE_INSTALL_PREFIX=/opt/lammps \
    -DBUILD_MPI=yes \
    -DBUILD_OMP=yes
  cmake --build build --parallel {{ NPROCS }} --target install
  cp -r examples /opt/lammps/examples
  cd ..
  rm -rf lammps-{{ LAMMPS_VERSION }} lammps-{{ LAMMPS_VERSION }}.tar.gz

%environment
  export OMPI_DIR=/opt/ompi
  export PATH="$OMPI_DIR/bin:$PATH"
  export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
  export MANPATH="$OMPI_DIR/share/man:$MANPATH"

  export LAMMPS_DIR=/opt/lammps
  export PATH="$LAMMPS_DIR/bin:$PATH"
  export LD_LIBRARY_PATH="$LAMMPS_DIR/lib:$LD_LIBRARY_PATH"
  export MANPATH="$LAMMPS_DIR/share/man:$MANPATH"

%runscript
  exec /opt/lammps/bin/lmp "$@"

Let’s also create a submission script that runs a LAMMPS example where an indenter is pushed against a material:

run_lammps_indent.sh:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --output=lammps_indent.out

# Copy example from image
apptainer exec lammps-openmpi.sif cp -r /opt/lammps/examples/indent .

cd indent

# Load OpenMPI module
module load openmpi

# Run simulation
srun apptainer run ../lammps-openmpi.sif -in in.indent

Now this exact same container can be run on both Triton and Puhti, which have OpenMPI installed, because both clusters use Slurm and InfiniBand interconnects.

To build the image:

$ srun --mem=16G --cpus-per-task=4 --time=01:00:00 apptainer build lammps-openmpi.sif lammps-openmpi.def

To run the example:

$ export PMIX_MCA_gds=hash
$ export UCX_POSIX_USE_PROC_LINK=n
$ export OMPI_MCA_orte_top_session_dir=/tmp/$USER/openmpi
$ sbatch run_lammps_indent.sh
$ tail -n 27 lammps_indent.out
Loop time of 0.752293 on 4 procs for 30000 steps with 420 atoms

Performance: 10336396.152 tau/day, 39878.072 timesteps/s, 16.749 Matom-step/s
99.6% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.31927    | 0.37377    | 0.42578    |   7.5 | 49.68
Neigh   | 0.016316   | 0.020162   | 0.023961   |   2.3 |  2.68
Comm    | 0.19882    | 0.25882    | 0.31814    |  10.2 | 34.40
Output  | 0.00033215 | 0.00038609 | 0.00054361 |   0.0 |  0.05
Modify  | 0.044981   | 0.049941   | 0.054024   |   1.7 |  6.64
Other   |            | 0.04921    |            |       |  6.54

Nlocal:            105 ave         112 max          98 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Nghost:           92.5 ave          96 max          89 min
Histogram: 1 0 1 0 0 0 0 1 0 1
Neighs:         892.25 ave        1003 max         788 min
Histogram: 2 0 0 0 0 0 0 0 1 1

Total # of neighbors = 3569
Ave neighs/atom = 8.497619
Neighbor list builds = 634
Dangerous builds = 0
Total wall time: 0:00:01

Review of this session

Key points to remember

  • The MPI version in the container should match the version installed on the cluster

  • The cluster’s MPI module should be loaded for maximum compatibility when launching jobs

  • Care must be taken to make certain that the container utilizes fast interconnects