
Setting up Python environments on LUMI: Or why you shouldn't use the LUMI container wrapper

This guide provides instructions on how to create custom Python environments on the EuroHPC LUMI supercomputer in a way that ensures inter-node communication works properly. This is essential for running multi-node machine learning jobs on LUMI with frameworks such as PyTorch, TensorFlow, or JAX.

All the information provided here is also available in the official LUMI documentation, but because it is scattered across multiple pages, the relevant details can be difficult to find. This guide consolidates that information into a clear, step-by-step process for setting up your custom environment.

All the scripts used in this guide are available in this GitHub repository, so the setup can be easily tested on LUMI.

📖 Table of Contents

  1. Motivation
  2. Building a Python environment with Cotainr
  3. Running jobs with the environment
  4. Contact

🤔 Motivation

Many users move to LUMI from CSC clusters like Mahti and Puhti, where it is convenient to create custom Python environments using the Tykky tool for wrapping Conda environments in Singularity containers.

A similar tool, LUMI container wrapper, is also available on LUMI and is tempting to use because of the familiar workflow. However, using the container wrapper on LUMI has significant drawbacks, especially for multi-node training jobs. Let's consider an example of training a PyTorch model using Distributed Data Parallel (DDP) across two nodes. Here are the epoch durations with environments created in two different ways:

# With an environment created with LUMI container wrapper
Epoch 1: 100%|██████████| 2442/2442 [06:20<00:00,  6.42it/s]

# With an environment created following this guide
Epoch 0: 100%|██████████| 2442/2442 [00:49<00:00, 49.25it/s]

As you can see, the run with the LUMI container wrapper environment is ~7x slower! But why?

The reason is that fast node-to-node communication on LUMI requires the AWS OFI plugin for RCCL, AMD's GPU collective communication library and the counterpart of NVIDIA's NCCL. The plugin lets RCCL use LUMI's Slingshot-11 interconnect, which RCCL does not support out of the box. LUMI container wrapper installs start from a minimal base image that does not include the AWS OFI plugin. While it is possible to inject the plugin manually, it is easier to build Python environments on top of the pre-configured ROCm images that already include it. This can be achieved using the Cotainr tool available on LUMI.
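
If you want to check that a training job is actually picking up the plugin, RCCL reuses NCCL's environment variables, so you can enable debug logging in your job script before the srun line (the run script is shown later in this guide). This is only a quick diagnostic sketch; the exact wording of the log lines depends on the RCCL and plugin versions:

# Make RCCL print which network backend it selected (RCCL honors the NCCL_* variables)
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

With the plugin in place, the initialization messages in the job log should mention the OFI plugin and the Slingshot (cxi) libfabric provider rather than a fallback to plain TCP sockets.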

🔨 Building a Python environment with Cotainr

To build the example environment on LUMI, clone the example repository and run the create_environment.sh script:

cd /scratch/<your_LUMI_project_id>
git clone https://github.com/lasuomela/LUMI-Project-Template.git
cd LUMI-Project-Template

sbatch -A <your_LUMI_project_id> slurm_tools/create_environment.sh

The script creates a new Singularity image with the Conda and pip packages specified in environment.yml on top of a ROCm base image provided by LUMI.

Additionally, it creates a SquashFS file that contains a Python virtual environment (venv) with the packages that cannot be listed in environment.yml, for example an editable install of the package you are developing or packages that must be installed with the --no-deps flag. The SquashFS file is mounted inside the Singularity image at runtime to make these packages available.
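
For example, once the build has finished you can list what ended up in the venv by mounting the SquashFS file into the image, the same way the run script later in this guide does. This is a sketch that assumes the INSTALL_DIR and image name used in create_environment.sh:

ENV_DIR=/projappl/<your_LUMI_project_id>/pytorch_example

# List the packages visible to the venv's Python (venv plus system site-packages)
singularity exec \
    -B $ENV_DIR/myenv.sqsh:/user-software:image-src=/ \
    $ENV_DIR/pytorch_example.sif \
    bash -c 'export PATH=/user-software/bin:$PATH && python -m pip list'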

The environment build is done on a compute node to ensure sufficient RAM and CPU. The build process takes some time; you can monitor its progress in the logs/log_build.out and logs/log_build.err files generated by SLURM.
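
In practice the workflow looks like this (the mkdir is only a precaution in case the logs directory referenced by the #SBATCH directives does not exist yet):

# Make sure the directory for the SLURM log files exists
mkdir -p logs

# Submit the build job
sbatch -A <your_LUMI_project_id> slurm_tools/create_environment.sh

# Check that the job is queued or running
squeue --me

# Follow the build output as it is written
tail -f logs/log_build.out logs/log_build.err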

Here you can see the content of create_environment.sh for reference:

#!/bin/bash -l
#SBATCH --job-name=lumi_env_build     # Job name
#SBATCH --output=logs/log_build.out      # Name of stdout output file
#SBATCH --error=logs/log_build.err       # Name of stderr error file
#SBATCH --partition=small               # partition name
#SBATCH --nodes=1                       # Total number of nodes 
#SBATCH --ntasks-per-node=1             # MPI ranks per node
#SBATCH --cpus-per-task=64
#SBATCH --mem=256G                      # Total memory for job
#SBATCH --time=0-00:20:00               # Run time (d-hh:mm:ss)

# Reference:
# https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/blob/main/07_Extending_containers_with_virtual_environments_for_faster_testing/examples/extending_containers_with_venv.md

# Name of the image to create
IMAGE_NAME=pytorch_example.sif

# Path to the package being developed
PKG_DIR=/scratch/$SLURM_JOB_ACCOUNT/LUMI-Project-Template

# Where to store the image
INSTALL_DIR=/projappl/$SLURM_JOB_ACCOUNT/pytorch_example

# Path to your conda environment file
ENV_FILE_PATH=$PKG_DIR/environment.yml

# Path to the environment base image. Choose a ROCm version that matches your needs.
# On the base images, RCCL is properly configured to use the high-speed Slingshot-11 interconnect
# between nodes. This ensures optimal performance when training across multiple nodes.
BASE_IMAGE_PATH=/appl/local/containers/sif-images/lumi-rocm-rocm-6.2.4.sif

# Remove the old environment
if [ -d "$INSTALL_DIR" ]; then
    rm -rf $INSTALL_DIR
fi
mkdir -p $INSTALL_DIR

# Purge modules and load cotainr module
module purge
module load LUMI/24.03 cotainr

# Install the conda/pip dependencies specified in the .yml file
srun cotainr build $INSTALL_DIR/$IMAGE_NAME \
    --base-image=$BASE_IMAGE_PATH \
    --conda-env=$ENV_FILE_PATH \
    --accept-license

# Stuff beyond here is optional but useful

# Load modules needed for running Singularity containers
module use  /appl/local/containers/ai-modules
module load singularity-AI-bindings

# Create a virtual environment to install stuff that cannot be installed via conda
# such as an editable install of the package being developed
# or packages you want to install with the --no-deps flag
singularity exec $INSTALL_DIR/$IMAGE_NAME bash -c "
  python -m venv $INSTALL_DIR/myenv --system-site-packages &&
  source $INSTALL_DIR/myenv/bin/activate &&
  pip install git+https://github.com/bdaiinstitute/theia.git --no-deps &&
  pip install -e $PKG_DIR &&
  deactivate
"

# Create a SquashFS image of the virtual environment
mksquashfs $INSTALL_DIR/myenv $INSTALL_DIR/myenv.sqsh
rm -rf $INSTALL_DIR/myenv

and the environment.yml:

# An example PyTorch environment for LUMI
name: pytorch_example
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pip
  - lightning>=2.5.1
  - pip:
    # The PyTorch ROCm version here should match the ROCm version
    # of the base image you chose in create_environment.sh
    - --extra-index-url https://download.pytorch.org/whl/rocm6.2.4
    - torch==2.6.0+rocm6.2.4
    - torchvision==0.21.0+rocm6.2.4
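
Before submitting a full training job, you can do a quick interactive sanity check of the built image on a login node. This is a minimal sketch that assumes the INSTALL_DIR and image name from create_environment.sh; it should print the PyTorch version together with the ROCm/HIP version it was built against:

ENV_DIR=/projappl/<your_LUMI_project_id>/pytorch_example

# Check that torch imports and is a ROCm build
singularity exec \
    -B $ENV_DIR/myenv.sqsh:/user-software:image-src=/ \
    $ENV_DIR/pytorch_example.sif \
    bash -c 'export PATH=/user-software/bin:$PATH &&
             python -c "import torch; print(torch.__version__, torch.version.hip)"'

Note that the login nodes have no GPUs, so checks like torch.cuda.is_available() only make sense inside a GPU job.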

🚀 Running jobs with the environment

Once the environment is built, you can run the example PyTorch job like this:

sbatch -A <your_LUMI_project_id> slurm_tools/run.sh

You can change the number of GPUs and nodes in the run.sh script. If everything is set up correctly, you should see fast epoch times similar to the ones shown in the Motivation section, and the epoch times should decrease ~linearly as you increase the number of nodes used for training.
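
Command-line options passed to sbatch override the corresponding #SBATCH directives in the script, so you can also scale the job out without editing run.sh. For example, to use four nodes instead of two:

sbatch -A <your_LUMI_project_id> --nodes=4 slurm_tools/run.sh

run.sh picks the node and GPU counts up from SLURM_JOB_NUM_NODES and SLURM_GPUS_ON_NODE, so no changes inside the script are needed.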

Here is the content of run.sh for reference:

#!/bin/bash -l
#SBATCH --job-name=pytorch_example     # Job name
#SBATCH --output=logs/log_test.out      # Name of stdout output file
#SBATCH --error=logs/log_test.err       # Name of stderr error file
#SBATCH --partition=dev-g               # partition name
#SBATCH --nodes=2                       # Total number of nodes 
#SBATCH --ntasks-per-node=1             # MPI ranks per node
#SBATCH --gpus-per-node=1               # Allocate one gpu per MPI rank
#SBATCH --cpus-per-task=7               # CPU cores per task
#SBATCH --mem=480G                      # Total memory for job
#SBATCH --time=0-02:00:00               # Run time (d-hh:mm:ss)

# Reference:
# https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop/blob/main/07_Extending_containers_with_virtual_environments_for_faster_testing/examples/extending_containers_with_venv.md

# Path to the environment. Same as INSTALL_DIR in create_environment.sh
ENV_DIR=/projappl/$SLURM_JOB_ACCOUNT/pytorch_example

# Name of the image to use
IMAGE_NAME=pytorch_example.sif

# Load the required modules
module purge
module load LUMI
module use  /appl/local/containers/ai-modules
module load singularity-AI-bindings

source ~/.bashrc
export SINGULARITYENV_PREPEND_PATH=/user-software/bin # make the venv mounted at /user-software visible inside the container

# Tell RCCL to use only Slingshot interfaces and GPU RDMA
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB

# Run the training script:
# We use the Singularity container from 'create_environment.sh'
# with the --bind option to mount the virtual environment in $ENV_DIR/myenv.sqsh
# into the container at /user-software.
#
# The number of GPUs and nodes are auto-detected from the SLURM environment variables.
srun singularity exec \
   -B $ENV_DIR/myenv.sqsh:/user-software:image-src=/ $ENV_DIR/$IMAGE_NAME \
    python -m pytorch_example.run \
        --num_gpus=$SLURM_GPUS_ON_NODE \
        --num_nodes=$SLURM_JOB_NUM_NODES

# Bonus:
# With full-node allocations, i.e. running jobs on standard-g, or on small-g with the Slurm argument --exclusive,
# it is beneficial to set the CPU bindings (see https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/distribution-binding/#gpu-binding)

# # Define CPU binding for optimal performance in full-node allocations
# CPU_BIND="mask_cpu:fe000000000000,fe00000000000000"
# CPU_BIND="${CPU_BIND},fe0000,fe000000"
# CPU_BIND="${CPU_BIND},fe,fe00"
# CPU_BIND="${CPU_BIND},fe00000000,fe0000000000"

# srun --cpu-bind=$CPU_BIND singularity exec \
#    -B $ENV_DIR/myenv.sqsh:/user-software:image-src=/ $ENV_DIR/$IMAGE_NAME \
#     python -m pytorch_example.run \
#         --num_gpus=$SLURM_GPUS_ON_NODE \
#         --num_nodes=$SLURM_JOB_NUM_NODES

Contact

That's it! If you have any questions or run into issues, feel free to reach out to lauri.a.suomela@tuni.fi or open an issue in the GitHub repository.