SLURM Cluster Quickstart Guide

This guide provides a step-by-step workflow for setting up an Enroot container, configuring the environment, running interactive sessions, submitting multi-node batch jobs, and managing workloads on an HPC cluster with SLURM, so that you can use the GPU resources effectively.

1. Create Enroot Container

This step creates a lightweight, self-contained environment from an existing `.sqsh` image. Different image releases target different PyTorch and Python versions, so please use the version that matches your project to make things run smoothly.

If your preferred version is not available, please reach out to us via email: ki-servicezentrum(at)hpi.de

NOTE: Always name your Enroot container pyxis_<your_container_name>; the pyxis_ prefix is required so the container can be used with Pyxis in a multi-node setting.

enroot create -n pyxis_torch_2412 /sc/home/<username>/nvidia-pytorch-24.12-py312.sqsh

Use the following command to list your containers; you should see the newly created container pyxis_torch_2412:

enroot list
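
The listing should now include pyxis_torch_2412. If you later want to delete a container and recreate it from a different `.sqsh` image, Enroot provides a remove command (a minimal sketch; adjust the container name as needed):

enroot remove pyxis_torch_2412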

2. Environment Setup

# Allocate one GPU to be able to start the Enroot container
srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=aisc --account=aisc --export=ALL --pty bash
 
# Start Enroot Container mounting the current working directory
enroot start --mount $(pwd):/workspace pyxis_torch_2412
 
# Create virtual environment
python -m venv venv
 
# Activate and install dependencies
source venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-slurm.txt
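
A quick way to verify the setup is to check that the GPU is visible from inside the container. This is a minimal sketch, assuming torch is importable in the active environment (e.g. installed via requirements-slurm.txt or available from the container image):

# Check driver and GPU visibility
nvidia-smi

# Check that PyTorch can see the allocated GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"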

3. Interactive Sessions

While developing, you can request one node (up to 8 GPUs on that node) to reduce overhead compared to a multi-node environment.

  • Why it’s Useful: Enables quick debugging and interactive experimentation.
  • What to Watch Out For: Pay attention to time limits (`--time=01:00:00`) and GPU requests so you don’t get preempted or blocked by the scheduler.

Single GPU (1hr)

srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=aisc-interactive --account=aisc --export=ALL --pty bash

You can also see the available GPUs by running:

gpualloc
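
If one GPU is not enough, the same srun call can request more resources on the node, for example four GPUs for two hours (a sketch; stay within the single-node limit of 8 GPUs and the 8-hour limit of aisc-interactive):

srun --nodes=1 --ntasks=1 --gpus=4 --time=02:00:00 --partition=aisc-interactive --account=aisc --export=ALL --pty bash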

3.1. Queues

Important:

At the moment, AISC users may use the following SLURM partitions (queues); see the sinfo sketch below for how to inspect them:

  • aisc-interactive: interactive jobs only, limited to 8 hours max (highest job priority)
  • aisc: batch jobs only, limited to 5 days max (medium job priority)
  • aisc-longrun: batch jobs only, limited to 14 days max (lowest job priority)
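
To check the current state and limits of these partitions yourself, you can query SLURM directly (a sketch using standard SLURM commands):

# Overview of the AISC partitions: availability, time limit, node states
sinfo -p aisc-interactive,aisc,aisc-longrun

# Full configuration of a single partition, including MaxTime and priority
scontrol show partition aisc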

4. SLURM Batch Multi-Node Job Template

Use this when you need to scale beyond a single node. SLURM handles node allocation and job scheduling, while Enroot and Pyxis are used so that processes in the containers can communicate across nodes via MPI.

  • What to Watch Out For:
    • Ensure your `--nodes` and `--gpus` settings match the actual cluster resources (see the sinfo sketch after this list).
    • Confirm the container mounts `/dev/infiniband` so that InfiniBand can be used for fast inter-node communication.
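
One way to verify what is actually available before adjusting the resource requests is to ask SLURM for the node and GPU (GRES) layout of the partition (a sketch using standard sinfo options):

# List the nodes of the aisc partition with their CPU counts and GPUs (GRES)
sinfo -N -p aisc -o "%N %c %G"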

Create the job script:

vim /home/<username>/my_first_job.slurm

Add the following to the file and adjust it as needed:

#!/bin/bash -eux 
 
# ==============================
# SLURM Job Configuration 
# (Adjust to your Configuration)
# ==============================
 
#SBATCH --job-name=MyFirstJob
#SBATCH --nodes=2
#SBATCH --gpus-per-node=5
#SBATCH --ntasks-per-node=1
#SBATCH --output=logs/%j/debug_output.log
#SBATCH --error=logs/%j/debug_error.log
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --container-writable
 
# Paths (Adjust to your configuration)
export SHARED_STORAGE_ROOT=/home/<username>/
export CONTAINER_WORKSPACE_MOUNT=${SHARED_STORAGE_ROOT}/<Your_Project_Folder>/train
export CONTAINER_NAME=torch_2412
 
# NCCL Configuration for Distributed Training (Don't Change)
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_FILE="/workspace/nccl_logs/${SLURM_JOB_ID}_%h_%p.txt"
export SLURM_DEBUG=verbose
export CUDA_DEVICE_ORDER=PCI_BUS_ID
 
# Distributed Training Settings (Don't Change)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=${SLURM_GPUS_PER_NODE:-8}
export NNODES=${SLURM_NNODES:-1}
export NUM_PROCESSES=$(( NNODES * GPUS_PER_NODE ))
export MULTIGPU_FLAG="--multi_gpu"
 
# Hugging Face Cache Directory
export HF_HOME=${SHARED_STORAGE_ROOT}/.huggingface
 
# Adjust MULTIGPU_FLAG for Single Node
if [[ "$NNODES" -eq "1" ]]; then
    export MULTIGPU_FLAG=""
fi
 
# ==============================
# Job Information
# ==============================
 
echo "===== Job Information ====="
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "GPUS_PER_NODE: $GPUS_PER_NODE"
echo "NNODES: $NNODES"
echo "NUM_PROCESSES: $NUM_PROCESSES"
echo "MULTIGPU_FLAG: $MULTIGPU_FLAG"
echo "HF_HOME: $HF_HOME"
echo "==========================="
 
# ==============================
# System Information
# ==============================
 
# Print per-node information (single quotes so the variables expand on each node, not on the submit host)
srun hostname
srun bash -c 'echo "CPUs on Node: $SLURM_CPUS_ON_NODE"'
srun bash -c 'echo "Node ID: $SLURM_NODEID"'
 
# ==============================
# Run Training
# ==============================
 
srun -l \
    --container-name "$CONTAINER_NAME" \
    --container-mounts "$CONTAINER_WORKSPACE_MOUNT:/workspace,/dev/infiniband:/dev/infiniband" \
    --container-writable \
    --container-workdir /workspace \
    --container-mount-home \
    --export=ALL \
    --nodes=$NNODES \
    --ntasks=$NNODES \
    --ntasks-per-node=1 \
    --verbose \
     ./launcher.sh

Save and quit Vim:

ESC
:wq

Create a launcher script so that the worker nodes wait until the master node's rendezvous endpoint is reachable instead of failing with a lost connection:

vim /home/<username>/launcher.sh

Add the following:

#!/bin/bash
set -e
 
source venv/bin/activate
current_host=$(hostname)
 
# Adjust the command to your configuration
base_cmd=(
    torchrun 
    --rdzv_backend=c10d 
    --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" 
    --nnodes="${NNODES}" 
    --nproc_per_node="${GPUS_PER_NODE}" 
    main.py 
    --config_path config/train-dist.yaml
)
 
log_info() {
    echo "[$(date '+%T') - ${current_host}] $*"
}
 
wait_for_master() {
    log_info "Waiting for master at ${MASTER_ADDR}:${MASTER_PORT}"
 
    # Bash built-in TCP check instead of nc
    while ! &>/dev/null </dev/tcp/${MASTER_ADDR}/${MASTER_PORT}; do
        sleep 1
    done
 
    log_info "Master ${MASTER_ADDR}:${MASTER_PORT} is ready"
}
 
if [ "${current_host}" = "${MASTER_ADDR}" ]; then
    log_info "Starting MASTER process"
    "${base_cmd[@]}"
else
    wait_for_master
    log_info "Starting WORKER process"
    "${base_cmd[@]}"
fi
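
Because the batch script calls ./launcher.sh from the container working directory /workspace, the launcher has to be executable and reachable from the directory configured as CONTAINER_WORKSPACE_MOUNT. A hedged reminder, run from the directory that gets mounted as /workspace:

chmod +x launcher.sh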

5. Job Management

Basic SLURM commands to control and monitor your jobs.

# Submit job
sbatch my_first_job.slurm
 
# Monitor queue
squeue
 
# Cancel job
scancel <job_id>
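
Beyond these basics, a few standard SLURM commands are useful for tracking your own jobs and inspecting finished ones (a sketch; all of them are regular SLURM client tools):

# Show only your own jobs
squeue -u $USER

# Detailed view of a pending or running job (nodes, reason, requested resources)
scontrol show job <job_id>

# Accounting information for a finished job
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode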

You can find the logs under:

# To check if communication is working
tail -f logs/<job_id>/debug_error.log
 
# To check if the launcher is working
tail -f logs/<job_id>/debug_output.log
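
The job template also writes NCCL debug logs via NCCL_DEBUG_FILE into /workspace/nccl_logs inside the container. On the host, these end up under the mounted workspace (a sketch, assuming the paths from the template and that the nccl_logs directory exists):

# NCCL communication logs, one file per host and process
ls /home/<username>/<Your_Project_Folder>/train/nccl_logs/
tail -f /home/<username>/<Your_Project_Folder>/train/nccl_logs/<job_id>_*.txt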