SLURM Cluster Quickstart Guide
This guide provides a step-by-step workflow for using the cluster's GPU resources effectively with SLURM: setting up an Enroot container, configuring the environment, running interactive sessions, submitting multi-node batch jobs, and managing workloads.
1. Create Enroot Container
This step creates a lightweight, self-contained environment from an existing `.sqsh` image. Different releases are available for different PyTorch and Python versions; to make things run smoothly, please use the correct version for your project.
If your preferred version is not available, please reach out to us via email: ki-servicezentrum(at)hpi.de
NOTE: Always name your Enroot container pyxis_<your_container_name> so that it can be used in a multi-node setting.
enroot create -n pyxis_torch_2412 /sc/home/<username>/nvidia-pytorch-24.12-py312.sqsh
Use the following command to list your containers; you should see the newly created container pyxis_torch_2412:
enroot list
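If you later need to recreate the container (for example, to switch to a different image version), the old one can be removed first:
enroot remove pyxis_torch_2412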
2. Environment Setup
# Allocate one GPU to be able to start the Enroot container
srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=aisc --account=aisc --export=ALL --pty bash

# Start the Enroot container, mounting the current working directory
enroot start --mount $(pwd):/workspace pyxis_torch_2412

# Create a virtual environment
python -m venv venv

# Activate it and install dependencies
source venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-slurm.txt
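Before starting real work, a quick sanity check inside the container is worthwhile; a minimal sketch, assuming PyTorch is installed in the virtual environment (e.g. via requirements-slurm.txt):
# Check that the allocated GPU is visible
nvidia-smi

# Check that PyTorch can use it (version string depends on the image and venv)
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"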
3. Interactive Sessions
While developing, you can request one node (up to 8 GPUs on that node) to reduce overhead compared to a multi-node environment.
- Why it’s Useful: Enables quick debugging and interactive experimentation.
- What to Watch Out For: Pay attention to time limits (`--time=01:00:00`) and GPU requests so you don't get preempted or blocked by the scheduler.
Single GPU (1hr)
srun --nodes=1 --ntasks=1 --gpus=1 --time=01:00:00 --partition=aisc --account=aisc --export=ALL --pty bash
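For single-node multi-GPU experiments, the same command can be scaled up to the 8 GPUs available on a node, for example:
srun --nodes=1 --ntasks=1 --gpus=8 --time=01:00:00 --partition=aisc --account=aisc --export=ALL --pty bash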
You can also see the available GPUs by running:
gpualloc
3.1. Queues
Important:
Available SLURM partitions (queues):
At the moment, AISC users are allowed to use the partition called aisc.
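To check the current state of the partition and its nodes, the standard SLURM command can be used:
sinfo -p aisc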
4. SLURM Batch Multi-Node Job Template
Use this when you need to scale beyond a single node. SLURM handles node allocation and job scheduling, while Enroot and Pyxis allow the job to communicate across multiple nodes using MPI.
- What to Watch Out For:
- Ensure your `--nodes` and `--gpus` match the actual cluster resources (see the `sinfo` check below).
- Confirm the container mounts include `/dev/infiniband` so the job can use InfiniBand for fast inter-node communication.
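As a quick check of what each node in the partition actually offers (node names, GPUs as GRES, CPUs, and memory), a standard `sinfo` query can help; the exact GRES labels depend on the cluster configuration:
sinfo -N -p aisc -o "%N %G %c %m"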
vim /home/<username>/my_first_job.slurm
Add the following to the file and adjust it as needed:
#!/bin/bash -eux

# ==============================
# SLURM Job Configuration
# (Adjust to your Configuration)
# ==============================
#SBATCH --job-name=pytorch_job
#SBATCH --nodes=2
#SBATCH --gpus-per-node=5
#SBATCH --ntasks-per-node=1
#SBATCH --output=logs/%j/debug_output.log
#SBATCH --error=logs/%j/debug_error.log
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --container-writable

# Paths (Adjust to your configuration)
export SHARED_STORAGE_ROOT=/home/<username>/
export CONTAINER_WORKSPACE_MOUNT=${SHARED_STORAGE_ROOT}/<Your_Project_Folder>/train
export CONTAINER_NAME=torch_2412

# NCCL Configuration for Distributed Training (Don't Change)
export NCCL_IB_DISABLE=0
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_DEBUG=TRACE
export NCCL_DEBUG_FILE="/workspace/nccl_logs/${SLURM_JOB_ID}_%h_%p.txt"
export SLURM_DEBUG=verbose
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Distributed Training Settings (Don't Change)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=${SLURM_GPUS_PER_NODE:-8}
export NNODES=${SLURM_NNODES:-1}
export NUM_PROCESSES=$(( NNODES * GPUS_PER_NODE ))
export MULTIGPU_FLAG="--multi_gpu"

# Hugging Face Cache Directory
export HF_HOME=${SHARED_STORAGE_ROOT}/.huggingface

# Adjust MULTIGPU_FLAG for Single Node
if [[ "$NNODES" -eq "1" ]]; then
    export MULTIGPU_FLAG=""
fi

# ==============================
# Job Information
# ==============================
echo "===== Job Information ====="
echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "GPUS_PER_NODE: $GPUS_PER_NODE"
echo "NNODES: $NNODES"
echo "NUM_PROCESSES: $NUM_PROCESSES"
echo "MULTIGPU_FLAG: $MULTIGPU_FLAG"
echo "HF_HOME: $HF_HOME"
echo "==========================="

# ==============================
# System Information
# ==============================
srun hostname
srun echo "CPUs on Node: $SLURM_CPUS_ON_NODE"
srun echo "Node ID: $SLURM_NODEID"

# ==============================
# Run Training
# ==============================
srun -l \
    --container-name "$CONTAINER_NAME" \
    --container-mounts "$CONTAINER_WORKSPACE_MOUNT:/workspace,/dev/infiniband:/dev/infiniband" \
    --container-writable \
    --container-workdir /workspace \
    --container-mount-home \
    --export=ALL \
    --nodes=$NNODES \
    --ntasks=$NNODES \
    --ntasks-per-node=1 \
    --verbose \
    ./launcher.sh
Save and quit Vim:
ESC :wq
Create a launcher file to ensure that the connection is not lost while waiting for the master node:
vim /home/<username>/launcher.sh
#!/bin/bash
set -e

source venv/bin/activate
current_host=$(hostname)

# Adjust the command to your configuration
base_cmd=(
    torchrun
    --rdzv_backend=c10d
    --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}"
    --nnodes="${NNODES}"
    --nproc_per_node="${GPUS_PER_NODE}"
    main.py
    --config_path config/train-dist.yaml
)

log_info() {
    echo "[$(date '+%T') - ${current_host}] $*"
}

wait_for_master() {
    log_info "Waiting for master at ${MASTER_ADDR}:${MASTER_PORT}"
    # Bash built-in TCP check instead of nc
    while ! &>/dev/null </dev/tcp/${MASTER_ADDR}/${MASTER_PORT}; do
        sleep 1
    done
    log_info "Master ${MASTER_ADDR}:${MASTER_PORT} is ready"
}

if [ "${current_host}" = "${MASTER_ADDR}" ]; then
    log_info "Starting MASTER process"
    "${base_cmd[@]}"
else
    wait_for_master
    log_info "Starting WORKER process"
    "${base_cmd[@]}"
fi
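Note that the batch script runs `./launcher.sh` relative to the container working directory (`/workspace`, i.e. the mounted `train` folder), so either place the script there or adjust the path in the batch script. In either case the file must be executable:
chmod +x launcher.sh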
5. Job Management
Basic SLURM commands to control and monitor your jobs.
# Submit job
sbatch my_first_job.slurm

# Monitor queue
squeue

# Cancel job
scancel <job_id>
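A few additional standard SLURM commands are often useful for monitoring (sacct requires job accounting to be enabled on the cluster):
# Show only your own jobs
squeue -u $USER

# Detailed information about a pending or running job
scontrol show job <job_id>

# Accounting summary after the job has finished
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ExitCode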
The logs can be found under:
# To check if communication is working
tail -f logs/<job_id>/debug_error.log

# To check if the launcher is working
tail -f logs/<job_id>/debug_output.log
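The batch script also sets NCCL_DEBUG_FILE to /workspace/nccl_logs/ inside the container; since /workspace is the mounted `train` directory, the NCCL traces should end up there on the host (the nccl_logs folder may need to exist beforehand; %h and %p are expanded by NCCL to hostname and process ID):
ls <Your_Project_Folder>/train/nccl_logs/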