How to Run MLPerf Llama 2 70B Training on AMD MI325X Without SLURM

This guide covers running the MLPerf Training v5.1 Llama 2 70B LoRA fine-tuning benchmark on a multi-node AMD Instinct MI325X cluster without a SLURM scheduler.

Overview

AMD provides an official MLPerf Training Docker image (rocm/amd-mlperf:llama2_70b_training_5.1) designed primarily for SLURM-managed clusters. However, many environments use simpler SSH-based orchestration. This post demonstrates how to run multi-node distributed training using PyTorch’s rendezvous mechanism.

Hardware Setup

  • Cluster: 4× MiTAC MI325X nodes
  • GPUs: 8× AMD Instinct MI325X per node (32 total)
  • Network: High-speed interconnect for RCCL communication
  • Storage: Shared NFS mount at /mnt/shared

Prerequisites

Software Dependencies

Component | Version               | Notes
ROCm      | 6.2+                  | Host driver and runtime
Docker    | 24.0+                 | With GPU support configured
RCCL      | Included in container | ROCm Collective Communications Library
PyTorch   | 2.4+ (ROCm)           | Included in container
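
A quick way to confirm the host-side versions before pulling the container (a minimal sketch; /opt/rocm/.info/version is the usual location of the ROCm release string, but paths can differ by install method):

cat /opt/rocm/.info/version   # ROCm release installed on the host
docker --version              # Docker engine version
rocm-smi                      # should list all 8 MI325X GPUs per node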

Host Setup

  1. ROCm Installation: Follow the ROCm installation guide for your Linux distribution.

  2. Docker GPU Access: Ensure Docker can access AMD GPUs:

   docker run --rm --device /dev/dri --device /dev/kfd rocm/pytorch:latest rocm-smi
  3. Multi-Node Networking:

    • Passwordless SSH between all nodes
    • High-speed network interface (InfiniBand/RoCE recommended)
    • Shared filesystem accessible from all nodes
  4. Pull the MLPerf Container on all nodes (a parallel multi-node pull is sketched after this list):

   docker pull rocm/amd-mlperf:llama2_70b_training_5.1
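
Since every node needs the image locally, the pull can be run in parallel over SSH. A minimal sketch, assuming passwordless SSH and node names node-0 through node-3 (placeholders matching the orchestration loop later in this post):

for node in node-0 node-1 node-2 node-3; do
  ssh "$node" "docker pull rocm/amd-mlperf:llama2_70b_training_5.1" &
done
wait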

Data Preparation

The benchmark requires roughly 270 GB of storage for the Llama 2 70B model and the GovReport dataset. A Hugging Face token with the Llama 2 license accepted is required:

export HF_TOKEN=your_token_here
./finetune_llama.sh prepare
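
Before kicking off prepare, it is worth confirming that the shared mount has headroom for the ~270 GB download and that the token is actually exported. A small sketch (assumes /mnt/shared is the shared NFS mount from the hardware setup):

df -h /mnt/shared                              # expect comfortably more than 270 GB free
test -n "$HF_TOKEN" && echo "HF_TOKEN is set"  # the prepare step reads this variable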

Two Approaches for Multi-Node Training

AMD’s container supports two launch methods:

1. SLURM-Based (AMD Default)

# Requires SLURM scheduler
sbatch run_with_docker_slurm.sh

2. Manual Multi-Node with Rendezvous

For non-SLURM environments, PyTorch’s torchrun supports a rendezvous backend that handles rank assignment automatically:

torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=MASTER_IP:29500 \
  --rdzv_id=mlperf_run \
  train.py

This command runs identically on every node; the c10d rendezvous backend coordinates node and global rank assignment automatically.
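
Before committing to a full benchmark run, the rendezvous can be validated with a minimal all_reduce test. This is a sketch, not part of AMD's scripts: the script path and rdzv_id are arbitrary, and the identical command must be run on every node from a ROCm PyTorch environment (for example, inside the MLPerf container):

cat > /tmp/rdzv_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)            # ROCm GPUs appear under torch.cuda
dist.init_process_group(backend="nccl")      # "nccl" maps to RCCL in ROCm builds of PyTorch
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                           # sums across all ranks; expect 32.0 on 4x8 GPUs
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {x.item()}")
dist.destroy_process_group()
EOF

torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=MASTER_IP:29500 \
  --rdzv_id=rdzv_check \
  /tmp/rdzv_check.py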

Implementation

Our approach uses SSH to launch training on each node, passing the distributed configuration via environment variables:

Container Launch Pattern

Each node runs:

# Start container with data mounts
docker run --rm --init --detach \
  --net=host --ipc=host \
  --device /dev/dri --device /dev/kfd \
  --name mlperf_llama2sft \
  -v $DATADIR/data:/data \
  -v $DATADIR/model:/ckpt \
  -v $RESULTS:/logs \
  -v $CODE_DIR:/workspace/code \
  rocm/amd-mlperf:llama2_70b_training_5.1 sleep infinity

# Execute training with distributed config
docker exec \
  -e MASTER_ADDR=$MASTER_IP \
  -e MASTER_PORT=29500 \
  -e SLURM_NNODES=$NUM_NODES \
  -e SLURM_NODEID=$NODE_RANK \
  -e NCCL_SOCKET_IFNAME=$NET_IF \
  mlperf_llama2sft \
  bash -c 'cd /workspace/code && source config_MI325X_4x8x1.sh && bash ./run_and_time_slurm.sh'
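
The NET_IF value passed to NCCL_SOCKET_IFNAME must be the interface sitting on the high-speed network. Interface names vary from cluster to cluster, so the commands below are only a starting point for identifying it:

ip -br addr show up   # list active interfaces and their addresses
ibdev2netdev          # if Mellanox/NVIDIA OFED is installed: maps RDMA devices to netdevs
# export NET_IF=<interface on the InfiniBand/RoCE subnet>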

Orchestration Script

The main script SSHs to each node in parallel:

for node_idx in 0 1 2 3; do
  ssh node-$node_idx "launch_training.sh $node_idx $NUM_NODES" &
done
wait
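
A slightly fuller version of the same loop, capturing each node's output and propagating failures. The node names and the /mnt/shared/launch_training.sh path are placeholders; the wrapper is assumed to contain the docker run/exec pattern shown above:

NUM_NODES=4
mkdir -p logs
pids=()
for node_idx in $(seq 0 $((NUM_NODES - 1))); do
  ssh "node-${node_idx}" "bash /mnt/shared/launch_training.sh ${node_idx} ${NUM_NODES}" \
    > "logs/node${node_idx}.log" 2>&1 &
  pids+=("$!")
done

status=0
for pid in "${pids[@]}"; do
  wait "${pid}" || status=1   # fail the whole run if any node's launch fails
done
exit "${status}"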

Key Configuration

The config_MI325X_4x8x1.sh script sets the critical parameters:

export DGXNNODES=4
export DGXNGPU=8
export FP8=True
export LR=0.0004
export MBS=1  # micro batch size
export MAX_STEPS=1024

Results

Single Node (8 GPUs)

Metric           | Value
Throughput       | 2.79 samples/sec
Time to Converge | 20.57 minutes
Final Loss       | 0.921 (target: ≤0.925)

Four Nodes (32 GPUs)

Metric           | Value
Throughput       | 11.15 samples/sec
Time to Converge | 12.40 minutes
Final Loss       | 0.924 (target: ≤0.925)

Scaling Analysis

Metric                    | 1 Node | 4 Nodes | Scaling Factor
GPUs                      | 8      | 32      | 4.0×
Batch Size                | 8      | 32      | 4.0×
Throughput (samples/sec)  | 2.79   | 11.15   | 4.0×

Essentially linear throughput scaling (4.0× on 4× the GPUs) indicates that the network interconnect is not a bottleneck at this scale.
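
As a quick sanity check on these numbers (plain shell arithmetic over the values in the tables above):

awk 'BEGIN {
  printf "Throughput scaling:       %.2fx  (11.15 / 2.79 samples/sec)\n", 11.15 / 2.79
  printf "Time-to-converge speedup: %.2fx  (20.57 / 12.40 minutes)\n", 20.57 / 12.40
}'

Throughput scales essentially linearly, while time to convergence improves by only about 1.7×; the gap is the batch-size effect on convergence discussed in the takeaways below.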

Comparison with Official Results

Our single-node result (20.57 min) matches AMD’s official MLPerf v5.1 submission (~21 min) for MI325X, confirming correct configuration.

Key Takeaways

  1. Container Design: AMD’s container expects training scripts at /workspace/code – mount custom configs there rather than extracting files.

  2. Network Interface: Set NCCL_SOCKET_IFNAME to your high-speed network interface for optimal RCCL performance.

  3. SLURM Variables: The container’s run_and_time_slurm.sh reads SLURM_NNODES and SLURM_NODEID – these can be set manually for non-SLURM environments.

  4. Scaling: Expect near-linear throughput scaling on properly configured clusters. Time-to-convergence scaling may differ due to batch size effects on convergence dynamics.

Resources

Full Script

The complete finetune_llama.sh script supports:

  • Single and multi-node runs
  • Configurable NEXP for MLPerf-compliant submissions
  • Automatic config selection based on node count
./finetune_llama.sh run 4      # 4-node, single run
./finetune_llama.sh run 4 10   # 4-node, 10 runs (MLPerf submission)

Interested in the full script? Reach out via LinkedIn and I’ll be happy to share.
