Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

Train large models faster by distributing work across multiple GPUs on a single instance.

1. Deploy a multi-GPU instance

runcrate instances create --name multi-gpu --gpu A100 --gpu-count 4 --template ubuntu-train
runcrate instances status multi-gpu

2. Verify GPU topology

runcrate ssh multi-gpu -- "nvidia-smi"
runcrate ssh multi-gpu -- "python -c 'import torch; print(f\"GPUs: {torch.cuda.device_count()}\")'"

3. PyTorch DDP training

Upload and run your training script with torchrun:
runcrate cp ./train_ddp.py multi-gpu:/workspace/

runcrate ssh multi-gpu -- "cd /workspace && torchrun \
  --nproc_per_node=4 \
  --master_port=29500 \
  train_ddp.py \
  --batch-size 32 \
  --epochs 10 \
  --lr 1e-4"

4. DeepSpeed ZeRO (for larger models)

runcrate ssh multi-gpu -- "pip install deepspeed"

runcrate cp ./ds_config.json multi-gpu:/workspace/
runcrate cp ./train_deepspeed.py multi-gpu:/workspace/

runcrate ssh multi-gpu -- "cd /workspace && deepspeed \
  --num_gpus=4 \
  train_deepspeed.py \
  --deepspeed_config ds_config.json"

5. Monitor training

runcrate ssh multi-gpu -- nvidia-smi
runcrate ssh multi-gpu -- "nvidia-smi topo -m"
runcrate ssh multi-gpu -- "tail -30 /workspace/output/training.log"

6. Download results

runcrate cp multi-gpu:/workspace/output/ ./training-output/

Tips

  • DDP scales linearly with GPU count for data-parallel workloads. 4 GPUs = ~3.8x throughput.
  • DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across GPUs — use it when the model does not fit on a single GPU.
  • Use --gradient_accumulation_steps to simulate larger batch sizes without increasing per-GPU memory.

Cleanup

runcrate instances delete multi-gpu