> ## Documentation Index
> Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Multi-GPU Distributed Training

> Set up PyTorch DDP or DeepSpeed on a multi-GPU instance for distributed model training.

export const RuncrateStyles = () => {
  if (typeof document !== 'undefined' && !document.getElementById('runcrate-overrides')) {
    const s = document.createElement('style');
    s.id = 'runcrate-overrides';
    s.textContent = `
      /* Match Runcrate's rounding scale (--radius: 0.75rem) */
      .rounded-sm { border-radius: 0.5rem !important; }   /* 8px */
      .rounded-md { border-radius: 0.625rem !important; } /* 10px */
      .rounded-lg { border-radius: 0.75rem !important; }  /* 12px */
      .rounded-l-sm { border-top-left-radius: 0.5rem !important; border-bottom-left-radius: 0.5rem !important; }
      .rounded-r-sm { border-top-right-radius: 0.5rem !important; border-bottom-right-radius: 0.5rem !important; }
      .rounded-l-md { border-top-left-radius: 0.625rem !important; border-bottom-left-radius: 0.625rem !important; }
      .rounded-r-md { border-top-right-radius: 0.625rem !important; border-bottom-right-radius: 0.625rem !important; }
      .rounded-l-lg { border-top-left-radius: 0.75rem !important; border-bottom-left-radius: 0.75rem !important; }
      .rounded-r-lg { border-top-right-radius: 0.75rem !important; border-bottom-right-radius: 0.75rem !important; }

      /* Cards: never pure white in light mode */
      .card { background-color: #fcfcfc !important; border-radius: 0.75rem !important; }
      html.dark .card { background-color: #141414 !important; }

      /* Docs hero box */
      .rc-hero { background-color: #fcfcfc; border: 1px solid #e0e0e0; }
      html.dark .rc-hero { background-color: #141414; border-color: #242424; }
      html.dark .rc-hero h1 { color: #f5f5f5; }

      /* Runcrate scrollbar — thin, transparent track, hide-until-hover thumb */
      ::-webkit-scrollbar { width: 6px; height: 6px; background-color: transparent; }
      ::-webkit-scrollbar-track { background-color: transparent; }
      ::-webkit-scrollbar-thumb { background-color: rgba(155, 155, 155, 0.5); border-radius: 10px; transition: opacity 0.3s ease; opacity: 0; }
      ::-webkit-scrollbar-thumb:hover { background-color: rgba(155, 155, 155, 0.7); }
      *:hover::-webkit-scrollbar-thumb,
      *:focus::-webkit-scrollbar-thumb,
      *:active::-webkit-scrollbar-thumb { opacity: 1; }
      * { scrollbar-width: thin; scrollbar-color: rgba(155, 155, 155, 0.5) transparent; }
    `;
    document.head.appendChild(s);
  }
  return null;
};

<RuncrateStyles />

Train large models faster by distributing work across multiple GPUs on a single instance.

## 1. Deploy a multi-GPU instance

```bash theme={"theme":"github-dark"}
runcrate instances create --name multi-gpu --gpu A100 --gpu-count 4 --template ubuntu-train
runcrate instances status multi-gpu
```

## 2. Verify GPU topology

```bash theme={"theme":"github-dark"}
runcrate ssh multi-gpu -- "nvidia-smi"
runcrate ssh multi-gpu -- "python -c 'import torch; print(f\"GPUs: {torch.cuda.device_count()}\")'"
```

## 3. PyTorch DDP training

Upload and run your training script with `torchrun`:

```bash theme={"theme":"github-dark"}
runcrate cp ./train_ddp.py multi-gpu:/workspace/

runcrate ssh multi-gpu -- "cd /workspace && torchrun \
  --nproc_per_node=4 \
  --master_port=29500 \
  train_ddp.py \
  --batch-size 32 \
  --epochs 10 \
  --lr 1e-4"
```

## 4. DeepSpeed ZeRO (for larger models)

```bash theme={"theme":"github-dark"}
runcrate ssh multi-gpu -- "pip install deepspeed"

runcrate cp ./ds_config.json multi-gpu:/workspace/
runcrate cp ./train_deepspeed.py multi-gpu:/workspace/

runcrate ssh multi-gpu -- "cd /workspace && deepspeed \
  --num_gpus=4 \
  train_deepspeed.py \
  --deepspeed_config ds_config.json"
```

## 5. Monitor training

```bash theme={"theme":"github-dark"}
runcrate ssh multi-gpu -- nvidia-smi
runcrate ssh multi-gpu -- "nvidia-smi topo -m"
runcrate ssh multi-gpu -- "tail -30 /workspace/output/training.log"
```

## 6. Download results

```bash theme={"theme":"github-dark"}
runcrate cp multi-gpu:/workspace/output/ ./training-output/
```

## Tips

* DDP scales linearly with GPU count for data-parallel workloads. 4 GPUs = \~3.8x throughput.
* DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across GPUs — use it when the model does not fit on a single GPU.
* Use `--gradient_accumulation_steps` to simulate larger batch sizes without increasing per-GPU memory.

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete multi-gpu
```
