# Benchmark GPU Performance

> Deploy different GPUs, run standard AI benchmarks, and compare performance across RTX 4090, A100, and H100.

Compare GPU performance before committing to a configuration. Deploy instances with different GPUs, run standardized benchmarks, and choose the right GPU for your workload.

## 1. Deploy benchmark instances

```bash theme={"theme":"github-dark"}
runcrate instances create --name bench-4090 --gpu RTX4090
runcrate instances create --name bench-a100 --gpu A100
runcrate instances create --name bench-h100 --gpu H100
```

Confirm that each instance is up before continuing:

```bash theme={"theme":"github-dark"}
runcrate instances status bench-4090
runcrate instances status bench-a100
runcrate instances status bench-h100
```

## 2. Install benchmark tools

```bash theme={"theme":"github-dark"}
runcrate ssh bench-4090 -- "pip install torch vllm"
runcrate ssh bench-a100 -- "pip install torch vllm"
runcrate ssh bench-h100 -- "pip install torch vllm"
```

## 3. FP16 matrix multiplication benchmark

Measure raw compute throughput. The example targets `bench-4090`; repeat it on `bench-a100` and `bench-h100` to compare:

```bash theme={"theme":"github-dark"}
runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 8192
a = torch.randn(size, size, dtype=torch.float16, device='cuda')
b = torch.randn(size, size, dtype=torch.float16, device='cuda')
# Warm up so kernel selection and clock ramp-up are not timed
for _ in range(10): torch.mm(a, b)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100): torch.mm(a, b)
# Kernels launch asynchronously, so wait for the GPU before stopping the clock
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# One size x size matmul costs 2 * size^3 floating-point operations
tflops = (2 * size**3 * 100) / elapsed / 1e12
print(f'FP16 matmul: {tflops:.1f} TFLOPS ({elapsed:.2f}s for 100 iters)')
\""
```
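
The TFLOPS figure printed by the script comes from a fixed operation count divided by the measured time. As a minimal sketch of that arithmetic, the helper below (a hypothetical name, `matmul_tflops`) recomputes it outside the GPU script; the 1.0 s timing is a placeholder, not a measured result.

```python
def matmul_tflops(size: int, iters: int, elapsed_s: float) -> float:
    """Return TFLOPS for `iters` square (size x size) matmuls that took elapsed_s seconds."""
    # Each output element needs `size` multiplies and `size` adds: 2 * size^3 in total.
    flops_per_matmul = 2 * size**3
    return flops_per_matmul * iters / elapsed_s / 1e12


# Placeholder timing for illustration only: 100 iterations in 1.0 s.
print(f"{matmul_tflops(8192, 100, 1.0):.1f} TFLOPS")
```

Halving the elapsed time doubles the reported TFLOPS, which is why run-to-run timing noise shows up directly in the result.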

## 4. Memory bandwidth test

Measure device memory bandwidth with a large on-device copy. Repeat on each instance to compare:

```bash theme={"theme":"github-dark"}
runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 256 * 1024 * 1024  # FP16 elements, 512 MiB per tensor
a = torch.randn(size, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)
# Warm up so first-call overhead is not timed
for _ in range(10): b.copy_(a)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100): b.copy_(a)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Each copy reads and writes size elements of 2 bytes each
bw = (2 * size * 2 * 100) / elapsed / 1e9
print(f'Memory bandwidth: {bw:.0f} GB/s')
\""
```
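
The bandwidth formula counts every byte twice, because a copy reads the source tensor and writes the destination. The sketch below restates that calculation with a hypothetical helper, `copy_bandwidth_gbs`; the 0.1 s timing is a placeholder chosen for illustration.

```python
def copy_bandwidth_gbs(num_elements: int, bytes_per_element: int,
                       iters: int, elapsed_s: float) -> float:
    """Return effective GB/s for `iters` device-to-device copies."""
    # One copy moves the tensor twice: one read of the source, one write of the destination.
    bytes_moved = 2 * num_elements * bytes_per_element * iters
    return bytes_moved / elapsed_s / 1e9


# Same tensor shape as the script above: 256 Mi FP16 elements, 100 copies.
# The elapsed time is a placeholder, not a measurement.
print(f"{copy_bandwidth_gbs(256 * 1024 * 1024, 2, 100, 0.1):.0f} GB/s")
```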

## 5. Check GPU specs

```bash theme={"theme":"github-dark"}
runcrate ssh bench-4090 -- "python -c \"
import torch
p = torch.cuda.get_device_properties(0)
# total_memory is reported in bytes
print(f'GPU: {p.name}, VRAM: {p.total_memory / 1e9:.1f} GB, SMs: {p.multi_processor_count}')
\""
```
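
The script divides the byte count by `1e9`, so it prints decimal gigabytes, which read slightly higher than the binary GiB figure for the same memory. A small sketch of the difference, using a hypothetical helper `describe_vram` and an illustrative byte count rather than a value read from a real card:

```python
def describe_vram(total_bytes: int) -> str:
    """Format a byte count in both decimal GB and binary GiB."""
    gb = total_bytes / 1e9      # decimal gigabytes, as printed by the script above
    gib = total_bytes / 2**30   # binary gibibytes
    return f"{gb:.1f} GB ({gib:.1f} GiB)"


# Illustrative input: exactly 24 GiB expressed in bytes.
print(describe_vram(24 * 2**30))
```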

## Expected results

The compute and bandwidth rows are close to the vendors' published peak figures, so treat them as upper bounds: results measured with the scripts above will typically come in lower.

| Benchmark                | RTX 4090 | A100 80 GB | H100 80 GB |
| ------------------------ | -------- | ---------- | ---------- |
| FP16 matmul (TFLOPS)     | \~165    | \~312      | \~990      |
| Memory bandwidth (GB/s)  | \~1,000  | \~2,000    | \~3,350    |
| Llama 8B tok/s (batch=1) | \~90     | \~130      | \~180      |
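
One way to read the first two rows together is to divide compute by bandwidth. The result is the number of floating-point operations a kernel must perform per byte moved before compute, rather than memory, becomes the limit. The sketch below applies that ratio to the approximate table values; the `EXPECTED` dictionary and `flops_per_byte` helper are names introduced here for illustration.

```python
# Approximate values copied from the table above: (FP16 TFLOPS, memory bandwidth in GB/s).
EXPECTED = {
    "RTX 4090": (165, 1000),
    "A100 80 GB": (312, 2000),
    "H100 80 GB": (990, 3350),
}


def flops_per_byte(tflops: float, gbs: float) -> float:
    """Return the FLOP-per-byte ratio at which compute and bandwidth limits meet."""
    return (tflops * 1e12) / (gbs * 1e9)


for gpu, (tflops, gbs) in EXPECTED.items():
    print(f"{gpu}: ~{flops_per_byte(tflops, gbs):.0f} FLOP/byte")
```

Workloads that do less arithmetic per byte than this ratio are limited by memory bandwidth, so the bandwidth row is the better predictor for them.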

## Tips

* Run each benchmark 3 times and average the results, since GPU boost clocks vary between runs.
* The FP16 matmul test reflects compute-bound workloads such as training. The memory bandwidth test reflects memory-bound workloads such as inference.
* The RTX 4090 offers the best price-to-performance for inference. The H100 is best for training throughput.
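
The first tip can be sketched as a small aggregation step: collect the figure printed by each run, then report the mean along with the spread, so a noisy set of runs is easy to spot. The helper name `summarize` and the three sample values are placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize(runs: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) for repeated benchmark results."""
    return mean(runs), stdev(runs)


# Placeholder TFLOPS values standing in for three runs on one instance.
runs = [158.2, 161.0, 159.4]
avg, spread = summarize(runs)
print(f"mean {avg:.1f} TFLOPS, stdev {spread:.1f}")
```

If the standard deviation is a large fraction of the mean, more runs are likely needed before comparing GPUs.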

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete bench-4090
runcrate instances delete bench-a100
runcrate instances delete bench-h100
```