Benchmark GPU Performance

Compare GPU performance before committing to a configuration. Deploy instances with different GPUs, run standardized benchmarks, and choose the right GPU for your workload.

1. Deploy benchmark instances

runcrate instances create --name bench-4090 --gpu RTX4090
runcrate instances create --name bench-a100 --gpu A100
runcrate instances create --name bench-h100 --gpu H100

runcrate instances status bench-4090
runcrate instances status bench-a100
runcrate instances status bench-h100
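The create and status commands above follow one pattern per instance, so they can be generated from a single mapping. The following is a minimal sketch, not part of the runcrate tooling: the helper names (`create_command`, `status_command`, `deploy_all`) are hypothetical, and only the CLI arguments already shown above are used. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess

# Instance name -> GPU type, copied from the commands above.
INSTANCES = {
    "bench-4090": "RTX4090",
    "bench-a100": "A100",
    "bench-h100": "H100",
}


def create_command(name, gpu):
    """Return the argv list that creates one benchmark instance."""
    return ["runcrate", "instances", "create", "--name", name, "--gpu", gpu]


def status_command(name):
    """Return the argv list that queries one instance's status."""
    return ["runcrate", "instances", "status", name]


def deploy_all(instances, dry_run=True):
    """Build create commands for every instance, then status commands.

    With dry_run=True the commands are only printed; otherwise each one
    is executed and a non-zero exit code raises CalledProcessError.
    """
    commands = [create_command(n, g) for n, g in instances.items()]
    commands += [status_command(n) for n in instances]
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    deploy_all(INSTANCES, dry_run=True)
```

Passing `dry_run=False` would run the same commands for real, which assumes the `runcrate` CLI is installed and authenticated on the local machine.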

2. Install benchmark tools

runcrate ssh bench-4090 -- "pip install torch vllm"
runcrate ssh bench-a100 -- "pip install torch vllm"
runcrate ssh bench-h100 -- "pip install torch vllm"

3. FP16 matrix multiplication benchmark

Test raw compute throughput. The command below targets bench-4090; repeat it on bench-a100 and bench-h100 to compare:

runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 8192
a = torch.randn(size, size, dtype=torch.float16, device='cuda')
b = torch.randn(size, size, dtype=torch.float16, device='cuda')
# Warm up so one-time CUDA setup is not included in the timing
for _ in range(10): torch.mm(a, b)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100): torch.mm(a, b)
# Wait for all queued kernels to finish before stopping the clock
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Each matmul performs 2 * size^3 floating-point operations
tflops = (2 * size**3 * 100) / elapsed / 1e12
print(f'FP16 matmul: {tflops:.1f} TFLOPS ({elapsed:.2f}s for 100 iters)')
\""
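The TFLOPS figure printed by the benchmark comes from a simple operation count: multiplying two n x n matrices takes n^3 multiply-adds, or 2 * n^3 floating-point operations. The sketch below isolates that arithmetic so it can be checked without a GPU. The function name `matmul_tflops` and the 0.5 s timing are illustrative assumptions, not measured values.

```python
def matmul_tflops(size, iters, elapsed_s):
    """Achieved TFLOPS for `iters` square matmuls of dimension `size`.

    An n x n by n x n matmul performs n**3 multiply-adds,
    which is 2 * n**3 floating-point operations.
    """
    flops = 2 * size**3 * iters
    return flops / elapsed_s / 1e12


# Hypothetical timing: 100 iterations of an 8192 x 8192 matmul in 0.5 s.
print(f"{matmul_tflops(8192, 100, 0.5):.1f} TFLOPS")  # prints "219.9 TFLOPS"
```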

4. Memory bandwidth test

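Measure device memory bandwidth with a large on-device copy. As with the matmul test, repeat the command on each instance:

runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 256 * 1024 * 1024
a = torch.randn(size, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)
# Warm up so one-time setup is not included in the timing
for _ in range(10): b.copy_(a)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100): b.copy_(a)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Each copy reads and writes size elements of 2 bytes each
bw = (2 * size * 2 * 100) / elapsed / 1e9
print(f'Memory bandwidth: {bw:.0f} GB/s')
\""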
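The bandwidth formula counts every byte twice because a device-to-device copy reads the source tensor and writes the destination. The sketch below reproduces that calculation on its own. The function name `copy_bandwidth_gbs` and the 0.1 s timing are illustrative assumptions rather than measured results.

```python
def copy_bandwidth_gbs(num_elements, bytes_per_element, iters, elapsed_s):
    """Effective bandwidth in GB/s for `iters` tensor copies.

    Each copy moves the data twice: one read of the source
    and one write of the destination.
    """
    bytes_moved = 2 * num_elements * bytes_per_element * iters
    return bytes_moved / elapsed_s / 1e9


# Same tensor as the benchmark: 256 Mi FP16 elements (2 bytes each).
size = 256 * 1024 * 1024
# Hypothetical timing: 100 copies in 0.1 s.
print(f"{copy_bandwidth_gbs(size, 2, 100, 0.1):.0f} GB/s")  # prints "1074 GB/s"
```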

5. Check GPU specs

runcrate ssh bench-4090 -- "python -c \"
import torch
p = torch.cuda.get_device_properties(0)
print(f'GPU: {p.name}, VRAM: {p.total_memory / 1e9:.1f} GB, SMs: {p.multi_processor_count}')
\""

Expected results

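Benchmark	RTX 4090	A100 80 GB	H100 80 GB
FP16 matmul (TFLOPS)	~165	~312	~990
Memory bandwidth (GB/s)	~1,000	~2,000	~3,350
Llama 8B tok/s (batch=1)	~90	~130	~180

The compute and bandwidth rows are close to the published peak specifications for these GPUs, so measured values from the benchmarks above will typically be somewhat lower.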
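To read the table as relative speedups, each row can be divided by the RTX 4090 column. The following is a small sketch using only the approximate figures from the table. The dictionary layout and the `relative_to` helper are illustrative choices.

```python
# Approximate figures copied from the expected-results table.
EXPECTED = {
    "RTX 4090": {"tflops": 165, "bandwidth_gbs": 1000, "tok_s": 90},
    "A100 80 GB": {"tflops": 312, "bandwidth_gbs": 2000, "tok_s": 130},
    "H100 80 GB": {"tflops": 990, "bandwidth_gbs": 3350, "tok_s": 180},
}


def relative_to(baseline, table):
    """Express every metric as a multiple of the baseline GPU's value."""
    base = table[baseline]
    return {
        gpu: {metric: value / base[metric] for metric, value in row.items()}
        for gpu, row in table.items()
    }


for gpu, r in relative_to("RTX 4090", EXPECTED).items():
    print(
        f"{gpu}: compute {r['tflops']:.2f}x, "
        f"bandwidth {r['bandwidth_gbs']:.2f}x, tok/s {r['tok_s']:.2f}x"
    )
```

On these figures the batch=1 token rate grows more slowly than either peak number (2.00x for the H100 against 6.00x compute and 3.35x bandwidth), which suggests that spec-sheet ratios alone do not predict end-to-end throughput.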

Tips

Run each benchmark three times and average the results, since GPU boost clocks vary between runs.
The FP16 matmul test reflects compute-bound workloads such as training. The memory bandwidth test reflects memory-bound workloads such as inference at small batch sizes.
The RTX 4090 offers the best price-to-performance for inference. The H100 is best for training throughput.
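Averaging repeated runs, as the first tip suggests, can be sketched as follows. The three readings are hypothetical placeholders, and the `summarize_runs` helper and its spread metric (the range as a percentage of the mean) are illustrative choices rather than a prescribed method.

```python
import statistics


def summarize_runs(values):
    """Return the mean and the run-to-run spread as a percentage of the mean."""
    mean = statistics.mean(values)
    spread_pct = (max(values) - min(values)) / mean * 100
    return mean, spread_pct


# Hypothetical TFLOPS readings from three runs on one instance.
runs = [151.2, 148.7, 150.1]
mean, spread_pct = summarize_runs(runs)
print(f"mean {mean:.1f} TFLOPS, spread {spread_pct:.1f}%")
```

A large spread relative to the gap between two GPUs would be a reason to collect more runs before drawing a conclusion.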

Cleanup

runcrate instances delete bench-4090
runcrate instances delete bench-a100
runcrate instances delete bench-h100
