Documentation Index
Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Compare GPU performance before committing to a configuration. Deploy instances with different GPUs, run standardized benchmarks, and choose the right GPU for your workload.
1. Deploy benchmark instances
runcrate instances create --name bench-4090 --gpu RTX4090
runcrate instances create --name bench-a100 --gpu A100
runcrate instances create --name bench-h100 --gpu H100
runcrate instances status bench-4090
runcrate instances status bench-a100
runcrate instances status bench-h100
runcrate ssh bench-4090 -- "pip install torch vllm"
runcrate ssh bench-a100 -- "pip install torch vllm"
runcrate ssh bench-h100 -- "pip install torch vllm"
3. FP16 matrix multiplication benchmark
Test raw compute throughput — run on each instance to compare:
runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 8192
a = torch.randn(size, size, dtype=torch.float16, device='cuda')
b = torch.randn(size, size, dtype=torch.float16, device='cuda')
for _ in range(10): torch.mm(a, b)
torch.cuda.synchronize()
start = time.time()
for _ in range(100): torch.mm(a, b)
torch.cuda.synchronize()
elapsed = time.time() - start
tflops = (2 * size**3 * 100) / elapsed / 1e12
print(f'FP16 matmul: {tflops:.1f} TFLOPS ({elapsed:.2f}s for 100 iters)')
\""
4. Memory bandwidth test
runcrate ssh bench-4090 -- "python -c \"
import torch, time
size = 256 * 1024 * 1024
a = torch.randn(size, dtype=torch.float16, device='cuda')
b = torch.empty_like(a)
torch.cuda.synchronize()
start = time.time()
for _ in range(100): b.copy_(a)
torch.cuda.synchronize()
elapsed = time.time() - start
bw = (2 * size * 2 * 100) / elapsed / 1e9
print(f'Memory bandwidth: {bw:.0f} GB/s')
\""
5. Check GPU specs
runcrate ssh bench-4090 -- "python -c \"
import torch
p = torch.cuda.get_device_properties(0)
print(f'GPU: {p.name}, VRAM: {p.total_mem / 1e9:.1f} GB, SMs: {p.multi_processor_count}')
\""
Expected results
| Benchmark | RTX 4090 | A100 80 GB | H100 80 GB |
|---|
| FP16 matmul (TFLOPS) | ~165 | ~312 | ~990 |
| Memory bandwidth (GB/s) | ~1,000 | ~2,000 | ~3,350 |
| Llama 8B tok/s (batch=1) | ~90 | ~130 | ~180 |
Tips
- Run benchmarks 3 times and average — GPU boost clocks vary between runs.
- FP16 matmul tests compute-bound workloads (training). Memory bandwidth tests memory-bound workloads (inference).
- The RTX 4090 offers the best price-to-performance for inference. The H100 is best for training throughput.
Cleanup
runcrate instances delete bench-4090
runcrate instances delete bench-a100
runcrate instances delete bench-h100