Documentation Index
Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Runcrate
Inference first. Compute when you need it. Runcrate is built around two products on a single account, a single bill, and a single control plane. Most teams start with the Inference Engine — one API for 140+ open-source models. When they outgrow shared inference, the Compute side gives them on-demand GPU instances and dedicated clusters without re-onboarding.Inference Engine
OpenAI-compatible API for 140+ open-source models. Chat, image, video, TTS, ASR — billed per token or per generation.
Compute
On-demand GPU instances and dedicated clusters. Containers, VMs, or bare-metal — H100, H200, B200, B300.
Inference Engine
Production inference for the open-source model ecosystem. One API, one bill, one place to manage everything.Models API
140+ open-source models behind a single OpenAI-compatible endpoint — Llama, DeepSeek, Qwen, GLM, Kimi, Mistral, and more.
SDKs
Official Python and TypeScript clients. Drop-in replacements for the OpenAI SDK — point the base URL at Runcrate and you’re done.
Per-token billing
Pay only for tokens generated. No idle GPUs, no minimums, no commitments.
Get started with the Inference Engine
Make your first API call in under 60 seconds.
Compute
When you need a specific GPU, full root access, or reserved capacity, the Compute side covers it — on the same account as inference.GPU Instances
Containers, VMs, or bare-metal with dedicated NVIDIA GPUs. Root SSH access, hourly billing, deploy in 60 seconds.
Storage
Persistent volumes with a built-in file explorer. Data survives instance termination.
Dedicated Clusters
Reserved bare-metal clusters from 16 to 128+ nodes. H100, H200, B200, B300 with InfiniBand. 12–24 month terms.
Browse compute options
Pick a GPU, pick a region, deploy.
Which one do you need?
| Inference Engine | Compute | |
|---|---|---|
| Best for | Building AI features on open-source models | Training, fine-tuning, custom inference servers, reserved capacity |
| Billing | Per token / per generation | Per hour (instances) · Monthly (dedicated) |
| Setup time | 60 seconds | 60 seconds (instances) · 1–2 weeks (dedicated) |
| Commitment | None | None (instances) · 12–24 months (dedicated) |
| Access | Self-serve · API key | Self-serve (instances) · Contact sales (dedicated) |
| GPUs | Managed for you | H100, H200, B200, B300, A100, L40S, RTX 4090 |