Skip to main content

Documentation Index

Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt

Use this file to discover all available pages before exploring further.

Runcrate

Inference first. Compute when you need it. Runcrate is built around two products on a single account, a single bill, and a single control plane. Most teams start with the Inference Engine — one API for 140+ open-source models. When they outgrow shared inference, the Compute side gives them on-demand GPU instances and dedicated clusters without re-onboarding.

Inference Engine

OpenAI-compatible API for 140+ open-source models. Chat, image, video, TTS, ASR — billed per token or per generation.

Compute

On-demand GPU instances and dedicated clusters. Containers, VMs, or bare-metal — H100, H200, B200, B300.

Inference Engine

Production inference for the open-source model ecosystem. One API, one bill, one place to manage everything.

Models API

140+ open-source models behind a single OpenAI-compatible endpoint — Llama, DeepSeek, Qwen, GLM, Kimi, Mistral, and more.

SDKs

Official Python and TypeScript clients. Drop-in replacements for the OpenAI SDK — point the base URL at Runcrate and you’re done.

Per-token billing

Pay only for tokens generated. No idle GPUs, no minimums, no commitments.

Get started with the Inference Engine

Make your first API call in under 60 seconds.

Compute

When you need a specific GPU, full root access, or reserved capacity, the Compute side covers it — on the same account as inference.

GPU Instances

Containers, VMs, or bare-metal with dedicated NVIDIA GPUs. Root SSH access, hourly billing, deploy in 60 seconds.

Storage

Persistent volumes with a built-in file explorer. Data survives instance termination.

Dedicated Clusters

Reserved bare-metal clusters from 16 to 128+ nodes. H100, H200, B200, B300 with InfiniBand. 12–24 month terms.

Browse compute options

Pick a GPU, pick a region, deploy.

Which one do you need?

Inference EngineCompute
Best forBuilding AI features on open-source modelsTraining, fine-tuning, custom inference servers, reserved capacity
BillingPer token / per generationPer hour (instances) · Monthly (dedicated)
Setup time60 seconds60 seconds (instances) · 1–2 weeks (dedicated)
CommitmentNoneNone (instances) · 12–24 months (dedicated)
AccessSelf-serve · API keySelf-serve (instances) · Contact sales (dedicated)
GPUsManaged for youH100, H200, B200, B300, A100, L40S, RTX 4090
Not sure which fits? Start with the Inference Engine. Most teams never need anything else.