Solutions · Inference

Run any model in production.

Access 200+ models through a single OpenAI-compatible API -- Claude 4 Sonnet, DeepSeek-V3.2, Llama 4 Scout, Gemini 2.5 Flash, and more. Or self-host open models on dedicated GPUs with vLLM or TGI for full control over latency and throughput.
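
In practice a call is a few lines with the OpenAI Python SDK. Here is a minimal sketch; the base URL and model ID are placeholders, not confirmed values, so check the docs for the real endpoint:

```python
# Minimal sketch of calling the Inference API through the OpenAI SDK.
# Base URL, API key, and model ID below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.example/v1",  # hypothetical endpoint
    api_key="YOUR_RUNCRATE_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v3.2",  # illustrative model ID; see the model catalog
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```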

200+ models available
<100ms P50 latency
Per-token billing

Two Approaches

API or self-hosted. Your call.

Inference API

One OpenAI-compatible endpoint for 200+ models. Switch between Claude 4, Gemini 2.5, DeepSeek-V3.2, Llama 4, Qwen3, and more with a single parameter change.
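
A sketch of that switch, running the same request against several models (the model IDs and base URL are illustrative):

```python
# Same client, same call shape -- only the model parameter changes.
# Model IDs are illustrative; the catalog defines the exact strings.
from openai import OpenAI

client = OpenAI(base_url="https://api.runcrate.example/v1", api_key="YOUR_KEY")

for model_id in ["claude-4-sonnet", "gemini-2.5-flash", "llama-4-scout"]:
    reply = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Name one prime number."}],
    )
    print(model_id, "->", reply.choices[0].message.content)
```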

Self-hosted with vLLM / TGI

Deploy open models on dedicated H100s or H200s. Full control over batching, quantization, and KV cache configuration for maximum throughput.
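
As a sketch of the knobs involved, here is vLLM's Python API with a tensor-parallel, memory-tuned configuration. The parameters shown are real vLLM options, but the values and model ID are illustrative starting points, not recommended settings; quantization and KV cache behavior are tuned through the same constructor.

```python
# Sketch: serving an open model with vLLM's Python API on a dedicated instance.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # any open HF model
    tensor_parallel_size=2,        # shard weights across two GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim
    max_model_len=8192,            # cap context length to bound KV cache size
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain continuous batching briefly."], params)
print(outputs[0].outputs[0].text)
```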

Auto-scaling

Scale from zero to hundreds of replicas based on traffic. No manual intervention -- handles traffic spikes automatically, scales down when idle.

Per-token billing

API usage billed per input/output token. Dedicated instances billed per minute. No idle costs on API, no minimums on instances.
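
For example, here is how per-token pricing composes into a request cost (the rates below are hypothetical placeholders, not actual pricing):

```python
# Back-of-envelope cost for one request under per-token billing.
# Rates are hypothetical placeholders, not Runcrate's actual pricing.
INPUT_RATE = 0.50 / 1_000_000   # $ per input token
OUTPUT_RATE = 1.50 / 1_000_000  # $ per output token

prompt_tokens, completion_tokens = 1_200, 350
cost = prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE
print(f"${cost:.6f}")  # $0.001125
```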

Latency optimization

Speculative decoding, continuous batching, and tensor parallelism on dedicated instances. API models are served on optimized infrastructure for sub-100ms P50 latency.
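
Time-to-first-token is the latency users actually feel, and a streaming request makes it easy to measure against your own traffic. A sketch, again with a placeholder endpoint and model:

```python
# Measure time-to-first-token by streaming and timing the first content
# chunk. Base URL, key, and model ID are illustrative placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.runcrate.example/v1", api_key="YOUR_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-4-scout",
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"first token after {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```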

Zero egress fees

No charges for data transfer. Stream responses to your users without surprise bandwidth costs eating into your margins.
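
One way to take advantage of that is to relay the token stream straight to the browser. A minimal sketch using FastAPI and server-sent events; the framework choice and all identifiers here are our own illustration, not anything prescribed by the platform:

```python
# Sketch: relay a model's token stream to end users as server-sent events.
# FastAPI is an arbitrary choice; any streaming-capable server works.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

client = OpenAI(base_url="https://api.runcrate.example/v1", api_key="YOUR_KEY")
app = FastAPI()

@app.get("/chat")
def chat(q: str):
    def tokens():
        stream = client.chat.completions.create(
            model="llama-4-scout",  # illustrative model ID
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if delta:
                yield f"data: {delta}\n\n"  # SSE framing
    return StreamingResponse(tokens(), media_type="text/event-stream")
```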

Available Models

Every frontier model.
One endpoint.

Text, code, image, and video generation models available via API. New models added within days of release.

Claude 4 Sonnet (Anthropic) -- reasoning, coding, analysis
DeepSeek-V3.2 (DeepSeek) -- code generation, math, reasoning
Llama 4 Scout (Meta) -- general purpose, multilingual
Gemini 2.5 Flash (Google) -- fast inference, multimodal
Qwen3 (Alibaba) -- multilingual, long context
FLUX.2 / Sora 2 / Veo 3.0 (image & video) -- visual generation
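
If the endpoint mirrors the OpenAI /v1/models route, the live catalog is queryable from the SDK. A sketch, assuming that route exists (the docs define the actual API surface):

```python
# List the model catalog through the OpenAI-compatible /v1/models route.
# Assumes the endpoint implements this route; base URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="https://api.runcrate.example/v1", api_key="YOUR_KEY")
for model in client.models.list():
    print(model.id)
```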

How It Works

Three steps to production inference.

01

Pick your approach

Use the Inference API for instant access to 200+ models, or deploy a dedicated instance with vLLM/TGI for self-hosted control.

02

Integrate in minutes

Our API is OpenAI-compatible -- swap one base URL and start calling. For dedicated instances, SSH in or use the browser IDE to configure your serving stack.
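
Because the OpenAI SDK also reads its endpoint from environment variables, the swap can even be a zero-line code change. A sketch, with placeholder values:

```python
# Zero-diff switch: the OpenAI SDK reads OPENAI_BASE_URL and OPENAI_API_KEY
# from the environment, so existing code needs no edits at all.
#   export OPENAI_BASE_URL=https://api.runcrate.example/v1   (placeholder)
#   export OPENAI_API_KEY=your-runcrate-key
from openai import OpenAI

client = OpenAI()  # picks up base URL and key from the environment
reply = client.chat.completions.create(
    model="qwen3",  # illustrative model ID
    messages=[{"role": "user", "content": "Hello from the new endpoint."}],
)
print(reply.choices[0].message.content)
```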

03

Scale with traffic

The API scales automatically. Dedicated instances can be cloned and load-balanced. Monitor latency, token usage, and costs from the dashboard.

Start running models on Runcrate.

Get an API key and call your first model in under a minute. 200+ models, per-token pricing, no credit card required to start.

Per-token billing -- pay only for what you generate
OpenAI-compatible -- swap one URL, keep your code
Cancel anytime -- no lock-in, no penalties