Runcrate Inference Engine
Dedicated capacity on the open-source frontier — Llama, DeepSeek, Kimi K2, Qwen, GLM. Served on Arc for 2–3× more tokens per GPU and 40–60% off the aggregator bill.
● Powered by Arc Inference Engine
WHY DEDICATED INFERENCE
Throughput
2–3× more tokens per GPU on Arc, our serving stack, than vanilla vLLM. You see a smaller bill; we see fewer GPUs doing your job.
Reliability
Dedicated capacity sized to your traffic profile. No noisy neighbors, no 429s on burst, p99 latency floors in the contract.
Cost
40–60% off the typical aggregator rate card on committed-spend pricing. No PTU math, no burndown multipliers, no TPM calculator.
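Here's a back-of-envelope sketch of how committed-spend pricing shakes out. Every number below is a placeholder for illustration, not a quote:

```python
# Hypothetical committed-spend math -- illustrative numbers only,
# not a Runcrate rate card.
list_rate = 2.50          # aggregator list price, $ per 1M output tokens
discount = 0.50           # midpoint of the quoted 40-60% off
committed_tokens = 4_000  # monthly commit, in millions of tokens
actual_tokens = 5_000     # what you actually served this month

rate = list_rate * (1 - discount)             # $1.25 per 1M tokens
minimum = committed_tokens * rate             # monthly minimum: $5,000
overage = max(0, actual_tokens - committed_tokens) * rate  # same rate, no multiplier
bill = minimum + overage

print(f"monthly bill: ${bill:,.2f}")          # $6,250.00 vs $12,500.00 at list
```

The overage line is the whole point: going past your commit costs the same discounted rate, not a penalty multiplier.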
WHAT YOU GET
Reserved GPUs sized to your workload. No shared queues, no noisy neighbors, no throttling at peak.
Our own serving stack. Speculative decoding, continuous batching, KV-cache reuse — 2–3× more tokens per GPU-second than vanilla vLLM.
Our engineers tune your deployment for your latency, throughput, and cost targets. Not a shared support queue.
Swap one base URL. Existing SDK code, prompts, and streaming logic keep working (see the sketch after this list). Migrations are a one-day project.
Monthly minimum, discounted rate, overage at the same rate. No PTU math. No TPM calculator.
99.9% uptime, p99 latency floors, US / EU / APAC pinning. SOC 2 Type II partners, HIPAA on request.
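Here's what the base-URL swap looks like in practice, assuming an OpenAI-compatible endpoint. The endpoint and model slug below are illustrative placeholders; yours arrive with your rate card:

```python
from openai import OpenAI

# Same SDK, same calling code -- only the base URL and key change.
# Endpoint and model name are placeholders, not real slugs.
client = OpenAI(
    base_url="https://api.runcrate.example/v1",
    api_key="RUNCRATE_API_KEY",
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize this ticket thread."}],
    stream=True,  # existing streaming logic keeps working
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```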
UNDER THE HOOD
Most inference providers run vanilla vLLM or SGLang and call it a day. Arc is our own serving stack — continuous batching tuned to production traffic, aggressive KV-cache reuse, speculative decoding, and quantization that doesn't sacrifice quality.
The result is 2–3× more tokens per GPU-second on the same hardware. You see it as lower per-token cost. We see it as fitting more of your workload onto fewer GPUs.
[Chart: aggregate tok/s on 8× H100]
Measured on Kimi K2 (1T params, sparse MoE) at FP8 with realistic chat traffic. Per-workload numbers available on request.
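To make the "fewer GPUs" framing concrete, here's a toy capacity calculation. The throughput figures are illustrative placeholders, not benchmark results:

```python
import math

# Illustrative sizing: how a per-GPU throughput multiplier turns into
# a smaller fleet. Numbers are hypothetical, not measured figures.
target_tok_s = 60_000           # your peak aggregate demand, tokens/sec
baseline_tok_s_per_gpu = 1_500  # hypothetical vanilla-vLLM throughput
arc_multiplier = 2.5            # midpoint of the quoted 2-3x

gpus_baseline = math.ceil(target_tok_s / baseline_tok_s_per_gpu)                # 40
gpus_arc = math.ceil(target_tok_s / (baseline_tok_s_per_gpu * arc_multiplier))  # 16

print(f"{gpus_baseline} GPUs at baseline vs {gpus_arc} on Arc")
```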
VS THE AGGREGATORS
| Feature | Typical aggregator | Runcrate Inference Engine |
|---|---|---|
| Capacity | Shared serverless pool | Dedicated, sized to your workload |
| Throughput | Naive vLLM / SGLang | Arc — 2–3× tok/s/GPU |
| Latency variance | Noisy-neighbor jitter | p99 floor in contract |
| Pricing model | Per-token list rates | Committed-spend at 40–60% off |
| Rate limits | TPM ceiling, 429s on burst | Burst-tolerant, no hard caps |
| Region pinning | Often unavailable on OSS | US / EU / APAC available |
| Custom / fine-tuned models | Limited or pay-extra | Bring your model, we serve it |
| Account team | Shared support queue | Named CSM + on-call eng |
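The rate-limits row is the one you feel in code. On a shared serverless pool, every caller ends up carrying retry boilerplate like the sketch below (client setup and model slug are illustrative); dedicated capacity sized to your burst profile is what lets you delete it:

```python
import random
import time

import openai

client = openai.OpenAI()  # pointed at a shared aggregator pool

def chat_with_backoff(messages, max_retries=5):
    """Exponential backoff with jitter on 429s -- the boilerplate a
    shared TPM ceiling forces into every client."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-instruct",  # placeholder slug
                messages=messages,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after retries")
```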
ON-DEMAND MODELS

Meta
Llama 3.3 70B
Coding · autocomplete · chat

DeepSeek
DeepSeek R1
671B MoE · reasoning · open weights

Moonshot
Kimi K2
1T params · MoE · 256K context

Alibaba
Qwen3 235B
Tool use · long context

Zhipu
GLM 4.5
Multilingual frontier

OpenAI
gpt-oss 120B
Open weights · reasoning
Bring a fine-tune of any open-source base. We serve it on Arc with the same throughput multiplier.
See the full catalog
TALK TO AN ENGINEER
Tell us your model, your spend, and your traffic profile. We respond within 24 hours with a rate card and a 7-day pilot plan running parallel to your existing provider.
Prefer Slack?
Create a Slack Connect channel
FAQ