Runcrate Inference Engine

Inference engineered for when latency, cost, and control all matter.

Dedicated capacity on the open-source frontier — Llama, DeepSeek, Kimi K2, Qwen, GLM. Served on Arc for 2–3× more tokens per GPU and 40–60% off the aggregator bill.

Powered by Arc Inference Engine

2–3× tokens per GPU
40–60% off the aggregator bill
200+ open-source models
99.9% uptime SLA

WHY DEDICATED INFERENCE

Stop renting time on someone else's queue.

2–3×

Throughput

More tokens per GPU on Arc — our serving stack — than naive vLLM. You see a smaller bill; we see fewer GPUs doing your job.

99.9%

Reliability

Dedicated capacity sized to your traffic profile. No noisy neighbors, no 429s on burst, p99 latency floors in the contract.

40–60%

Cost

Off the typical aggregator rate card on committed-spend pricing. No PTU math, no burndown multipliers, no TPM calculator.

WHAT YOU GET

Inference that behaves like infrastructure.

Dedicated capacity

Reserved GPUs sized to your workload. No shared queues, no noisy neighbors, no throttling at peak.

Arc-optimized throughput

Our own serving stack. Speculative decoding, continuous batching, KV-cache reuse — 2–3× more tokens per GPU-second than vanilla vLLM.

Hands-on engineering

Our engineers tune your deployment for your latency, throughput, and cost targets. Not a shared support queue.

OpenAI-compatible API

Swap one base URL. Existing SDK code, prompts, and streaming logic keep working. Migration is a one-day project.
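
For a concrete picture of the swap, here's roughly what it looks like with the official openai Python SDK. The base URL and model ID below are illustrative placeholders, not published values:

```python
from openai import OpenAI

# Point the existing SDK at a new base URL; nothing else in the call changes.
# The endpoint and model ID here are hypothetical examples.
client = OpenAI(
    base_url="https://api.runcrate.example/v1",  # placeholder endpoint
    api_key="YOUR_RUNCRATE_KEY",
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Prompts, request shapes, and streaming handlers written against the OpenAI spec carry over unchanged.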

Committed-spend pricing

Monthly minimum, discounted rate, overage at the same rate. No PTU math. No TPM calculator.
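
As a sketch of the billing arithmetic, with made-up numbers standing in for an actual rate card:

```python
# Illustrative committed-spend billing. Every figure here is a placeholder.
MONTHLY_MINIMUM = 20_000.00  # committed spend, USD per month (example)
RATE_PER_M_TOK = 0.40        # discounted USD per 1M tokens (example)

def monthly_bill(million_tokens: float) -> float:
    """Bill the commit or actual usage, whichever is larger.
    Overage runs at the same discounted rate: no burndown multipliers."""
    return max(MONTHLY_MINIMUM, million_tokens * RATE_PER_M_TOK)

print(monthly_bill(30_000))  # under the commit -> 20000.0
print(monthly_bill(80_000))  # over the commit  -> 32000.0
```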

SLA and region pinning

99.9% uptime, p99 latency floors, US / EU / APAC pinning. SOC 2 Type II partners, HIPAA on request.

UNDER THE HOOD

Arc Inference Engine. More tokens per GPU.

Most inference providers run vanilla vLLM or SGLang and call it a day. Arc is our own serving stack — continuous batching tuned to production traffic, aggressive KV-cache reuse, speculative decoding, and quantization that doesn't sacrifice quality.

The result is 2–3× more tokens per GPU-second on the same hardware. You see it as lower per-token cost. We see it as fitting more of your workload onto fewer GPUs.
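
To make one of those techniques concrete, here is a toy sketch of greedy speculative decoding: a small draft model proposes several tokens, and the large target model verifies them all at once, so each expensive pass can yield more than one token. This is a generic illustration of the idea, not Arc's implementation:

```python
import random

random.seed(0)
VOCAB = "abcde"

def target_next(prefix: str) -> str:
    """Stand-in for the large model's greedy next token (deterministic)."""
    return VOCAB[sum(map(ord, prefix)) % len(VOCAB)]

def draft_next(prefix: str) -> str:
    """Stand-in for the cheap draft model: agrees ~80% of the time."""
    return target_next(prefix) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(prefix: str, k: int = 4) -> str:
    # 1. The draft model proposes k tokens autoregressively (cheap calls).
    proposed, ctx = [], prefix
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx += tok
    # 2. Verify against the target model: keep the longest agreeing prefix,
    #    then emit one corrected token. (A real engine scores all k
    #    positions in a single batched forward pass.)
    accepted = ""
    for tok in proposed:
        if target_next(prefix + accepted) != tok:
            break
        accepted += tok
    accepted += target_next(prefix + accepted)  # always-correct token
    return accepted  # 1 to k+1 tokens per expensive target pass

text = ""
while len(text) < 24:
    text += speculative_step(text)
print(text)  # identical to pure target-model decoding, in fewer passes
```

The higher the draft model's acceptance rate, the more tokens each target pass yields, which is where the tokens-per-GPU-second gain comes from.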

Get throughput numbers for your workload

Aggregate tok/s · 8× H100

Naive vLLM: 1,200–1,800
Optimized open-source: 2,500–3,500
Arc Inference Engine: 3,000–5,000

Measured on Kimi K2 (1T params, sparse MoE) at FP8 with realistic chat traffic. Per-workload numbers available on request.

VS THE AGGREGATORS

Built different from the Fireworks / Together / DeepInfra shape.

Capacity
Typical: Shared serverless pool
Runcrate: Dedicated, sized to your workload
Throughput
Typical: Naive vLLM / SGLang
Runcrate: Arc — 2–3× tok/s/GPU
Latency variance
Typical: Noisy-neighbor jitter
Runcrate: p99 floor in contract
Pricing model
Typical: Per-token list rates
Runcrate: Committed-spend at 40–60% off
Rate limits
Typical: TPM ceiling, 429s on burst
Runcrate: Burst-tolerant, no hard caps
Region pinning
Typical: Often unavailable on OSS
Runcrate: US / EU / APAC available
Custom / fine-tuned models
Typical: Limited or pay-extra
Runcrate: Bring your model, we serve it
Account team
Typical: Shared support queue
Runcrate: Named CSM + on-call eng

ON-DEMAND MODELS

The open-source frontier, deployed and tuned.

Meta

Llama 3.3 70B

Coding · autocomplete · chat

DeepSeek

DeepSeek R1

671B MoE · reasoning · open weights

Moonshot

Kimi K2

1T params · MoE · 256K context

Alibaba

Qwen3 235B

Tool use · long context

Zhipu

GLM 4.5

Multilingual frontier

OpenAI

gpt-oss 120B

Open weights · OpenAI

Bring a fine-tune of any open-source base. We serve it on Arc with the same throughput multiplier.

See the full catalog

TALK TO AN ENGINEER

Size your deployment.

Tell us your model, your spend, and your traffic profile. We respond within 24 hours with a rate card and a 7-day pilot plan that runs in parallel with your existing provider.

Response within 24 hours
7-day pilot, no commitment
Custom rate card for your workload

Prefer Slack?

Create a Slack Connect channel

FAQ

The questions every buyer asks.
