SERVERLESS INFERENCE

GPU inference, zero infrastructure.

Stop managing GPU instances, CUDA versions, model loading, and autoscaling. Runcrate's serverless inference runs 200+ AI models on dedicated hardware with per-usage billing. No cold starts, no idle costs, no infrastructure to maintain. Send requests, get results, pay for what you use.

Models: 200+
Cold starts: None
Modalities: Chat, image, video, audio

AVAILABLE MODELS

Models you can run today.

deepseek-ai/DeepSeek-V3
DeepSeek · Per-token
128K context, MoE architecture

meta-llama/Llama-4-Scout-17B-16E-Instruct
Meta · Per-token
17B MoE, 128K context

black-forest-labs/FLUX.1-dev
Black Forest Labs · Per-image
12B, photorealistic images

openai/whisper-large-v3
OpenAI · Per-minute
Speech-to-text, 100+ languages
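
Because the API is OpenAI-compatible (see the quickstart below), the full catalog can likely be queried programmatically rather than browsed. A minimal sketch, assuming Runcrate exposes the standard /v1/models listing route; the base URL and key format are taken from the quickstart:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

# List every model available to the account. Assumes the standard
# OpenAI-compatible /v1/models route is exposed by Runcrate.
for model in client.models.list():
    print(model.id)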

WHY RUNCRATE

Built for production.

Zero Infrastructure

No GPU provisioning, no Docker containers, no autoscaling policies, no CUDA debugging. Send a request, get a result. Runcrate handles everything else.

No Cold Starts

Models are always warm and ready. First request is as fast as the thousandth. No waiting for model loading, container spin-up, or weight downloads.

Per-Usage Billing

Pay per token, per image, per second of audio, or per second of video. No idle GPU costs, no monthly minimums, no seat licenses. Credits never expire.
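
Because responses follow the OpenAI schema, each chat completion reports its own token counts, so per-request cost can be estimated in code. A minimal sketch; the prices below are placeholders for illustration, not Runcrate's actual rates:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

# Placeholder USD prices per token; check the pricing page for real rates.
INPUT_PRICE = 0.30 / 1_000_000
OUTPUT_PRICE = 1.20 / 1_000_000

usage = chat.usage  # token counts come back with every completion
cost = usage.prompt_tokens * INPUT_PRICE + usage.completion_tokens * OUTPUT_PRICE
print(f"{usage.total_tokens} tokens, ~${cost:.6f}")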

Multi-Modal

Chat, image generation, video generation, speech-to-text, text-to-speech, embeddings, and vision, all through one API and one billing account.
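
The quickstart below covers chat, image generation, and transcription; embeddings and text-to-speech go through the same client via the standard OpenAI-compatible routes. A sketch with illustrative model names, not confirmed catalog entries:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

# Embeddings (model name is a placeholder; pick one from the catalog)
emb = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="Serverless GPU inference",
)
print(len(emb.data[0].embedding))  # vector dimensionality

# Text-to-speech (model and voice are placeholders)
speech = client.audio.speech.create(
    model="hexgrad/Kokoro-82M",
    voice="alloy",
    input="Hello from Runcrate.",
)
speech.write_to_file("hello.mp3")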

COMPARISON

Runcrate vs Self-Hosted GPU.

               Runcrate              Self-Hosted GPU
Setup time     < 60 seconds          Hours to days
Cold starts    None                  Model loading time
Scaling        Automatic             Manual autoscaling
Idle cost      $0                    Full GPU cost 24/7
Maintenance    Zero                  CUDA, drivers, monitoring

GET STARTED

Try it now.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

# Chat completion
chat = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
print(chat.choices[0].message.content)

# Image generation
image = client.images.generate(
    model="black-forest-labs/FLUX.1-dev",
    prompt="A futuristic cityscape",
)
print(image.data[0].url)  # URL (or b64_json) per the OpenAI images schema

# Speech-to-text
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=f,
    )
print(transcript.text)


Start with serverless inference.