Runcrate Inference Engine
Dedicated capacity on the open-source frontier — Llama, DeepSeek, Kimi K2, Qwen, GLM. Served on Arc for 2–3× more tokens per GPU and 40–60% off the aggregator bill.
● Powered by Arc Inference Engine
WHY DEDICATED INFERENCE
Throughput
2–3× more tokens per GPU on Arc, our serving stack, than vanilla vLLM. You see a smaller bill; we see fewer GPUs doing your job.
Reliability
Dedicated capacity sized to your traffic profile. No noisy neighbors, no 429s on burst, p99 latency floors in the contract.
Cost
40–60% off the typical aggregator rate card on committed-spend pricing. No PTU math, no burndown multipliers, no TPM calculator.
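Here's a back-of-envelope sketch of how committed-spend pricing shakes out. Every number below is a placeholder for illustration, not a quote:

```python
# Hypothetical committed-spend math -- illustrative numbers only,
# not a Runcrate rate card.
list_rate = 2.50          # aggregator list price, $ per 1M output tokens
discount = 0.50           # midpoint of the quoted 40-60% off
committed_tokens = 4_000  # monthly commit, in millions of tokens
actual_tokens = 5_000     # what you actually served this month

rate = list_rate * (1 - discount)             # $1.25 per 1M tokens
minimum = committed_tokens * rate             # monthly minimum: $5,000
overage = max(0, actual_tokens - committed_tokens) * rate  # same rate, no multiplier
bill = minimum + overage

print(f"monthly bill: ${bill:,.2f}")          # $6,250.00 vs $12,500.00 at list
```

The overage line is the whole point: going past your commit costs the same discounted rate, not a penalty multiplier.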
WHAT YOU GET
Reserved GPUs sized to your workload. No shared queues, no noisy neighbors, no throttling at peak.
Our own serving stack. Speculative decoding, continuous batching, KV-cache reuse — 2–3× more tokens per GPU-second than vanilla vLLM.
Our engineers tune your deployment for your latency, throughput, and cost targets. Not a shared support queue.
Swap one base URL. Existing SDK code, prompts, and streaming logic keep working (see the sketch after this list). Migrations are a one-day project.
Monthly minimum, discounted rate, overage at the same rate. No PTU math. No TPM calculator.
99.9% uptime, p99 latency floors, US / EU / APAC pinning. SOC 2 Type II partners, HIPAA on request.
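Here's what the base-URL swap looks like in practice, assuming an OpenAI-compatible endpoint. The endpoint and model slug below are illustrative placeholders; yours arrive with your rate card:

```python
from openai import OpenAI

# Same SDK, same calling code -- only the base URL and key change.
# Endpoint and model name are placeholders, not real slugs.
client = OpenAI(
    base_url="https://api.runcrate.example/v1",
    api_key="RUNCRATE_API_KEY",
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize this ticket thread."}],
    stream=True,  # existing streaming logic keeps working
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```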
UNDER THE HOOD
Most inference providers run vanilla vLLM or SGLang and call it a day. Arc is our own serving stack — continuous batching tuned to production traffic, aggressive KV-cache reuse, speculative decoding, and quantization that doesn't sacrifice quality.
The result is 2–3× more tokens per GPU-second on the same hardware. You see it as lower per-token cost. We see it as fitting more of your workload onto fewer GPUs.
[Chart: aggregate tok/s on 8× H100]
Measured on Kimi K2 (1T params, sparse MoE) at FP8 with realistic chat traffic. Per-workload numbers available on request.
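To make the "fewer GPUs" framing concrete, here's a toy capacity calculation. The throughput figures are illustrative placeholders, not benchmark results:

```python
import math

# Illustrative sizing: how a per-GPU throughput multiplier turns into
# a smaller fleet. Numbers are hypothetical, not measured figures.
target_tok_s = 60_000           # your peak aggregate demand, tokens/sec
baseline_tok_s_per_gpu = 1_500  # hypothetical vanilla-vLLM throughput
arc_multiplier = 2.5            # midpoint of the quoted 2-3x

gpus_baseline = math.ceil(target_tok_s / baseline_tok_s_per_gpu)                # 40
gpus_arc = math.ceil(target_tok_s / (baseline_tok_s_per_gpu * arc_multiplier))  # 16

print(f"{gpus_baseline} GPUs at baseline vs {gpus_arc} on Arc")
```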
VS THE AGGREGATORS
| Feature | Typical aggregator | Runcrate Inference Engine |
|---|---|---|
| Capacity | Shared serverless pool | Dedicated, sized to your workload |
| Throughput | Naive vLLM / SGLang | Arc — 2–3× tok/s/GPU |
| Latency variance | Noisy-neighbor jitter | p99 floor in contract |
| Pricing model | Per-token list rates | Committed-spend at 40–60% off |
| Rate limits | TPM ceiling, 429s on burst | Burst-tolerant, no hard caps |
| Region pinning | Often unavailable on OSS | US / EU / APAC available |
| Custom / fine-tuned models | Limited or pay-extra | Bring your model, we serve it |
| Account team | Shared support queue | Named CSM + on-call eng |
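The rate-limits row is the one you feel in code. On a shared serverless pool, every caller ends up carrying retry boilerplate like the sketch below (client setup and model slug are illustrative); dedicated capacity sized to your burst profile is what lets you delete it:

```python
import random
import time

import openai

client = openai.OpenAI()  # pointed at a shared aggregator pool

def chat_with_backoff(messages, max_retries=5):
    """Exponential backoff with jitter on 429s -- the boilerplate a
    shared TPM ceiling forces into every client."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-instruct",  # placeholder slug
                messages=messages,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("still rate-limited after retries")
```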
ON-DEMAND MODELS

Meta
Llama 3.3 70B
Coding · autocomplete · chat

DeepSeek
DeepSeek R1
671B MoE · reasoning · open weights

Moonshot
Kimi K2
1T params · MoE · 256K context

Alibaba
Qwen3 235B
Tool use · long context

Zhipu
GLM 4.5
Multilingual frontier

OpenAI
gpt-oss 120B
Open weights · reasoning
Bring a fine-tune of any open-source base. We serve it on Arc with the same throughput multiplier.
See the full catalog
TALK TO AN ENGINEER
Tell us your model, your spend, and your traffic profile. We respond within 24 hours with a rate card and a 7-day pilot plan running parallel to your existing provider.
Prefer Slack?
Create a Slack Connect channel
FAQ