OLLAMA IN THE CLOUD

Your Ollama models, cloud-powered.

Love Ollama but limited by your local GPU? Run the same models in the cloud through Runcrate's API. Llama, Qwen, DeepSeek, Gemma, Mistral, and more, all accessible via an OpenAI-compatible endpoint. No VRAM constraints, no thermal throttling, no model downloads. Just swap the endpoint.

200+ Models
OpenAI-compatible API format
< 60s Setup time

QUICK START

Integrate in minutes.

from openai import OpenAI

# Same code you'd use with Ollama's OpenAI-compatible mode
client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the benefits of cloud inference."}
    ],
)
print(response.choices[0].message.content)
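If Runcrate forwards OpenAI's streaming semantics along with the rest of the compatibility surface (an assumption based on the claim above, not something stated on this page), the same client streams tokens with stream=True:

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    stream=True,  # assumed supported, as in the standard OpenAI SDK
)

for chunk in stream:
    # Each chunk carries a delta with the next slice of generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)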

AVAILABLE MODELS

Models you can use today.

Model | Vendor | Pricing | Notes
meta-llama/Llama-4-Scout-17B-16E-Instruct | Meta | Per-token | 17B MoE, 128K context
Qwen/Qwen3-32B | Alibaba | Per-token | 32B, strong multilingual
deepseek-ai/DeepSeek-V3 | DeepSeek | Per-token | 128K context, MoE
google/gemma-3-27b-it | Google | Per-token | 27B, instruction-tuned
mistralai/Mistral-Small-24B-Instruct-2501 | Mistral | Per-token | 24B, fast inference
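If the endpoint also exposes OpenAI's standard /v1/models route (a common feature of OpenAI-compatible APIs, assumed here rather than confirmed above), the full catalog can be listed programmatically instead of copied from this table:

# Sketch: enumerate the catalog, assuming the /v1/models route exists.
# `client` is the one configured in the Quick Start above.
for model in client.models.list():
    print(model.id)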

WHY RUNCRATE

Built for production.

No Local GPU Needed

Stop waiting for model downloads and fighting with CUDA versions. Your models run on H100s in the cloud, accessible from anywhere.

Same Models, Bigger Scale

The models you know from Ollama, running on enterprise-grade hardware. No VRAM limits, no thermal throttling, no 4-bit quantization compromises.

OpenAI-Compatible API

If your code works with Ollama's OpenAI compatibility mode, it works with Runcrate. Change the base URL and the API key, and you're done.
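As a concrete sketch of that swap, here is the only part of an Ollama-based script that has to change (Ollama's compatibility mode listens on localhost:11434 and accepts any placeholder key):

from openai import OpenAI

# Before: local Ollama in OpenAI compatibility mode
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# After: Runcrate, same client and same calls
client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

Every completion call below this line stays untouched.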

Pay Per Token

No idle GPU costs. You pay for the tokens you generate, not for a machine sitting idle. Ideal for bursty or unpredictable workloads.
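Because billing is per token, the usage block the OpenAI SDK returns on each completion is the number to meter against. A minimal sketch (the field names are the SDK's standard ones; prices are not listed on this page, so none appear here):

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Hello!"}],
)

# usage reports exactly what per-token billing is based on
print("prompt tokens:    ", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
print("total tokens:     ", response.usage.total_tokens)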

COMPARISON

Runcrate vs Local Ollama.

Feature | Runcrate | Local Ollama
GPU hardware | H100 / H200 cloud | Your local GPU
Model size limit | No VRAM limit | Limited by your GPU
Setup | API key + one line | Download + install
Cost model | Per-token, no idle cost | Electricity + GPU purchase
Available models | 200+ via API | Community-maintained

Run Ollama models in the cloud.