Serve any open-source LLM behind an OpenAI-compatible endpoint on your own GPU. This guide uses vLLM, the production standard for LLM serving — the same engine Stripe uses to process 50M+ daily API calls.

What you’ll build

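A self-hosted inference API that serves Llama 3.1 70B (or any model) on an A100/H100, accessible from anywhere via a public IP. You can point your existing OpenAI SDK code at it. Note that the FP16 weights of a 70B model take about 140 GB, which is more than a single 80 GB GPU holds: on one A100/H100, substitute a smaller or quantized checkpoint for the model name used below, or follow the multi-GPU setup later in this guide.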

Why vLLM

vLLM uses PagedAttention to manage GPU memory efficiently: on an 80GB H100 running a 7B FP16 model, this can mean serving 100+ concurrent requests instead of ~30. The V1 engine (default since v0.8.0) enables chunked prefill by default, which splits long prompts into pieces and schedules them alongside decode steps so they do not block in-flight requests.
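
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head size 128, roughly a 7B-class layout), the 40 GiB cache budget, the 2048-token reservation, and the 500-token average request are illustrative assumptions, not measurements; the 16-token block size mirrors vLLM's usual default. It compares reserving a full-length contiguous cache per request with handing out fixed-size blocks on demand, which is the idea behind PagedAttention.

```python
import math

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def paged_slots(seq_len: int, block_size: int = 16) -> int:
    """Token slots reserved when memory is handed out in fixed-size blocks."""
    return math.ceil(seq_len / block_size) * block_size

def max_sequences(budget_bytes: int, slots_per_seq: int, bytes_per_token: int) -> int:
    """How many sequences fit in the cache budget at a given reservation size."""
    return budget_bytes // (slots_per_seq * bytes_per_token)

# Assumed 7B-class layout at FP16 (2 bytes per value).
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2)
budget = 40 * 1024**3  # assumed memory left for the KV cache after loading weights

contiguous = max_sequences(budget, 2048, per_token)         # reserve max length up front
paged = max_sequences(budget, paged_slots(500), per_token)  # reserve only what is used

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Contiguous reservation: {contiguous} sequences")
print(f"Block-based allocation: {paged} sequences")
```

Under these assumptions the block-based scheme fits four times as many sequences, because short requests no longer hold memory sized for the longest possible one. Real gains depend on the actual length distribution of your traffic.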

Option A: CLI

1. Deploy the instance

runcrate instances create --name llm-server --gpu A100
Wait for it to deploy:
runcrate instances status llm-server

2. Install vLLM and start the server

runcrate ssh llm-server -- "pip install vllm"

runcrate ssh llm-server -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
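
Because the server starts in the background and a large model can take several minutes to download and load, the API will refuse connections at first. The following is a minimal sketch of a readiness check, assuming the vLLM OpenAI-compatible server answers GET /health with HTTP 200 once it is ready. The names wait_until_ready and http_ok, and the timeout defaults, are hypothetical choices for illustration.

```python
import time
import urllib.request
from typing import Callable

def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers refused connections, timeouts, and HTTP error statuses
        return False

def wait_until_ready(
    base_url: str,
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_until_ready("http://<INSTANCE_IP>:8000") returns True once the server responds. If it returns False, check /root/vllm.log on the instance for load errors.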

3. Test it

# Get the instance IP
runcrate instances info llm-server

# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 256
  }'

4. Point your app at it

from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
)
print(response.choices[0].message.content)
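
For interactive use you may prefer to print tokens as they are generated. The sketch below assumes the standard stream=True option of the OpenAI Python SDK, where each chunk carries an incremental delta.content. The helper names iter_text and demo are hypothetical, and <INSTANCE_IP> is the same placeholder as above.

```python
from typing import Iterable, Iterator

def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragments from a stream of chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # skip role-only or empty deltas
            yield piece

def demo() -> None:
    """Stream one reply from the self-hosted server and print it as it arrives."""
    from openai import OpenAI  # imported here so iter_text has no SDK dependency

    client = OpenAI(base_url="http://<INSTANCE_IP>:8000/v1", api_key="not-needed")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
        stream=True,
    )
    for piece in iter_text(stream):
        print(piece, end="", flush=True)
    print()
```

Calling demo() after replacing the placeholder prints the reply incrementally. Keeping iter_text separate makes the chunk handling easy to test without a running server.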

Option B: Python SDK

from runcrate import Runcrate
import time

client = Runcrate(api_key="rc_live_...")

# Deploy an A100
instance = client.instances.create(
    name="llm-server",
    gpu_type="A100",
    gpu_count=1,
    startup_commands=[
        "pip install vllm",
        "nohup python -m vllm.entrypoints.openai.api_server "
        "--model meta-llama/Llama-3.1-70B-Instruct "
        "--tensor-parallel-size 1 "
        "--max-model-len 8192 "
        "--port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &",
    ],
)

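# Wait for deployment (give up after 15 minutes)
deadline = time.monotonic() + 15 * 60
while time.monotonic() < deadline:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        # The instance is up; vLLM may still be downloading and loading the model
        print(f"Instance deployed; vLLM will listen at http://{status.ip}:8000")
        break
    time.sleep(10)
else:
    raise TimeoutError("Instance did not reach 'deployed' within 15 minutes")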

Option C: MCP (via Claude Code / Cursor)

“Deploy an A100 instance called llm-server. Once it’s ready, install vLLM and start serving Llama 3.1 70B on port 8000. Give me the IP when it’s up.”
Your AI assistant will:
  1. Call create_instance with name: "llm-server" and gpu: "A100"
  2. Poll instance_status until deployed
  3. Call ssh_execute to install vLLM and start the server
  4. Return the IP from get_instance

Multi-GPU serving

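For models that do not fit on one GPU (70B at FP16 needs about 140 GB for weights alone), use tensor parallelism across multiple GPUs. Llama 3.1 405B needs roughly 405 GB even at FP8, which exceeds 4 × 80 GB, so plan for eight 80 GB GPUs for that model. The example below serves the 70B model at FP16 on four H100s:
runcrate instances create --name llm-server-4gpu --gpu H100 --gpu-count 4

runcrate ssh llm-server-4gpu -- "pip install vllm"

runcrate ssh llm-server-4gpu -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &"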
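
A rough way to choose the GPU count is to compare the size of the weights with the usable memory per GPU. The sketch below is arithmetic only: it assumes 80 GB GPUs and a 0.9 memory fraction (vLLM's default for --gpu-memory-utilization), and it rounds up to a power of two because the tensor-parallel size has to divide the model's attention heads evenly. It ignores KV cache and activation memory, so treat the result as a lower bound. The function names are hypothetical.

```python
import math

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billion * bytes_per_param

def min_tensor_parallel(params_billion: float, bytes_per_param: float,
                        gpu_gb: float = 80.0, utilization: float = 0.9) -> int:
    """Smallest power-of-two GPU count whose usable memory holds the weights."""
    usable_gb = gpu_gb * utilization
    needed = max(1, math.ceil(weight_gb(params_billion, bytes_per_param) / usable_gb))
    return 1 << (needed - 1).bit_length()  # round up to the next power of two

print(min_tensor_parallel(8, 2))    # 8B at FP16   -> 1
print(min_tensor_parallel(70, 2))   # 70B at FP16  -> 2
print(min_tensor_parallel(405, 1))  # 405B at FP8  -> 8
```

GPUs beyond this minimum are not wasted: the extra memory goes to the KV cache, which allows longer contexts and more concurrent requests.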

Monitoring

# Check GPU memory and utilization
runcrate ssh llm-server -- nvidia-smi

# Check vLLM logs
runcrate ssh llm-server -- "tail -50 /root/vllm.log"

# Check running and waiting request counts (vLLM metric names use a "vllm:" prefix)
runcrate ssh llm-server -- "curl -s localhost:8000/metrics | grep 'vllm:num_requests'"
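
If you want these numbers in a script rather than a terminal, the Prometheus text returned by /metrics is straightforward to parse. This is a minimal sketch assuming gauges named vllm:num_requests_running and vllm:num_requests_waiting; exact metric names and labels can differ between vLLM versions, and the sample text is illustrative, not captured from a real server. The function name request_gauges is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as: vllm:num_requests_running{model_name="m"} 3.0
_GAUGE = re.compile(r'^(vllm:num_requests_\w+)(?:\{[^}]*\})?\s+(\S+)')

def request_gauges(metrics_text: str) -> Dict[str, float]:
    """Extract vllm:num_requests_* gauges from Prometheus exposition text."""
    gauges: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _GAUGE.match(line)
        if match:
            gauges[match.group(1)] = float(match.group(2))
    return gauges

# Illustrative sample, not real server output.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 3.0
vllm:num_requests_waiting{model_name="example-model"} 1.0
"""
print(request_gauges(sample))
```

A persistently non-zero waiting count suggests the server is saturated and may need a shorter --max-model-len or more GPUs.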

Cleanup

runcrate instances delete llm-server