# Deploy a vLLM Inference Server

> Serve open-source LLMs behind an OpenAI-compatible API on a dedicated GPU with vLLM.

Serve any open-source LLM behind an OpenAI-compatible endpoint on your own GPU. This guide uses vLLM, the production standard for LLM serving — the same engine Stripe uses to process 50M+ daily API calls.

## What you'll build

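A self-hosted inference API that serves Llama 3.1 70B (or any model) on an A100/H100, reachable from anywhere via the instance's public IP. Because the API is OpenAI-compatible, you can point your existing OpenAI SDK code at it. Note that Llama 3.1 70B in FP16 needs roughly 140 GB for weights alone, which exceeds a single 80 GB GPU: on one GPU, use a quantized checkpoint or a smaller model, or see Multi-GPU serving below.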

## Why vLLM

vLLM uses PagedAttention to manage KV-cache memory in small blocks instead of reserving a full context window for every request. On an 80GB H100 running a 7B FP16 model, this can mean serving 100+ concurrent requests instead of \~30, depending on sequence lengths. The V1 engine (default since v0.8.0) enables chunked prefill by default, which keeps long prompts from blocking in-flight requests.
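
To make the memory argument concrete, here is a back-of-the-envelope sketch of how KV-cache size bounds concurrency. The model dimensions (32 layers, 32 KV heads, head size 128, FP16) are assumptions that match a typical 7B-class architecture, and the 90% usable-memory fraction and the 2,048 and 512 token lengths are illustrative, not measurements.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent(cache_bytes: float, tokens_reserved: int, bytes_per_token: int) -> int:
    """How many requests fit if each one reserves the given number of token slots."""
    return int(cache_bytes // (tokens_reserved * bytes_per_token))


# Assumed 7B-class model in FP16 (hypothetical but typical dimensions).
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2)

gpu_bytes = 80e9 * 0.90      # assume 90% of an 80 GB GPU is usable
weight_bytes = 7e9 * 2       # 7B parameters at 2 bytes each
cache_bytes = gpu_bytes - weight_bytes

# Contiguous allocation reserves the full context window per request up front.
preallocated = max_concurrent(cache_bytes, 2048, per_token)
# Paged allocation only holds blocks for tokens that exist so far
# (assumed average of 512 tokens per in-flight request).
paged = max_concurrent(cache_bytes, 512, per_token)

print(f"{per_token / 2**20:.2f} MiB of KV cache per token")
print(f"preallocated: {preallocated} requests, paged: {paged} requests")
```

Under these assumptions the paged figure is about four times the preallocated one, simply because requests hold a quarter of the slots they would otherwise reserve. Real gains depend on your traffic's actual sequence lengths.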

***

## Option A: CLI

### 1. Deploy the instance

```bash theme={"theme":"github-dark"}
runcrate instances create --name llm-server --gpu A100
```

Wait for it to deploy:

```bash theme={"theme":"github-dark"}
runcrate instances status llm-server
```

### 2. Install vLLM and start the server

```bash theme={"theme":"github-dark"}
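# Install vLLM on the instance
runcrate ssh llm-server -- "pip install vllm"

# Start the OpenAI-compatible server in the background; output goes to /root/vllm.log.
# meta-llama checkpoints are gated on Hugging Face, so the instance needs a token
# with access (for example via the HF_TOKEN environment variable).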
runcrate ssh llm-server -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

### 3. Test it

```bash theme={"theme":"github-dark"}
# Get the instance IP
runcrate instances info llm-server

# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 256
  }'
```

### 4. Point your app at it

```python theme={"theme":"github-dark"}
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```

***

## Option B: Python SDK

```python theme={"theme":"github-dark"}
from runcrate import Runcrate
import time

client = Runcrate(api_key="rc_live_...")

# Deploy an A100
instance = client.instances.create(
    name="llm-server",
    gpu_type="A100",
    gpu_count=1,
    startup_commands=[
        "pip install vllm",
        "nohup python -m vllm.entrypoints.openai.api_server "
        "--model meta-llama/Llama-3.1-70B-Instruct "
        "--tensor-parallel-size 1 "
        "--max-model-len 8192 "
        "--port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &",
    ],
)

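# Wait for the instance to deploy. "deployed" only means the machine is up:
# vLLM still has to install and load the model before port 8000 answers.
while True:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        print(f"Instance deployed; vLLM will serve at http://{status.ip}:8000 once the model loads")
        break
    time.sleep(10)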
```
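
An instance that reports as deployed may still be installing vLLM or downloading weights, so requests sent right away can be refused. The sketch below polls vLLM's `/health` endpoint using only the standard library. The function names and the 30-minute default deadline are assumptions for illustration, not part of the Runcrate SDK.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url: str, deadline_s: float = 1800, interval_s: float = 10,
                     probe=is_healthy, sleep=time.sleep) -> bool:
    """Poll the probe until it succeeds or the deadline passes."""
    waited = 0.0
    while waited <= deadline_s:
        if probe(base_url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

After the deployment loop finishes, calling `wait_until_ready(f"http://{status.ip}:8000")` would block until the server answers or the deadline expires. The `probe` and `sleep` parameters exist so the loop can be exercised without a network.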

***

## Option C: MCP (via Claude Code / Cursor)

> "Deploy an A100 instance called llm-server. Once it's ready, install vLLM and start serving Llama 3.1 70B on port 8000. Give me the IP when it's up."

Your AI assistant will:

1. Call `create_instance` with `name: "llm-server"` and `gpu: "A100"`
2. Poll `instance_status` until deployed
3. Call `ssh_execute` to install vLLM and start the server
4. Return the IP from `get_instance`

***

## Multi-GPU serving

For larger models (70B+ at FP16, or 405B with quantization), use tensor parallelism across multiple GPUs:

```bash theme={"theme":"github-dark"}
# Llama 3.1 405B in FP8 needs roughly 405 GB for weights alone, so use 8x 80 GB GPUs
runcrate instances create --name llm-server-8gpu --gpu H100 --gpu-count 8

runcrate ssh llm-server-8gpu -- "pip install vllm"

runcrate ssh llm-server-8gpu -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &"
```
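
As a rough way to pick a GPU count, the sketch below estimates weight memory from parameter count and precision, adds headroom, and rounds up to a power of two. The helper name, the 90% usable fraction, and the 20% headroom for KV cache and activations are assumptions; the power-of-two rounding is a common convention because the tensor-parallel size generally has to divide the model's attention-head count.

```python
import math


def min_tensor_parallel(params_billion: float, bytes_per_param: float,
                        gpu_gb: float = 80, usable_fraction: float = 0.9,
                        headroom: float = 0.2) -> int:
    """Smallest power-of-two GPU count whose usable memory holds weights plus headroom."""
    weights_gb = params_billion * bytes_per_param   # 1B params at 1 byte each is ~1 GB
    needed_gb = weights_gb * (1 + headroom)         # extra for KV cache and activations
    gpus = math.ceil(needed_gb / (gpu_gb * usable_fraction))
    return 1 << (gpus - 1).bit_length()             # round up to a power of two


for name, params, width in [("8B FP16", 8, 2), ("70B FP16", 70, 2), ("405B FP8", 405, 1)]:
    print(f"{name}: {min_tensor_parallel(params, width)} x 80 GB GPUs")
```

Treat the result as a starting point: longer `--max-model-len` values and higher concurrency need more KV-cache room than the flat headroom assumed here.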

## Monitoring

```bash theme={"theme":"github-dark"}
# Check GPU memory and utilization
runcrate ssh llm-server -- nvidia-smi

# Check vLLM logs
runcrate ssh llm-server -- "tail -50 /root/vllm.log"

# Check running and waiting request counts (vLLM metric names use a colon after "vllm")
runcrate ssh llm-server -- "curl -s localhost:8000/metrics | grep 'vllm:num_requests'"
```
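
If you would rather read these numbers from a script than from `grep`, the `/metrics` output is plain Prometheus text and is easy to parse. This is a minimal sketch: `parse_gauges` is a hypothetical helper, the sample text and its values are made up, and the metric names `vllm:num_requests_running` and `vllm:num_requests_waiting` are what recent vLLM releases expose, so check your own `/metrics` output.

```python
def parse_gauges(metrics_text: str, prefix: str = "vllm:num_requests") -> dict[str, float]:
    """Sum Prometheus samples whose metric name starts with the prefix, keyed by name."""
    totals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments and blank lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        if name.startswith(prefix):
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Illustrative sample in Prometheus text format; the values are made up.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 3.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"} 1.0
"""
print(parse_gauges(sample))
```

A waiting count that stays above zero suggests the server is saturated and requests are queuing.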

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete llm-server
```
