# Deploy Llama on a Cloud GPU

> Self-host Llama 4 Scout, Llama 3.1 70B, or Llama 3.1 405B on a dedicated GPU with vLLM. Full setup guide.

Run any Llama model on a dedicated GPU — from Llama 3.1 8B on an RTX 4090 to Llama 3.1 405B across four H100s. Three deployment paths: the Models API (zero infrastructure), vLLM self-hosting (full control), and Ollama (fast prototyping).

## Which Llama model to pick

| Model                         | Parameters                  | GPU              | VRAM needed    | Approx. cost |
| ----------------------------- | --------------------------- | ---------------- | -------------- | ------------ |
| Llama 3.1 8B Instruct         | 8B                          | RTX 4090 (24 GB) | \~16 GB (FP16) | \~\$0.35/hr  |
| Llama 4 Scout                 | 17B active (109B total MoE) | A100 80 GB       | \~55 GB (INT4) | \~\$1.60/hr  |
| Llama 3.1 70B Instruct        | 70B                         | A100 80 GB       | \~70 GB (INT8) | \~\$1.60/hr  |
| Llama 3.1 70B Instruct        | 70B                         | 2x A100 40 GB    | \~35 GB each   | \~\$2.40/hr  |
| Llama 3.1 405B Instruct (FP8) | 405B                        | 4x H100 80 GB    | \~50 GB each   | \~\$10.00/hr |

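Rule of thumb: weights take about 2 bytes of VRAM per parameter at FP16/BF16 and about 1 byte at FP8/INT8, so an 8B model needs roughly 16 GB at FP16. For mixture-of-experts models, count total parameters rather than active ones, because every expert must be resident in memory. Leave headroom for the KV cache, which grows with context length and concurrency. When in doubt, go one tier up; you can always downgrade later.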
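The rule of thumb can be turned into a quick calculation. The sketch below is illustrative only: `weight_vram_gb` and the bytes-per-parameter table are names made up for this example, and the result covers model weights alone, not the KV cache or activations.

```python
# Approximate bytes of VRAM per parameter for common weight formats.
# Illustrative values; real checkpoints carry some extra overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, dtype: str, num_gpus: int = 1) -> float:
    """Estimate per-GPU VRAM (in GB) needed for the weights alone.

    params_billion is the total parameter count in billions. For
    mixture-of-experts models, pass the total count, not the active count.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_gb = params_billion * BYTES_PER_PARAM[dtype.lower()]
    return total_gb / num_gpus


if __name__ == "__main__":
    print(weight_vram_gb(8, "fp16"))   # 16.0
    print(weight_vram_gb(70, "int8"))  # 70.0
    print(weight_vram_gb(70, "bf16"))  # 140.0
```

Compare the result against the VRAM of the GPU you plan to rent, and treat anything within a few GB of the limit as too tight once the KV cache is allocated.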

***

## Option 1: Models API (easiest — no GPU needed)

The fastest path. Hit the Runcrate Models API directly and pay per token. No instance to manage, no vLLM to install, no GPU to provision.

### curl

```bash theme={"theme":"github-dark"}
curl https://api.runcrate.ai/v1/chat/completions \
  -H "Authorization: Bearer rc_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
    "max_tokens": 256
  }'
```

### Python (OpenAI SDK)

```python theme={"theme":"github-dark"}
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What is Llama 4 Scout?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

### TypeScript (OpenAI SDK)

```typescript theme={"theme":"github-dark"}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runcrate.ai/v1",
  apiKey: "rc_live_YOUR_API_KEY",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  messages: [{ role: "user", content: "What is Llama 4 Scout?" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
```

Works with any model in the [catalog](/models/model-catalog) — swap the model string and go.

***

## Option 2: Self-host with vLLM (full control)

Run your own OpenAI-compatible endpoint on a dedicated GPU. You control the model, the context length, the quantization, and the scaling.

### Deploy Llama 3.1 8B (single RTX 4090)

```bash theme={"theme":"github-dark"}
runcrate instances create --name llama-8b --gpu RTX4090
```

Wait for deployment:

```bash theme={"theme":"github-dark"}
runcrate instances status llama-8b
```

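Install vLLM and start serving. The `meta-llama` repositories on Hugging Face are gated, so accept the model license and make a Hugging Face access token available on the instance (for example as `HF_TOKEN`) before the first download: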

```bash theme={"theme":"github-dark"}
runcrate ssh llama-8b -- "pip install vllm"

runcrate ssh llama-8b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```
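The server is started in the background, and downloading and loading the weights can take several minutes, so requests sent too early are refused. A minimal sketch for waiting until the endpoint is up, assuming the vLLM OpenAI-compatible server's `/health` route is reachable from your machine; the function name, timeout, and polling interval are choices made for this example:

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url, timeout_s=900.0, interval_s=5.0, probe=None, sleep=time.sleep):
    """Poll <base_url>/health until it answers 200 or timeout_s elapses.

    Returns True once the server is healthy, False on timeout. `probe` and
    `sleep` can be swapped out, which keeps the loop testable offline.
    """

    def http_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (OSError, http.client.HTTPException):
            # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
            return False

    probe = probe or http_probe
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False
```

Call it as `wait_until_ready("http://<INSTANCE_IP>:8000")` after starting the server, and check `/root/vllm.log` on the instance if it returns `False`.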

### Deploy Llama 3.1 70B (single A100 80 GB)

```bash theme={"theme":"github-dark"}
runcrate instances create --name llama-70b --gpu A100

runcrate ssh llama-70b -- "pip install vllm"

runcrate ssh llama-70b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

### Deploy Llama 4 Scout (single A100 80 GB)

```bash theme={"theme":"github-dark"}
runcrate instances create --name llama-scout --gpu A100

runcrate ssh llama-scout -- "pip install vllm"

runcrate ssh llama-scout -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

### Deploy Llama 3.1 405B FP8 (4x H100)

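The 405B model does not fit on a single GPU, so vLLM shards it with tensor parallelism. Set `--tensor-parallel-size` to the instance's GPU count, and confirm that the combined VRAM covers the FP8 weights (roughly 405 GB at 1 byte per parameter) plus KV cache before launching: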

```bash theme={"theme":"github-dark"}
runcrate instances create --name llama-405b --gpu H100 --gpu-count 4

runcrate ssh llama-405b -- "pip install vllm"

runcrate ssh llama-405b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
```

### Test your endpoint

```bash theme={"theme":"github-dark"}
# Get the instance IP
runcrate instances info llama-70b

# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What makes Llama open-weight?"}],
    "max_tokens": 256
  }'
```
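The `model` field in a request has to match a name the server is actually serving. A small sketch for checking that, assuming the endpoint implements the OpenAI-style `GET /v1/models` route (vLLM's server does); the helper names are made up for this example:

```python
import json
import urllib.request


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style `GET /v1/models` response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_served_model_ids(base_url):
    """Query <base_url>/models on a running server and return the served ids."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return served_model_ids(json.load(resp))
```

For example, `fetch_served_model_ids("http://<INSTANCE_IP>:8000/v1")` should list the value passed to `--model` when the server was started.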

### Point your app at it

Once the server is running, point any OpenAI-compatible SDK at your instance:

```python theme={"theme":"github-dark"}
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release."}],
)
print(response.choices[0].message.content)
```

```typescript theme={"theme":"github-dark"}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://<INSTANCE_IP>:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-70B-Instruct",
  messages: [{ role: "user", content: "Summarize the Llama 3.1 release." }],
});

console.log(response.choices[0].message.content);
```

### Monitoring

```bash theme={"theme":"github-dark"}
# GPU memory and utilization
runcrate ssh llama-70b -- nvidia-smi

# vLLM logs
runcrate ssh llama-70b -- "tail -50 /root/vllm.log"

# Active requests
runcrate ssh llama-70b -- "curl -s localhost:8000/metrics | grep vllm:num_requests"
```
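The `/metrics` output is Prometheus text, and vLLM prefixes its metric names with `vllm:` (for example `vllm:num_requests_running` and `vllm:num_requests_waiting`). If you would rather read the request gauges from a script than from `grep`, a parser along these lines is one option; the function name is an assumption for this example, and it handles only the simple sample lines shown, not the full exposition format:

```python
def parse_request_gauges(metrics_text, prefix="vllm:num_requests"):
    """Return {metric_name: summed value} for samples whose name starts with prefix."""
    totals = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name = line.split("{", 1)[0].split(None, 1)[0]
        if not name.startswith(prefix):
            continue
        # The value is the first field after the optional {labels} section.
        tail = line[line.rindex("}") + 1:] if "}" in line else line[len(name):]
        fields = tail.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


if __name__ == "__main__":
    # Made-up sample in the Prometheus text format, not real server output.
    sample = (
        '# TYPE vllm:num_requests_running gauge\n'
        'vllm:num_requests_running{model_name="demo"} 3.0\n'
        'vllm:num_requests_waiting{model_name="demo"} 1.0\n'
    )
    print(parse_request_gauges(sample))
```

A waiting count that stays above zero suggests the instance is saturated and requests are queuing.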

***

## Option 3: Self-host with Ollama (simpler, quantized)

Ollama runs quantized models with a single command. Good for development and prototyping — not recommended for production throughput.

### Deploy and set up

```bash theme={"theme":"github-dark"}
runcrate instances create --name llama-ollama --gpu RTX4090

runcrate ssh llama-ollama -- "curl -fsSL https://ollama.com/install.sh | sh"
```

### Pull and serve a model

```bash theme={"theme":"github-dark"}
# Start the server on all interfaces (pulling a model requires a running server)
runcrate ssh llama-ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"

# Pull Llama 3.1 8B (Q4 quantized, fits easily in 24 GB)
runcrate ssh llama-ollama -- "ollama pull llama3.1:8b"
```

### Test it

```bash theme={"theme":"github-dark"}
runcrate instances info llama-ollama

curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello from Ollama."}],
    "stream": false
  }'
```

Ollama also exposes an OpenAI-compatible endpoint at `/v1/chat/completions`, so you can use the same OpenAI SDK pattern:

```python theme={"theme":"github-dark"}
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is Ollama?"}],
)
print(response.choices[0].message.content)
```

### Limitations

* Quantized models (Q4/Q5) trade quality for memory efficiency. For production accuracy, use vLLM with FP16 or FP8.
* Ollama's serving throughput is lower than vLLM — fine for single-user development, not for concurrent production traffic.
* Larger models still need a larger GPU: Llama 3.1 70B at Q4 is roughly 40 GB of weights, which does not fit in an RTX 4090's 24 GB, so use an A100 80 GB.

***

## Benchmarks

Expected throughput for each model/GPU combination with vLLM, batch size 1, 2048-token output:

| Model              | GPU           | Tokens/sec (output) | Time to first token |
| ------------------ | ------------- | ------------------- | ------------------- |
| Llama 3.1 8B       | RTX 4090      | \~90–110 tok/s      | \~50 ms             |
| Llama 3.1 8B       | A100 80 GB    | \~120–150 tok/s     | \~35 ms             |
| Llama 4 Scout      | A100 80 GB    | \~60–80 tok/s       | \~80 ms             |
| Llama 3.1 70B      | A100 80 GB    | \~25–35 tok/s       | \~150 ms            |
| Llama 3.1 70B      | 2x A100 40 GB | \~20–30 tok/s       | \~200 ms            |
| Llama 3.1 405B FP8 | 4x H100       | \~15–25 tok/s       | \~300 ms            |

Throughput scales with concurrent requests. At 8+ concurrent requests, vLLM's continuous batching can push aggregate throughput 3–5x higher than single-request numbers.
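To see how aggregate throughput behaves on your own instance, you can fire several requests at once and divide the total generated tokens by the wall-clock time. The sketch below is one way to do that with the standard library. It assumes the endpoint returns an OpenAI-style `usage.completion_tokens` field, and the function names are made up for this example:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def chat_completion_tokens(base_url, model, prompt, max_tokens=256):
    """Send one chat request and return the number of generated tokens."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def aggregate_tokens_per_sec(send, concurrency, clock=time.perf_counter):
    """Run send() `concurrency` times in parallel; return total tokens / wall time."""
    start = clock()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(concurrency)]
        total_tokens = sum(f.result() for f in futures)
    elapsed = clock() - start
    return total_tokens / elapsed
```

Wrap a request with `functools.partial(chat_completion_tokens, "http://<INSTANCE_IP>:8000/v1", "meta-llama/Llama-3.1-70B-Instruct", "Write a short story.")` and compare `aggregate_tokens_per_sec(send, 1)` against `aggregate_tokens_per_sec(send, 8)`.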

***

## Which approach to choose

| Approach             | Best for                                     | Cost           | Setup time   |
| -------------------- | -------------------------------------------- | -------------- | ------------ |
| **Models API**       | Production apps, no infra to manage          | Per token      | 60 seconds   |
| **vLLM self-host**   | Custom serving, max throughput, data privacy | Per hour (GPU) | \~10 minutes |
| **Ollama self-host** | Development, prototyping, experimentation    | Per hour (GPU) | \~5 minutes  |

**Start with the Models API** if you want to ship today. Move to vLLM self-hosting when you need dedicated throughput, custom context lengths, or want to keep all data on your own infrastructure.
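When comparing per-token and per-hour pricing, it helps to convert a GPU's hourly rate into an effective cost per million generated tokens. A rough sketch, assuming the GPU is kept busy for the whole hour; the function name is made up for this example, and the inputs are the approximate A100 rate and 70B throughput from the tables above:

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_sec):
    """Effective self-hosting cost (USD) per one million generated tokens."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # ~$1.60/hr A100 serving one request at a time at ~30 tok/s
    print(round(cost_per_million_tokens(1.60, 30), 2))  # 14.81
    # Same GPU at 3x aggregate throughput from concurrent requests
    print(round(cost_per_million_tokens(1.60, 90), 2))  # 4.94
```

Idle time raises the effective figure, so self-hosting pays off mainly when the instance sees steady, concurrent traffic.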

***

## Cleanup

When you're done with self-hosted instances:

```bash theme={"theme":"github-dark"}
runcrate instances delete llama-8b
runcrate instances delete llama-70b
runcrate instances delete llama-scout
runcrate instances delete llama-405b
runcrate instances delete llama-ollama
```
