# Set Up Model Serving with AI Agents

> Use MCP tools to deploy a GPU instance, install vLLM, start an OpenAI-compatible endpoint, test it, and monitor performance — all through conversation.

Go from zero to a live inference endpoint in a single conversation. Your AI agent provisions the GPU, installs the serving framework, starts the server, and hands you the URL.

***

## "Set up a vLLM server with Llama 3.1 70B on an A100."

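The agent handles the full deployment. Llama 3.1 70B needs roughly 140 GB for its weights alone in 16-bit precision, so serving it from a single A100 80GB assumes a quantized variant of the model; the unquantized checkpoint has to be split across several GPUs with `--tensor-parallel-size`. The steps: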

1. **`list_gpu_types`** — confirms A100 80GB pricing (\$1.60/hr)
2. **`create_instance`** — deploys `llama-serve` with A100 80GB
3. **`instance_status`** — polls until running
4. **`ssh_execute`** — `pip install vllm`
5. **`ssh_execute`** — starts vLLM:
   ```bash
   nohup python -m vllm.entrypoints.openai.api_server \
     --model meta-llama/Llama-3.1-70B-Instruct \
     --max-model-len 8192 \
     --port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &
   ```
6. **`ssh_execute`** — `sleep 30 && curl -s localhost:8000/health` to verify
7. **`get_instance`** — retrieves the public IP
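
The fixed `sleep 30` in step 6 is a simplification: loading a 70B model usually takes longer than that, and `/health` does not answer until the server is up. A minimal sketch of a polling check, assuming it runs on the instance itself; the function names, the 15-minute budget, and the 10-second interval are illustrative choices, not part of Runcrate or vLLM:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status:
        # the server is probably still loading the model.
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = probe_health,
    max_wait_s: float = 900.0,   # assumed budget, tune for your model size
    interval_s: float = 10.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` every `interval_s` seconds until it succeeds or the budget runs out."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Calling `wait_until_healthy()` returns `True` as soon as the server responds, and `False` if it never comes up, in which case `/root/vllm.log` is the place to look.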

> Your vLLM server is live:
>
> ```
> http://203.0.113.42:8000/v1/chat/completions
> ```
>
> It's OpenAI-compatible. Use it as a drop-in replacement:
>
> ```python theme={"theme":"github-dark"}
> from openai import OpenAI
> client = OpenAI(base_url="http://203.0.113.42:8000/v1", api_key="unused")
> ```
>
> Model: Llama 3.1 70B Instruct | GPU: A100 80GB | Cost: \$1.60/hr
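
If you would rather not install the `openai` package, the same endpoint can be called with the standard library. This is a sketch under the assumption that the server is reachable at the example address from the reply above; `build_chat_request` and `ask` are hypothetical helper names:

```python
import json
import urllib.request

BASE_URL = "http://203.0.113.42:8000/v1"  # example address, replace with your instance IP
MODEL = "meta-llama/Llama-3.1-70B-Instruct"


def build_chat_request(
    prompt: str, max_tokens: int = 128, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`ask("What is vLLM?")` sends the same payload the agent uses in its own test request.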

***

## "Test the endpoint with a quick prompt and show me the latency."

The agent sends a test request:

1. **`ssh_execute`** — runs a timed curl request from the instance itself:
   ```bash
   time curl -s localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"What is vLLM?"}],"max_tokens":128}'
   ```

> **Response:**
> "vLLM is a high-throughput serving engine for large language models that uses PagedAttention..."
>
> **Latency:**
>
> * Time to first token: \~140ms
> * Total generation (128 tokens): \~4.2s
> * Throughput: \~30 tokens/sec
>
> The server is healthy and performing within the expected range for a 70B model on an A100.
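
A plain `time curl` on a non-streaming request reports only the total wall-clock time. Time to first token needs a streamed request (`"stream": true`) with a timestamp recorded for each chunk. The sketch below shows how the three figures relate once those timestamps exist; `summarize_latency` is a hypothetical helper, and it assumes one streamed chunk per token, which is only approximately true in practice:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySummary:
    ttft_s: float        # time to first token, in seconds
    total_s: float       # request sent to last token, in seconds
    tokens_per_s: float  # tokens divided by total time


def summarize_latency(request_sent_s: float, token_arrivals_s: Sequence[float]) -> LatencySummary:
    """Summarize latency from the send time and the arrival time of each token."""
    if not token_arrivals_s:
        raise ValueError("no tokens were received")
    ttft = token_arrivals_s[0] - request_sent_s
    total = token_arrivals_s[-1] - request_sent_s
    if total <= 0:
        raise ValueError("arrival times must be later than the send time")
    return LatencySummary(ttft, total, len(token_arrivals_s) / total)
```

With the numbers reported above, 128 tokens over 4.2 s works out to 128 / 4.2 ≈ 30.5 tokens/sec, which matches the rounded ~30 figure.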

***

## "Check GPU usage on the serving instance. Is it handling load well?"

The agent inspects the server:

1. **`ssh_execute`** — `nvidia-smi`
2. **`ssh_execute`** — `curl -s localhost:8000/metrics | grep -E 'num_requests|gpu_cache'`
3. **`ssh_execute`** — `tail -10 /root/vllm.log`

> **GPU status:**
>
> * Memory: 71,200 / 81,920 MiB (87%; vLLM reserves most GPU memory up front for weights and KV cache, so a high figure is expected)
> * Utilization: 34% (moderate load)
>
> **Server metrics:**
>
> * Active requests: 2
> * KV cache usage: 12%
> * Requests served (last hour): 847
>
> The server has plenty of capacity. With the KV cache at 12%, roughly \~8x the current concurrency (100 / 12 ≈ 8.3) would fit before hitting memory pressure, assuming new requests have similar prompt and output lengths.
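
The headroom estimate can be reproduced from the `/metrics` output. A rough sketch, assuming the gauges are named `vllm:num_requests_running` and `vllm:gpu_cache_usage_perc` and that cache usage is reported as a fraction between 0 and 1; metric names vary between vLLM versions, so check your own `/metrics` output first. Both function names are hypothetical:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text output into {metric_name: value}, dropping labels.

    Assumes lines carry no trailing timestamp, so the value is the last field.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            values[name] = float(raw_value)
        except ValueError:
            continue  # not a numeric sample, ignore it
    return values


def concurrency_headroom(kv_cache_usage: float) -> float:
    """Estimate how many times the current load fits, given usage as a 0-1 fraction."""
    if not 0 < kv_cache_usage <= 1:
        raise ValueError("kv_cache_usage must be in (0, 1]")
    return 1.0 / kv_cache_usage
```

For a cache usage of 0.12, `concurrency_headroom` gives about 8.3. Treat this as a ceiling rather than a forecast, since longer prompts consume cache faster than the current mix of requests.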

***

## Tools used in this workflow

| Tool                                  | Purpose                                            |
| ------------------------------------- | -------------------------------------------------- |
| `list_gpu_types`                      | Check GPU availability and pricing                 |
| `create_instance` / `instance_status` | Provision and wait for deployment                  |
| `ssh_execute`                         | Install vLLM, start server, test endpoint, monitor |
| `get_instance`                        | Retrieve the public IP for the endpoint URL        |
