Go from zero to a live inference endpoint in a single conversation. Your AI agent provisions the GPU, installs the serving framework, starts the server, and hands you the URL.Documentation Index
Fetch the complete documentation index at: https://runcrate.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
”Set up a vLLM server with Llama 3.1 70B on an A100.”
The agent handles the full deployment:list_gpu_types— confirms A100 80GB pricing ($1.60/hr)create_instance— deploysllama-servewith A100 80GBinstance_status— polls until runningssh_execute—pip install vllmssh_execute— starts vLLM:ssh_execute—sleep 30 && curl -s localhost:8000/healthto verifyget_instance— retrieves the public IP
Your vLLM server is live:It’s OpenAI-compatible. Use it as a drop-in replacement:Model: Llama 3.1 70B Instruct | GPU: A100 80GB | Cost: $1.60/hr
”Test the endpoint with a quick prompt and show me the latency.”
The agent sends a test request:ssh_execute— runs a timed curl request from the instance itself:
Response: “vLLM is a high-throughput serving engine for large language models that uses PagedAttention…” Latency:Server is healthy and performing within expected range for 70B on A100.
- Time to first token: ~140ms
- Total generation (128 tokens): ~4.2s
- Throughput: ~30 tokens/sec
”Check GPU usage on the serving instance. Is it handling load well?”
The agent inspects the server:ssh_execute—nvidia-smissh_execute—curl -s localhost:8000/metrics | grep -E 'num_requests|gpu_cache'ssh_execute—tail -10 /root/vllm.log
GPU status:Server metrics:
- Memory: 71,200 / 81,920 MiB (87% — model loaded, healthy headroom)
- Utilization: 34% (moderate load)
The server has plenty of capacity. KV cache at 12% means you can handle ~8x more concurrent requests before hitting memory pressure.
- Active requests: 2
- KV cache usage: 12%
- Requests served (last hour): 847
Tools used in this workflow
| Tool | Purpose |
|---|---|
list_gpu_types | Check GPU availability and pricing |
create_instance / instance_status | Provision and wait for deployment |
ssh_execute | Install vLLM, start server, test endpoint, monitor |
get_instance | Retrieve the public IP for the endpoint URL |