# Run Batch Inference at Scale

> Deploy an H100, load an LLM, process thousands of inputs with vLLM batch mode, download results, and tear down.

Process thousands of prompts through an LLM in a single session. Deploy a GPU, run vLLM in offline batch mode, collect results, tear down. Pay only for the compute hours you use.

## 1. Prepare your input file

Create a JSONL file with one prompt per line:

```json theme={"theme":"github-dark"}
{"prompt": "Summarize this review: 'Great product, fast shipping.'", "id": "r001"}
{"prompt": "Summarize this review: 'Arrived broken. Returning.'", "id": "r002"}
```
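If the prompts come from another data source, the file can be generated rather than written by hand. The following is a minimal sketch, assuming the reviews are already in a Python list; the helper name `write_prompts` and the `r001`-style ID scheme are illustrative choices that mirror the example above, not part of any Runcrate or vLLM API.

```python
import json


def write_prompts(reviews, path="prompts.jsonl"):
    """Write one {"prompt", "id"} JSON object per line and return the count."""
    with open(path, "w", encoding="utf-8") as f:
        for n, review in enumerate(reviews, start=1):
            record = {
                "prompt": f"Summarize this review: '{review}'",
                "id": f"r{n:03d}",  # hypothetical ID scheme: r001, r002, ...
            }
            # json.dumps escapes quotes and newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(reviews)


if __name__ == "__main__":
    reviews = ["Great product, fast shipping.", "Arrived broken. Returning."]
    write_prompts(reviews)
```

Serializing with `json.dumps` instead of string formatting keeps the file valid JSONL even when a review contains quotes or line breaks.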

## 2. Deploy and upload

```bash theme={"theme":"github-dark"}
runcrate instances create --name batch-job --gpu H100 --template ubuntu-inference
runcrate instances status batch-job

runcrate cp ./prompts.jsonl batch-job:/workspace/prompts.jsonl
```

## 3. Install vLLM, then upload and run the batch script

```bash theme={"theme":"github-dark"}
runcrate ssh batch-job -- "pip install vllm"
```

```python theme={"theme":"github-dark"}
# batch_infer.py
import json
from vllm import LLM, SamplingParams

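# Note: 70B weights in 16-bit precision take roughly 140 GB, more than a single
# 80 GB GPU holds. Use a multi-GPU instance with tensor_parallel_size set, a
# quantized checkpoint, or a smaller model such as meta-llama/Llama-3.1-8B-Instruct.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", max_model_len=4096)
# Low temperature keeps output nearly deterministic; each response is capped at 256 tokens.
params = SamplingParams(max_tokens=256, temperature=0.1)

# Read one JSON object per line, skipping blank lines.
with open("/workspace/prompts.jsonl") as f:
    items = [json.loads(line) for line in f if line.strip()]

# generate() returns outputs in the same order as the input prompts.
prompts = [item["prompt"] for item in items]
outputs = llm.generate(prompts, params)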

with open("/workspace/results.jsonl", "w") as f:
    for item, output in zip(items, outputs):
        f.write(json.dumps({
            "id": item["id"],
            "response": output.outputs[0].text,
        }) + "\n")

print(f"Processed {len(items)} prompts.")
```

```bash theme={"theme":"github-dark"}
runcrate cp ./batch_infer.py batch-job:/workspace/batch_infer.py
runcrate ssh batch-job -- "cd /workspace && python batch_infer.py"
```

## 4. Download results and tear down

```bash theme={"theme":"github-dark"}
runcrate cp batch-job:/workspace/results.jsonl ./results.jsonl
runcrate instances delete batch-job
```
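Because deleting the instance also removes `/workspace`, it can be worth checking the downloaded file between the two commands above. The sketch below runs locally and compares the `id` fields of the two JSONL files; the function names `load_ids` and `missing_ids` are hypothetical helpers, not part of the Runcrate CLI.

```python
import json


def load_ids(path):
    """Return the "id" value of every non-blank line in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["id"] for line in f if line.strip()]


def missing_ids(prompts_path, results_path):
    """Return the input IDs that have no matching line in the results file."""
    done = set(load_ids(results_path))
    return [i for i in load_ids(prompts_path) if i not in done]
```

Calling `missing_ids("prompts.jsonl", "results.jsonl")` returns an empty list when every prompt has a result; any IDs it does return can be re-run before the instance is deleted.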

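## Monitor progress

While the job from step 3 is running, check on it from a second terminal. `nvidia-smi` shows GPU utilization and memory use. `batch_infer.py` as written writes `results.jsonl` only after all prompts have finished, so the line count becomes available once generation is done.

```bash theme={"theme":"github-dark"}
runcrate ssh batch-job -- nvidia-smi
runcrate ssh batch-job -- "wc -l /workspace/results.jsonl"
```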

## Cost estimate

| Prompts | Model     | GPU        | Time     | Approx. cost |
| ------- | --------- | ---------- | -------- | ------------ |
| 10,000  | Llama 8B  | RTX 4090   | \~15 min | \~\$0.09     |
| 10,000  | Llama 70B | A100 80 GB | \~30 min | \~\$0.80     |
| 100,000 | Llama 70B | H100 80 GB | \~2 hrs  | \~\$5.00     |
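Each row is simply runtime multiplied by an hourly rate. The sketch below reproduces the table's arithmetic; the hourly rates are back-calculated from the table itself (cost divided by time) and are assumptions for illustration, not quoted Runcrate prices.

```python
def estimate_cost(hours, rate_per_hour):
    """Estimated job cost in dollars: runtime in hours times the hourly rate."""
    return round(hours * rate_per_hour, 2)


# Hourly rates implied by the table above (cost / time); assumed, not official pricing.
IMPLIED_RATES = {"RTX 4090": 0.36, "A100 80 GB": 1.60, "H100 80 GB": 2.50}

for gpu, hours in [("RTX 4090", 0.25), ("A100 80 GB", 0.5), ("H100 80 GB", 2.0)]:
    print(gpu, estimate_cost(hours, IMPLIED_RATES[gpu]))
```

Actual runtime depends on prompt length, `max_tokens`, and model size, so a short trial run on a sample of the prompts gives a better estimate than the table alone.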

## Tips

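* vLLM's offline mode schedules all prompts with continuous batching, which is typically much faster than sending sequential API calls.
* For 100K+ prompts, split the file and process it in chunks so that prompts and outputs do not all have to be held in host memory at once.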
* Check your balance before starting: `runcrate billing balance`.
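The chunking tip above can be sketched as follows. This is a minimal illustration, not the documented Runcrate or vLLM workflow: `chunked` and `run_in_chunks` are hypothetical helpers, the default chunk size is an arbitrary starting point, and `generate` stands for any callable that maps a list of prompt strings to a list of response strings in the same order.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(items, generate, out_path, chunk_size=10_000):
    """Generate responses chunk by chunk, appending each chunk to out_path.

    Returns the number of result lines written.
    """
    written = 0
    with open(out_path, "a", encoding="utf-8") as f:
        for chunk in chunked(items, chunk_size):
            responses = generate([item["prompt"] for item in chunk])
            for item, response in zip(chunk, responses):
                f.write(json.dumps({"id": item["id"], "response": response}) + "\n")
            f.flush()  # push finished chunks to the file so `wc -l` reflects progress
            written += len(chunk)
    return written
```

With the `llm` and `params` objects from `batch_infer.py`, `generate` could be `lambda prompts: [o.outputs[0].text for o in llm.generate(prompts, params)]`. Because results are appended after every chunk, a failed run keeps the chunks that already finished, and the line count of the output file grows while the job is still running.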
