Run Batch Inference at Scale

Process thousands of prompts through an LLM in a single session. Deploy a GPU, run vLLM in offline batch mode, collect results, tear down. Pay only for the compute hours you use.

1. Prepare your input file

Create a JSONL file with one prompt per line:

{"prompt": "Summarize this review: 'Great product, fast shipping.'", "id": "r001"}
{"prompt": "Summarize this review: 'Arrived broken. Returning.'", "id": "r002"}

2. Deploy and upload

runcrate instances create --name batch-job --gpu H100 --template ubuntu-inference
runcrate instances status batch-job

runcrate cp ./prompts.jsonl batch-job:/workspace/prompts.jsonl

3. Install vLLM and upload the batch script

runcrate ssh batch-job -- "pip install vllm"

# batch_infer.py
import json
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", max_model_len=4096)
params = SamplingParams(max_tokens=256, temperature=0.1)

with open("/workspace/prompts.jsonl") as f:
    items = [json.loads(line) for line in f]

prompts = [item["prompt"] for item in items]
outputs = llm.generate(prompts, params)

with open("/workspace/results.jsonl", "w") as f:
    for item, output in zip(items, outputs):
        f.write(json.dumps({
            "id": item["id"],
            "response": output.outputs[0].text,
        }) + "\n")

print(f"Processed {len(items)} prompts.")

runcrate cp ./batch_infer.py batch-job:/workspace/batch_infer.py
runcrate ssh batch-job -- "cd /workspace && python batch_infer.py"

4. Download results and tear down

runcrate cp batch-job:/workspace/results.jsonl ./results.jsonl
runcrate instances delete batch-job

Monitor progress

runcrate ssh batch-job -- nvidia-smi
runcrate ssh batch-job -- "wc -l /workspace/results.jsonl"

Cost estimate

Prompts	Model	GPU	Time	Approx. cost
10,000	Llama 8B	RTX 4090	~15 min	~$0.09
10,000	Llama 70B	A100 80 GB	~30 min	~$0.80
100,000	Llama 70B	H100 80 GB	~2 hrs	~$5.00

Tips

vLLM offline batch mode uses continuous batching — much faster than sequential API calls.
For 100K+ prompts, split the file and process in chunks to avoid OOM.
Check your balance before starting: runcrate billing balance.

​1. Prepare your input file

​2. Deploy and upload

​3. Install vLLM and upload the batch script

​4. Download results and tear down

​Monitor progress

​Cost estimate

​Tips