Process thousands of prompts through an LLM in a single session. Deploy a GPU, run vLLM in offline batch mode, collect results, tear down. Pay only for the compute hours you use.
1. Prepare your prompts
Create a JSONL file with one prompt per line:
{"prompt": "Summarize this review: 'Great product, fast shipping.'", "id": "r001"}
{"prompt": "Summarize this review: 'Arrived broken. Returning.'", "id": "r002"}
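If the prompts come from raw records, a short script can generate the file instead of writing it by hand. The following is a minimal sketch, assuming the input is a list of review strings; the `build_prompt_records` and `write_jsonl` helper names and the `r001`-style ID scheme are illustrative and not part of any Runcrate tooling.

```python
import json


def build_prompt_records(reviews, prefix="r"):
    """Turn raw review strings into {"prompt", "id"} records (hypothetical helper)."""
    records = []
    for index, review in enumerate(reviews, start=1):
        records.append({
            "prompt": f"Summarize this review: '{review}'",
            "id": f"{prefix}{index:03d}",
        })
    return records


def write_jsonl(records, path):
    """Write one JSON object per line; json.dumps escapes any embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    reviews = ["Great product, fast shipping.", "Arrived broken. Returning."]
    write_jsonl(build_prompt_records(reviews), "prompts.jsonl")
```

Using `json.dumps` for every line keeps quotes and newlines inside a review from breaking the one-record-per-line layout.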
2. Deploy and upload
runcrate instances create --name batch-job --gpu H100 --template ubuntu-inference
runcrate instances status batch-job
runcrate cp ./prompts.jsonl batch-job:/workspace/prompts.jsonl
3. Install vLLM and upload the batch script
runcrate ssh batch-job -- "pip install vllm"
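# batch_infer.py
import json

from vllm import LLM, SamplingParams

# Note: a 70B model in 16-bit precision needs roughly 140 GB for weights alone,
# more than a single 80 GB GPU. If loading fails, use a multi-GPU instance with
# tensor_parallel_size, a quantized checkpoint, or a smaller model.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", max_model_len=4096)
params = SamplingParams(max_tokens=256, temperature=0.1)

# Load one JSON object per line, skipping blank lines.
with open("/workspace/prompts.jsonl") as f:
    items = [json.loads(line) for line in f if line.strip()]

prompts = [item["prompt"] for item in items]
# generate() returns outputs in the same order as the input prompts.
outputs = llm.generate(prompts, params)

# Results are written once, after the whole batch has finished.
with open("/workspace/results.jsonl", "w") as f:
    for item, output in zip(items, outputs):
        f.write(json.dumps({
            "id": item["id"],
            "response": output.outputs[0].text,
        }) + "\n")

print(f"Processed {len(items)} prompts.")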
runcrate cp ./batch_infer.py batch-job:/workspace/batch_infer.py
runcrate ssh batch-job -- "cd /workspace && python batch_infer.py"
4. Download results and tear down
runcrate cp batch-job:/workspace/results.jsonl ./results.jsonl
runcrate instances delete batch-job
Monitor progress
runcrate ssh batch-job -- nvidia-smi
runcrate ssh batch-job -- "wc -l /workspace/results.jsonl"
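Line counts only reflect progress if the batch script writes results as it goes; the script in step 3 writes results.jsonl once, after generation finishes. Assuming an incremental writer, a small helper run on the instance can turn the two line counts into a percentage. The `count_lines` and `progress` names are illustrative.

```python
import os


def count_lines(path):
    """Count non-empty lines in a file; a missing file counts as zero."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def progress(prompts_path, results_path):
    """Return (done, total, percent) based on line counts of the two files."""
    total = count_lines(prompts_path)
    done = count_lines(results_path)
    percent = 100.0 * done / total if total else 0.0
    return done, total, percent


if __name__ == "__main__":
    done, total, percent = progress(
        "/workspace/prompts.jsonl", "/workspace/results.jsonl"
    )
    print(f"{done}/{total} prompts done ({percent:.1f}%)")
```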
Cost estimate
| Prompts | Model | GPU | Time | Approx. cost |
|---|---|---|---|---|
| 10,000 | Llama 8B | RTX 4090 | ~15 min | ~$0.09 |
| 10,000 | Llama 70B | A100 80 GB | ~30 min | ~$0.80 |
| 100,000 | Llama 70B | H100 80 GB | ~2 hrs | ~$5.00 |
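The figures in the table follow from run time multiplied by an hourly GPU rate. The sketch below backs out the hourly rates that the table rows imply; these are derived from the approximate numbers above, not quoted prices, so check current pricing before relying on them. The function names are illustrative.

```python
def estimate_cost(hours, hourly_rate_usd):
    """Cost of a run billed by compute time: hours multiplied by hourly rate."""
    return hours * hourly_rate_usd


def implied_hourly_rate(cost_usd, hours):
    """Back out the hourly rate implied by a (cost, run time) pair."""
    return cost_usd / hours


# Rows from the table above: (GPU, run time in hours, approximate cost in USD).
TABLE_ROWS = [
    ("RTX 4090", 0.25, 0.09),
    ("A100 80 GB", 0.5, 0.80),
    ("H100 80 GB", 2.0, 5.00),
]

if __name__ == "__main__":
    for gpu, hours, cost in TABLE_ROWS:
        rate = implied_hourly_rate(cost, hours)
        print(f"{gpu}: ~${rate:.2f}/hr implied")
    # Hypothetical example: a 3-hour run at the H100 rate implied above.
    print(f"3 hr on H100: ~${estimate_cost(3, 2.50):.2f}")
```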
Tips
- vLLM offline batch mode uses continuous batching, which is much faster than sequential API calls.
- For 100K+ prompts, split the file and process it in chunks to avoid out-of-memory (OOM) errors.
- Check your balance before starting: runcrate billing balance
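The chunking tip can be sketched as follows. This is a variant that keeps a single prompts file but reads and processes it in fixed-size chunks, writing results after each one so that a crash loses at most one chunk. `generate_fn` is a stand-in for a wrapper around vLLM, for example `lambda prompts: [o.outputs[0].text for o in llm.generate(prompts, params)]`; the helper names and the default chunk size of 1,000 are assumptions to tune.

```python
import json
from itertools import islice


def read_jsonl(path):
    """Yield one parsed record per non-empty line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


def process_in_chunks(prompts_path, results_path, generate_fn, chunk_size=1000):
    """Run generate_fn on each chunk and write its results before moving on."""
    processed = 0
    with open(results_path, "w", encoding="utf-8") as out:
        for chunk in chunked(read_jsonl(prompts_path), chunk_size):
            texts = generate_fn([item["prompt"] for item in chunk])
            for item, text in zip(chunk, texts):
                out.write(json.dumps({"id": item["id"], "response": text}) + "\n")
            out.flush()  # make finished chunks visible to line-count checks
            processed += len(chunk)
    return processed
```

Because results reach disk chunk by chunk, counting lines in the results file during the run reflects how many prompts have finished.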