# Evaluate LLM Performance on Cloud GPU

> Deploy a GPU instance, install lm-eval-harness, run benchmarks on open-source models, and download results.

Run standardized benchmarks on LLMs using EleutherAI's lm-evaluation-harness (`lm-eval`). Compare models on MMLU, HellaSwag, ARC, and more on your own GPU, with settings you control and can reproduce.

## 1. Deploy a GPU instance

```bash theme={"theme":"github-dark"}
runcrate instances create --name eval --gpu A100 --template ubuntu-devbox
runcrate instances status eval
```

## 2. Install lm-eval-harness

```bash theme={"theme":"github-dark"}
runcrate ssh eval -- "pip install 'lm-eval[vllm]' vllm"
```

## 3. Run a benchmark

Evaluate Llama 3.1 8B Instruct on MMLU (5-shot):

```bash theme={"theme":"github-dark"}
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-mmlu"
```

## 4. Run a full benchmark suite

```bash theme={"theme":"github-dark"}
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-full"
```

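## 5. Compare two models

Run the same suite with the same settings on a second model, here Qwen2.5 7B Instruct, and write the results to a separate output directory: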

```bash theme={"theme":"github-dark"}
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/qwen-7b-full"
```
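
Once both runs finish, a score gap between the two models only means something relative to the standard errors that lm-eval reports next to each metric. The sketch below is a rough illustration, not part of lm-eval: the function name `gap_in_standard_errors` is hypothetical, and it treats the two runs as independent estimates even though both models see the same benchmark items, so read it as an approximate check rather than a proper paired test.

```python
import math


def gap_in_standard_errors(score_a, stderr_a, score_b, stderr_b):
    """Return (difference, z) for two scores with their standard errors.

    Assumes the two estimates are independent, which is only an
    approximation when both models are scored on the same items.
    """
    diff = score_a - score_b
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    if combined == 0:
        raise ValueError("both standard errors are zero")
    return diff, diff / combined


# Placeholder numbers for illustration only, not real benchmark results.
diff, z = gap_in_standard_errors(0.50, 0.03, 0.46, 0.04)
print(f"difference={diff:.3f}, z={z:.2f}")
```

Under this approximation, an absolute `z` well below 2 suggests the gap is within evaluation noise.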

## 6. Download results

```bash theme={"theme":"github-dark"}
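# Recent lm-eval releases write results_<timestamp>.json inside a per-model subdirectory
runcrate ssh eval -- "find /workspace/results/llama-8b-full -name 'results*.json' -exec python -m json.tool {} \;"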
runcrate cp eval:/workspace/results/ ./eval-results/
```
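
After copying the results locally, a short script can put the models side by side. The following is a hedged sketch, not part of lm-eval or the Runcrate CLI. It assumes each run directory contains at least one `results*.json` file whose top-level `results` object maps task names to metric dictionaries (with keys such as `acc,none`), which is the layout recent lm-eval releases appear to produce. The function names `load_scores` and `compare` are illustrative.

```python
import json
from pathlib import Path


def load_scores(results_dir):
    """Return {task: {metric: value}} from the newest results*.json under results_dir."""
    files = sorted(
        Path(results_dir).rglob("results*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not files:
        raise FileNotFoundError(f"no results*.json under {results_dir}")
    data = json.loads(files[-1].read_text())
    scores = {}
    for task, metrics in data.get("results", {}).items():
        # Keep numeric metrics only; skip stderr entries and the "alias" label.
        scores[task] = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float)) and "stderr" not in name
        }
    return scores


def compare(dirs):
    """Return one text line per (task, metric), with one cell per labelled run."""
    tables = {label: load_scores(path) for label, path in dirs.items()}
    rows = sorted(
        {(task, m) for s in tables.values() for task, ms in s.items() for m in ms}
    )
    lines = []
    for task, metric in rows:
        cells = []
        for label in dirs:
            value = tables[label].get(task, {}).get(metric)
            cells.append(f"{label}={value:.4f}" if value is not None else f"{label}=n/a")
        lines.append(f"{task} {metric}: " + "  ".join(cells))
    return lines
```

For example, `compare({"llama-8b": "eval-results/llama-8b-full", "qwen-7b": "eval-results/qwen-7b-full"})` returns one line per task and metric. The paths here are assumptions; adjust them to wherever `runcrate cp` placed the files.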

## Available benchmark tasks

| Task             | Measures                     |
| ---------------- | ---------------------------- |
| `mmlu`           | Knowledge across 57 subjects |
| `hellaswag`      | Common-sense reasoning       |
| `arc_challenge`  | Science reasoning (hard)     |
| `truthfulqa_mc2` | Truthfulness                 |
| `winogrande`     | Pronoun resolution           |
| `gsm8k`          | Grade-school math            |
| `humaneval`      | Code generation              |

## Tips

* Use `--batch_size auto` to find the largest batch size that fits in VRAM.
* The vLLM backend is usually much faster than the default Hugging Face (`hf`) backend, especially on generation tasks such as `gsm8k`.
* For gated models such as Llama 3.1, accept the license on Hugging Face and authenticate on the instance with `huggingface-cli login` before running a benchmark.
* Run the same tasks with the same `num_fewshot` across models for fair comparison.
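
One way to keep settings identical across models is to generate every command from a single template, so that only the model name and output path vary. The sketch below is an illustrative helper under that assumption. The function name `lm_eval_command` and the `models` mapping are hypothetical; the flags are the ones already used in this guide.

```python
import shlex

# Task list and few-shot count shared by every model in the comparison.
TASKS = ["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc2", "winogrande", "gsm8k"]


def lm_eval_command(model, output_dir, tasks=TASKS, num_fewshot=5):
    """Build the lm_eval command line for one model with shared settings."""
    args = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
        "--output_path", output_dir,
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


# Hypothetical mapping of model IDs to output directories on the instance.
models = {
    "meta-llama/Llama-3.1-8B-Instruct": "/workspace/results/llama-8b-full",
    "Qwen/Qwen2.5-7B-Instruct": "/workspace/results/qwen-7b-full",
}

for model, output_dir in models.items():
    print(lm_eval_command(model, output_dir))
```

Each printed line can be passed to `runcrate ssh eval -- "..."` as in the steps above.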

## Cleanup

```bash theme={"theme":"github-dark"}
runcrate instances delete eval
```