# Debug GPU Issues with AI Agents

> Use MCP tools to diagnose CUDA out-of-memory errors, check GPU utilization, inspect logs, and get fix suggestions — all through conversation.

When training crashes or inference slows down, your AI agent can SSH into the instance, inspect GPU state, read logs, and suggest concrete fixes.

***

## "My training script is crashing with CUDA OOM. Can you check what's happening?"

The agent investigates the GPU and the error:

1. **`list_instances`** — finds your running instances
2. **`ssh_execute`** — runs `nvidia-smi` on the target instance
3. **`ssh_execute`** — runs `tail -50 /root/train.log` to find the crash

> **Instance:** `training-run` (1x A100 40GB)
> **GPU memory:** 39,800 / 40,960 MiB (97% — nearly full)
>
> The log shows:
>
> ```
> torch.cuda.OutOfMemoryError: CUDA out of memory.
> Tried to allocate 2.00 GiB. GPU 0 has 1.12 GiB free.
> ```
>
> You're running out of memory during the backward pass. Three options:
>
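> 1. **Reduce batch size** from 16 to 8 — roughly halves activation memory (weights and optimizer state stay the same size)
> 2. **Enable gradient checkpointing** — for Hugging Face Transformers models, add `model.gradient_checkpointing_enable()` before training; this trades extra compute for lower activation memory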
> 3. **Upgrade GPU** — deploy on an A100 80GB for 2x the VRAM at \$1.60/hr

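The memory check in this exchange can also be scripted. The sketch below is a minimal illustration, not the agent's actual implementation: it assumes you have captured the output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, and the names `GpuMemory` and `parse_gpu_memory` and the 90% warning threshold are hypothetical.

```python
from typing import NamedTuple


class GpuMemory(NamedTuple):
    used_mib: int
    total_mib: int

    @property
    def percent_used(self) -> float:
        return 100.0 * self.used_mib / self.total_mib


def parse_gpu_memory(csv_text: str) -> list[GpuMemory]:
    """Parse one 'used, total' line per GPU (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(GpuMemory(used, total))
    return gpus


# Sample input matching the figures in the exchange above.
sample = "39800, 40960\n"
gpu = parse_gpu_memory(sample)[0]
nearly_full = gpu.percent_used > 90.0  # hypothetical threshold
print(f"{gpu.percent_used:.0f}% used, nearly full: {nearly_full}")
```

Parsing the CSV query output rather than the default `nvidia-smi` table keeps the script independent of the table layout, which can differ between driver versions.
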
***

## "Check if there are any zombie processes eating GPU memory on my dev-box."

The agent inspects processes on the GPU:

1. **`ssh_execute`** — runs `nvidia-smi` to see GPU process list
2. **`ssh_execute`** — runs `ps aux | grep python` to cross-reference

> Found 3 Python processes using GPU memory on **dev-box**:
>
> | PID  | GPU Memory | Command                                                       |
> | ---- | ---------- | ------------------------------------------------------------- |
> | 1842 | 12,400 MiB | `python train.py` (your active run)                           |
> | 1203 | 8,200 MiB  | `python eval.py` (finished 2 hours ago, still holding memory) |
> | 980  | 4,100 MiB  | `python -m jupyter` (idle notebook kernel)                    |
>
> PIDs 1203 and 980 are zombie processes. Want me to kill them? That would free 12.3 GB of VRAM.

If you say yes, the agent runs:

3. **`ssh_execute`** — `kill 1203 980`
4. **`ssh_execute`** — `nvidia-smi` again to confirm memory is freed

> Killed 2 zombie processes. GPU memory dropped from 24.7 GB to 12.4 GB. You now have 28.5 GB free.

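The memory arithmetic in this exchange can be checked with a short script. This is a sketch under stated assumptions: it takes the per-process output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits`, and the names `parse_compute_apps` and `memory_freed_mib` are hypothetical. Deciding which PIDs are safe to kill is still a judgment call, so the script takes them as input.

```python
def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> GPU memory in MiB from 'pid, used_memory' lines."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, used_mib = (int(field.strip()) for field in line.split(","))
        usage[pid] = used_mib
    return usage


def memory_freed_mib(usage: dict[int, int], stale_pids: set[int]) -> int:
    """Total GPU memory held by the PIDs chosen for termination."""
    return sum(mib for pid, mib in usage.items() if pid in stale_pids)


# Sample input matching the process table above.
sample = """\
1842, 12400
1203, 8200
980, 4100
"""
usage = parse_compute_apps(sample)
freed = memory_freed_mib(usage, stale_pids={1203, 980})
remaining = sum(usage.values()) - freed
print(f"freed {freed} MiB, {remaining} MiB still in use")
```

Running the same query again after the `kill` and comparing the result against `remaining` is one way to confirm the memory was actually released.
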
***

## "My inference server is responding slowly. What's wrong?"

The agent checks GPU utilization, system resources, and the serving process:

1. **`get_instance`** — confirms instance specs
2. **`ssh_execute`** — runs `nvidia-smi dmon -s u -c 5` (5 samples of GPU utilization)
3. **`ssh_execute`** — runs `free -h` to check RAM
4. **`ssh_execute`** — runs `df -h /` to check disk space
5. **`ssh_execute`** — runs `tail -30 /root/vllm.log` to check the serving logs

> **Instance:** `inference-prod` (1x H100 80GB)
>
> * **GPU utilization:** averaging 15% (not compute-bound)
> * **GPU memory:** 72 GB / 80 GB (healthy)
> * **System RAM:** 58 GB / 64 GB (91% — swapping likely)
> * **Disk:** 94% full — only 12 GB free
>
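> The bottleneck is system RAM, not the GPU. vLLM is likely swapping to disk during request preprocessing, and the nearly full disk leaves little headroom. Two fixes:
>
> 1. Free disk space by deleting cached and temporary files: `rm -rf /root/.cache/pip /tmp/*` (first confirm that no running process depends on files in `/tmp`)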
> 2. Restart vLLM with `--max-num-seqs 64` (down from default 256) to reduce CPU-side memory

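The triage in this exchange amounts to comparing a few metrics against thresholds. The sketch below is illustrative only: the function `find_bottlenecks` and every threshold in it are assumptions chosen for the example, not values used by the agent or recommended by vLLM.

```python
def find_bottlenecks(
    gpu_util_pct: float,
    ram_used_gb: float,
    ram_total_gb: float,
    disk_used_pct: float,
) -> list[str]:
    """Return human-readable findings; all thresholds are hypothetical."""
    findings = []
    ram_pct = 100.0 * ram_used_gb / ram_total_gb
    if ram_pct >= 90.0:
        findings.append(f"system RAM at {ram_pct:.0f}%: swapping likely")
    if disk_used_pct >= 90.0:
        findings.append(f"disk at {disk_used_pct:.0f}%: low free space")
    if gpu_util_pct >= 90.0:
        findings.append(f"GPU at {gpu_util_pct:.0f}%: compute-bound")
    elif findings:
        # A mostly idle GPU alongside host pressure points at the host.
        findings.append(f"GPU at {gpu_util_pct:.0f}%: likely waiting on the host")
    return findings


# Input figures taken from the exchange above.
for finding in find_bottlenecks(
    gpu_util_pct=15, ram_used_gb=58, ram_total_gb=64, disk_used_pct=94
):
    print(finding)
```

Checking host resources before concluding anything about the GPU reflects the order of reasoning in the response: low GPU utilization on a slow server usually means the GPU is waiting on something else.
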
***

## Tools used in this workflow

| Tool             | Purpose                                                             |
| ---------------- | ------------------------------------------------------------------- |
| `list_instances` | Find the problematic instance                                       |
| `get_instance`   | Check instance specs and configuration                              |
| `ssh_execute`    | Run `nvidia-smi`, read logs, kill processes, check system resources |
