# Monitor Training Progress with AI Agents

> Use MCP tools to check training logs, GPU utilization, disk space, and estimate time remaining — all through conversation.

Instead of SSH-ing in manually to run `nvidia-smi` and `tail` your logs, ask your AI agent. It can read logs, parse metrics, check system health, and estimate completion time.

***

## "How's training going on my A100 instance?"

The agent checks logs and GPU state:

1. **`ssh_execute`** — `tail -20 /root/train.log`
2. **`ssh_execute`** — `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader`

> **Training progress on `training-run`:**
>
> ```
> Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5
> Epoch 3/10 | Step 4,220/14,000 | Loss: 0.3398 | LR: 1.8e-5
> ```
>
> **GPU health:**
>
> * Utilization: 96%
> * Memory: 74,200 / 81,920 MiB (91%)
> * Temperature: 72 °C (healthy, throttles at 83 °C)
>
> **Progress:** 30% complete (4,200 / 14,000 steps).
> **Estimated time remaining:** \~4.5 hours at the current step rate (\~36 steps/min).
> Loss is trending down steadily — training looks healthy.
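
The progress and time-remaining figures in that reply are simple arithmetic on the latest log line. The sketch below shows one way to reproduce them in Python. It assumes the log format shown above; the function names are hypothetical, and the step rate of 36 steps/min is taken from the agent's reply rather than measured here.

```python
import re
from typing import Optional

# Matches lines like:
#   Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5
# (assumed format, copied from the example log above)
LOG_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<epochs>\d+) \| "
    r"Step (?P<step>[\d,]+)/(?P<total>[\d,]+) \| "
    r"Loss: (?P<loss>[\d.]+)"
)


def parse_progress(line: str) -> Optional[dict]:
    """Extract epoch, step, total steps, and loss from one log line."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    return {
        "epoch": int(m["epoch"]),
        "step": int(m["step"].replace(",", "")),
        "total": int(m["total"].replace(",", "")),
        "loss": float(m["loss"]),
    }


def estimate_remaining_hours(step: int, total: int, steps_per_min: float) -> float:
    """Remaining steps divided by the observed step rate, converted to hours."""
    return (total - step) / steps_per_min / 60


latest = parse_progress("Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5")
percent_done = 100 * latest["step"] / latest["total"]
hours_left = estimate_remaining_hours(latest["step"], latest["total"], steps_per_min=36)
print(f"{percent_done:.0f}% complete, about {hours_left:.1f} hours remaining")
```

With 9,800 steps left at 36 steps/min, this gives roughly 272 minutes, which matches the estimate of about 4.5 hours.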

***

## "Check disk space and make sure I won't run out mid-training."

The agent inspects storage:

1. **`ssh_execute`** — `df -h /`
2. **`ssh_execute`** — `du -sh /workspace/checkpoints/ 2>/dev/null`
3. **`ssh_execute`** — `ls -lhrt /workspace/checkpoints/ | tail -5`

> **Disk usage on `training-run`:**
>
> | Mount | Used   | Available | Capacity |
> | ----- | ------ | --------- | -------- |
> | /     | 142 GB | 58 GB     | 71%      |
>
> **Checkpoint directory:** 89 GB across 3 checkpoints:
>
> ```
> checkpoint-1400/  28 GB  (epoch 1)
> checkpoint-2800/  30 GB  (epoch 2)
> checkpoint-4200/  31 GB  (epoch 3, latest)
> ```
>
> Each checkpoint is \~30 GB. You have 7 more epochs, which means \~210 GB of additional checkpoints — but you only have 58 GB free.
>
> **You'll run out of disk at epoch 5.** Options:
>
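> 1. Delete old checkpoints: `rm -rf /workspace/checkpoints/checkpoint-1400 /workspace/checkpoints/checkpoint-2800` (frees 58 GB, bringing free space to 116 GB: enough for three more checkpoints, but not all seven)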
> 2. Keep only the two most recent checkpoints (add `--save_total_limit 2` to your training args)
> 3. Write checkpoints to a mounted persistent volume instead
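
The disk projection above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one checkpoint per epoch at a constant size of about 30 GB; the helper names are hypothetical and the figures come from the disk report above.

```python
from typing import Optional


def checkpoints_that_fit(free_gb: float, checkpoint_gb: float) -> int:
    """Number of whole checkpoints that fit in the remaining free space."""
    return int(free_gb // checkpoint_gb)


def first_failing_epoch(current_epoch: int, total_epochs: int,
                        free_gb: float, checkpoint_gb: float) -> Optional[int]:
    """Epoch whose checkpoint no longer fits, or None if every checkpoint fits."""
    fits = checkpoints_that_fit(free_gb, checkpoint_gb)
    failing = current_epoch + fits + 1
    return failing if failing <= total_epochs else None


# Figures from the disk report: 58 GB free, ~30 GB per checkpoint, epoch 3 of 10 done.
free_gb, checkpoint_gb = 58, 30
current_epoch, total_epochs = 3, 10

needed_gb = (total_epochs - current_epoch) * checkpoint_gb
print(f"Need about {needed_gb} GB, have {free_gb} GB free")
print("First checkpoint that will not fit: epoch",
      first_failing_epoch(current_epoch, total_epochs, free_gb, checkpoint_gb))
```

Only one more 30 GB checkpoint fits in 58 GB, so the epoch 4 checkpoint is written and the epoch 5 checkpoint fails. Passing `free_gb=116` (the space available after deleting the two oldest checkpoints) moves the failure to epoch 7, which is why deleting old checkpoints alone does not cover the full run.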

***

## "Compare GPU utilization across all my running instances."

The agent checks every instance:

1. **`list_instances`** — gets all running instances
2. **`ssh_execute`** on each — `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader`

> | Instance       | GPU       | Utilization | Memory       | Status                |
> | -------------- | --------- | ----------- | ------------ | --------------------- |
> | training-run   | A100 80GB | 96%         | 72.5 / 80 GB | actively training     |
> | inference-prod | H100 80GB | 28%         | 72.0 / 80 GB | serving (low traffic) |
> | dev-box        | RTX 4090  | 0%          | 0.4 / 24 GB  | idle                  |
>
> **training-run** is fully utilized and healthy.
> **inference-prod** is underutilized at 28%; consider downgrading to an A100 if traffic stays low.
> **dev-box** is completely idle; consider terminating it to save \$0.35/hr.
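
To build a comparison like this, each instance's `nvidia-smi` row has to be parsed and labeled. The sketch below shows one possible approach. It assumes a single GPU per instance and the `csv,noheader` row format, which includes units (for example `96 %, 74200 MiB, 81920 MiB`). The sample rows, the function names, and the 50% threshold are illustrative assumptions, not values defined by Runcrate.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row such as '96 %, 74200 MiB, 81920 MiB'."""
    util, used, total = (field.strip() for field in line.split(","))
    return {
        "util_pct": int(util.rstrip(" %")),
        "mem_used_mib": int(used.split()[0]),
        "mem_total_mib": int(total.split()[0]),
    }


def classify(util_pct: int) -> str:
    """Label an instance by GPU utilization (thresholds are an assumption)."""
    if util_pct == 0:
        return "idle"
    if util_pct < 50:
        return "underutilized"
    return "busy"


# Sample rows standing in for ssh_execute output; real values come from each instance.
sample_output = {
    "training-run": "96 %, 74200 MiB, 81920 MiB",
    "inference-prod": "28 %, 73728 MiB, 81920 MiB",
    "dev-box": "0 %, 410 MiB, 24564 MiB",
}

for name, raw in sample_output.items():
    gpu = parse_gpu_line(raw)
    used_gib = gpu["mem_used_mib"] / 1024
    total_gib = gpu["mem_total_mib"] / 1024
    status = classify(gpu["util_pct"])
    print(f"{name:<15} {gpu['util_pct']:>3}%  {used_gib:5.1f} / {total_gib:.0f} GiB  {status}")
```

Utilization alone does not separate an idle GPU from one that is waiting on data loading, so it is worth reading the labels alongside memory usage, as the table above does.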

***

## Tools used in this workflow

| Tool             | Purpose                                        |
| ---------------- | ---------------------------------------------- |
| `list_instances` | Find all running instances                     |
| `ssh_execute`    | Check GPU stats, read logs, inspect disk usage |