Instead of SSH-ing in manually to check nvidia-smi and tail logs, ask your AI agent. It can read logs, parse metrics, check system health, and estimate completion time.

“How’s training going on my A100 instance?”

The agent checks logs and GPU state:
  1. ssh_execute: tail -20 /root/train.log
  2. ssh_execute: nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
Training progress on training-run:
Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5
Epoch 3/10 | Step 4,220/14,000 | Loss: 0.3398 | LR: 1.8e-5
GPU health:
  • Utilization: 96%
  • Memory: 74,200 / 81,920 MiB (91%)
  • Temperature: 72 °C (healthy, throttles at 83 °C)
Progress: 30% complete (4,200 / 14,000 steps). Estimated time remaining: ~4.5 hours at the current step rate (~36 steps/min). Loss is trending down steadily — training looks healthy.
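
The progress and time-remaining figures above follow from simple arithmetic on the log line. Here is a minimal Python sketch of that calculation, assuming the log format shown in the example and a constant step rate; the helper names `parse_step` and `estimate_remaining_hours` are hypothetical and not part of any Runcrate tool.

```python
import re

# Pattern for the "Step 4,200/14,000" field in the example log lines.
# A different training script would need a different pattern.
STEP_RE = re.compile(r"Step ([\d,]+)/([\d,]+)")


def parse_step(line):
    """Return (current_step, total_steps) from one log line, or None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    current, total = (int(group.replace(",", "")) for group in match.groups())
    return current, total


def estimate_remaining_hours(current, total, steps_per_min):
    """Remaining wall-clock hours, assuming the step rate stays constant."""
    return (total - current) / steps_per_min / 60


line = "Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5"
current, total = parse_step(line)
hours = estimate_remaining_hours(current, total, steps_per_min=36)
print(f"{current / total:.0%} complete, ~{hours:.1f} h remaining")
# prints "30% complete, ~4.5 h remaining"
```

The estimate is only as good as the constant-rate assumption: evaluation passes, checkpoint writes, or data-loading stalls will stretch the real figure.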

“Check disk space and make sure I won’t run out mid-training.”

The agent inspects storage:
  1. ssh_execute: df -h /
  2. ssh_execute: du -sh /workspace/checkpoints/ 2>/dev/null
  3. ssh_execute: ls -lhrt /workspace/checkpoints/ | tail -5
Disk usage on training-run:
Mount  Used    Available  Capacity
/      142 GB  58 GB      71%
Checkpoint directory: 89 GB across 3 checkpoints:
checkpoint-1400/  28 GB  (epoch 1)
checkpoint-2800/  30 GB  (epoch 2)
checkpoint-4200/  31 GB  (epoch 3, latest)
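Each checkpoint is ~30 GB. The 7 remaining epochs will add ~210 GB of checkpoints, but only 58 GB is free: the epoch 4 checkpoint fits, and the disk fills while the epoch 5 checkpoint is being written. Options:
  1. Delete old checkpoints: rm -rf /workspace/checkpoints/checkpoint-1400 /workspace/checkpoints/checkpoint-2800 (frees 58 GB)
  2. Cap the number of checkpoints kept on disk (for example, add --save_total_limit 2 to your training args if you use the Hugging Face Trainer, which then keeps only the two most recent)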
  3. Write checkpoints to a mounted persistent volume instead
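
The "runs out at epoch 5" conclusion can be reproduced with a short projection. This is a rough Python sketch, assuming every future checkpoint is a flat 30 GB, one checkpoint is written per epoch, and nothing else writes to the disk; the function name `first_epoch_out_of_space` is hypothetical.

```python
def first_epoch_out_of_space(free_gb, checkpoint_gb, next_epoch, last_epoch):
    """Return the first epoch whose checkpoint no longer fits, or None if all fit."""
    for epoch in range(next_epoch, last_epoch + 1):
        if checkpoint_gb > free_gb:
            return epoch
        free_gb -= checkpoint_gb  # this epoch's checkpoint is written
    return None


# Current state from the example: 58 GB free, epochs 4 through 10 still to run.
print(first_epoch_out_of_space(58, 30, next_epoch=4, last_epoch=10))   # prints 5

# After deleting the two oldest checkpoints (58 GB + 58 GB freed = 116 GB free).
print(first_epoch_out_of_space(116, 30, next_epoch=4, last_epoch=10))  # prints 7
```

Under these assumptions, deleting the two old checkpoints once only postpones the problem to epoch 7, so it would need to be repeated or combined with a checkpoint limit or a larger volume.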

“Compare GPU utilization across all my running instances.”

The agent checks every instance:
  1. list_instances: gets all running instances
  2. ssh_execute on each instance: nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
Instance        GPU        Utilization  Memory        Status
training-run    A100 80GB  96%          74.2 / 80 GB  training actively
inference-prod  H100 80GB  28%          72.0 / 80 GB  serving (low traffic)
dev-box         RTX 4090   0%           0.4 / 24 GB   idle
training-run is fully utilized and healthy. inference-prod is underutilized at 28%; if traffic stays low, it could be downgraded to an A100 80GB (the 72 GB in use would not fit a smaller card). dev-box is completely idle; consider terminating it to save $0.35/hr.
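
To make the comparison concrete, here is a small Python sketch of how the raw nvidia-smi output could be turned into the status labels above. It assumes the `--format=csv,noheader` output keeps its units (for example `96 %, 74200 MiB, 81920 MiB`, one line per GPU); the thresholds and the names `parse_gpu_line` and `classify` are illustrative assumptions, not values the agent is known to use.

```python
def parse_gpu_line(line):
    """Parse one 'util %, used MiB, total MiB' line of nvidia-smi CSV output."""
    util, used, total = (field.strip() for field in line.split(","))
    return {
        "util_pct": int(util.split()[0]),        # "96 %" -> 96
        "mem_used_mib": int(used.split()[0]),    # "74200 MiB" -> 74200
        "mem_total_mib": int(total.split()[0]),  # "81920 MiB" -> 81920
    }


def classify(util_pct, idle_below=5, low_below=40):
    """Label a GPU by utilization; both thresholds are arbitrary examples."""
    if util_pct < idle_below:
        return "idle"
    if util_pct < low_below:
        return "underutilized"
    return "busy"


stats = parse_gpu_line("96 %, 74200 MiB, 81920 MiB")
print(stats["util_pct"], classify(stats["util_pct"]))  # prints "96 busy"

# Utilization figures from the comparison table above.
for name, util in [("training-run", 96), ("inference-prod", 28), ("dev-box", 0)]:
    print(name, classify(util))
```

A single nvidia-smi reading is a snapshot, so a decision to downgrade or terminate an instance is better based on several samples taken over time.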

Tools used in this workflow

Tool            Purpose
list_instances  Find all running instances
ssh_execute     Check GPU stats, read logs, inspect disk usage