nvidia-smi and tail logs, ask your AI agent. It can read logs, parse metrics, check system health, and estimate completion time.
“How’s training going on my A100 instance?”
The agent checks logs and GPU state:

- `ssh_execute`: `tail -20 /root/train.log`
- `ssh_execute`: `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader`
Training progress on training-run:

Progress: 30% complete (4,200 / 14,000 steps). Estimated time remaining: ~4.5 hours at the current step rate (~36 steps/min). Loss is trending down steadily, so training looks healthy.

GPU health:

- Utilization: 96%
- Memory: 74,200 / 81,920 MiB (91%)
- Temperature: 72 °C (healthy, throttles at 83 °C)
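The numbers in this report can be reproduced from the raw command output. The sketch below is illustrative only: `parse_gpu_line` and `hours_remaining` are hypothetical helper names, not part of the agent, and the parser assumes the usual `csv,noheader` layout in which utilization and memory carry units (`96 %`, `74200 MiB`) and temperature is a bare number.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse one line of nvidia-smi CSV output for the four queried fields.

    Assumed input shape: "96 %, 74200 MiB, 81920 MiB, 72"
    (utilization, memory used, memory total, temperature).
    """
    util, mem_used, mem_total, temp = [field.strip() for field in line.split(",")]
    used_mib = int(mem_used.split()[0])    # drop the "MiB" unit
    total_mib = int(mem_total.split()[0])
    return {
        "utilization_pct": int(util.split()[0]),  # drop the "%" unit
        "memory_used_mib": used_mib,
        "memory_total_mib": total_mib,
        "memory_pct": round(100 * used_mib / total_mib),
        "temperature_c": int(temp),
    }


def hours_remaining(done_steps: int, total_steps: int, steps_per_min: float) -> float:
    """Estimate remaining wall-clock hours, assuming the step rate stays constant."""
    return (total_steps - done_steps) / steps_per_min / 60


if __name__ == "__main__":
    print(parse_gpu_line("96 %, 74200 MiB, 81920 MiB, 72"))
    print(f"~{hours_remaining(4200, 14000, 36):.1f} hours remaining")
```

With the values from the example (4,200 of 14,000 steps at ~36 steps/min), the estimate comes out at roughly 4.5 hours, and 74,200 / 81,920 MiB rounds to 91%.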
“Check disk space and make sure I won’t run out mid-training.”
The agent inspects storage:

- `ssh_execute`: `df -h /`
- `ssh_execute`: `du -sh /workspace/checkpoints/ 2>/dev/null`
- `ssh_execute`: `ls -lhrt /workspace/checkpoints/ | tail -5`
Disk usage on training-run:

| Mount | Used | Available | Capacity |
|---|---|---|---|
| / | 142 GB | 58 GB | 71% |

Checkpoint directory: 89 GB across 3 checkpoints.

Each checkpoint is ~30 GB. You have 7 more epochs, which means ~210 GB of additional checkpoints, but you only have 58 GB free. You’ll run out of disk at epoch 5. Options:

- Delete old checkpoints: `rm -rf checkpoint-1400 checkpoint-2800` (frees 58 GB)
- Keep only the most recent checkpoints (add `--save_total_limit 2` to your training args)
- Write checkpoints to a mounted persistent volume instead
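The disk projection above is simple arithmetic, sketched here so it can be rerun with different numbers. `project_disk` is a hypothetical helper, and the sketch assumes one checkpoint of roughly constant size per epoch, with three epochs already completed (inferred from the three checkpoints on disk).

```python
from typing import Optional


def project_disk(free_gb: float, checkpoint_gb: float,
                 completed_epochs: int, remaining_epochs: int) -> dict:
    """Project whether the remaining checkpoints fit in the free space.

    Assumes one checkpoint per epoch and that nothing else consumes disk.
    """
    needed_gb = checkpoint_gb * remaining_epochs
    # Number of whole checkpoints that fit, capped at how many are still needed.
    fits = min(int(free_gb // checkpoint_gb), remaining_epochs)
    runs_out_at: Optional[int] = None
    if fits < remaining_epochs:
        # The first checkpoint that does not fit belongs to this epoch.
        runs_out_at = completed_epochs + fits + 1
    return {
        "needed_gb": needed_gb,
        "checkpoints_that_fit": fits,
        "runs_out_at_epoch": runs_out_at,
    }


if __name__ == "__main__":
    print(project_disk(free_gb=58, checkpoint_gb=30,
                       completed_epochs=3, remaining_epochs=7))
```

With 58 GB free and ~30 GB per checkpoint, only one more checkpoint fits, so the write fails at epoch 5, which matches the agent's warning.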
“Compare GPU utilization across all my running instances.”
The agent checks every instance:

- `list_instances`: gets all running instances
- `ssh_execute` on each: `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader`

| Instance | GPU | Utilization | Memory | Status |
|---|---|---|---|---|
| training-run | A100 80GB | 96% | 74.2 / 80 GB | training actively |
| inference-prod | H100 80GB | 28% | 72.0 / 80 GB | serving (low traffic) |
| dev-box | RTX 4090 | 0% | 0.4 / 24 GB | idle |

training-run is fully utilized and healthy. inference-prod is underutilized at 28% and could be downgraded to an A100 if traffic stays low. dev-box is completely idle; consider terminating it to save $0.35/hr.
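The comparison above amounts to bucketing each instance by GPU utilization. A minimal sketch follows; the thresholds (below 5% counts as idle, below 40% as underutilized) are assumptions chosen for illustration, not values the agent is documented to use, and `classify_utilization` is a hypothetical name.

```python
def classify_utilization(util_pct: float,
                         idle_below: float = 5.0,
                         low_below: float = 40.0) -> str:
    """Bucket a GPU utilization percentage using illustrative thresholds."""
    if util_pct < idle_below:
        return "idle"
    if util_pct < low_below:
        return "underutilized"
    return "healthy"


if __name__ == "__main__":
    # Utilization readings taken from the example comparison.
    readings = {"training-run": 96, "inference-prod": 28, "dev-box": 0}
    for name, util in readings.items():
        print(f"{name}: {util}% -> {classify_utilization(util)}")
```

A single reading is only a snapshot, so a recommendation to downgrade or terminate is better based on several samples taken over time.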
Tools used in this workflow
| Tool | Purpose |
|---|---|
| `list_instances` | Find all running instances |
| `ssh_execute` | Check GPU stats, read logs, inspect disk usage |