Monitor Training Progress with AI Agents

Instead of SSH-ing in manually to check nvidia-smi and tail logs, ask your AI agent. It can read logs, parse metrics, check system health, and estimate completion time.

“How’s training going on my A100 instance?”

The agent checks logs and GPU state:

ssh_execute — tail -20 /root/train.log
ssh_execute — nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader

Training progress on training-run:
Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5
Epoch 3/10 | Step 4,220/14,000 | Loss: 0.3398 | LR: 1.8e-5
GPU health:

Utilization: 96%

Memory: 74,200 / 81,920 MiB (91%)

Temperature: 72C (healthy, throttles at 83C)

Progress: 30% complete (4,200 / 14,000 steps). Estimated time remaining: ~4.5 hours at the current step rate (~36 steps/min). Loss is trending down steadily — training looks healthy.
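The progress and time-remaining figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the agent's actual implementation: it assumes log lines follow exactly the format shown in the sample, that the step rate (~36 steps/min) is already known, and that it stays constant. The names `parse_progress` and `estimate_remaining_hours` are hypothetical.

```python
import re

# Matches lines shaped like the sample log above, e.g.
# "Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5"
LOG_LINE = re.compile(
    r"Epoch (\d+)/(\d+) \| Step ([\d,]+)/([\d,]+) \| Loss: ([\d.]+)"
)


def parse_progress(line):
    """Return (current_step, total_steps, loss), or None if the line does not match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    step = int(m.group(3).replace(",", ""))
    total = int(m.group(4).replace(",", ""))
    return step, total, float(m.group(5))


def estimate_remaining_hours(step, total, steps_per_min):
    """Remaining wall-clock hours, assuming the step rate stays constant."""
    return (total - step) / steps_per_min / 60


step, total, loss = parse_progress(
    "Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5"
)
percent_done = 100 * step / total
hours_left = estimate_remaining_hours(step, total, steps_per_min=36)
print(f"{percent_done:.0f}% complete, ~{hours_left:.1f} hours remaining")
```

With the sample values this gives 30% complete and roughly 4.5 hours remaining, matching the agent's summary. In practice the step rate would have to be measured, for example from timestamps in the log if the training script writes them.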

“Check disk space and make sure I won’t run out mid-training.”

The agent inspects storage:

ssh_execute — df -h /
ssh_execute — du -sh /workspace/checkpoints/ 2>/dev/null
ssh_execute — ls -lhrt /workspace/checkpoints/ | tail -5

Disk usage on training-run:
Mount	Used	Available	Capacity
/	142 GB	58 GB	71%
Checkpoint directory: 89 GB across 3 checkpoints:
checkpoint-1400/  28 GB  (epoch 1)
checkpoint-2800/  30 GB  (epoch 2)
checkpoint-4200/  31 GB  (epoch 3, latest)
Each checkpoint is ~30 GB. You have 7 more epochs, which means ~210 GB of additional checkpoints — but you only have 58 GB free. You’ll run out of disk at epoch 5. Options:

Delete old checkpoints: rm -rf /workspace/checkpoints/checkpoint-1400 /workspace/checkpoints/checkpoint-2800 (frees 58 GB)

Keep only the two most recent checkpoints (add --save_total_limit 2 to your training args)

Write checkpoints to a mounted persistent volume instead
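The disk projection behind this warning is simple enough to check by hand. Here is a small sketch of the calculation, assuming one checkpoint per epoch, a constant checkpoint size of about 30 GB, and nothing else writing to the disk. The function name `first_epoch_without_space` is hypothetical.

```python
def first_epoch_without_space(free_gb, checkpoint_gb, current_epoch, total_epochs):
    """Return the first epoch whose checkpoint no longer fits, or None if all fit.

    Assumes one checkpoint per epoch, a constant checkpoint size, and that
    nothing else writes to the disk.
    """
    for epoch in range(current_epoch + 1, total_epochs + 1):
        if checkpoint_gb > free_gb:
            return epoch
        free_gb -= checkpoint_gb
    return None


# Figures from the example: 58 GB free, ~30 GB per checkpoint, epoch 3 of 10 done.
remaining_epochs = 10 - 3
print(remaining_epochs * 30)                     # 210 GB still to be written
print(first_epoch_without_space(58, 30, 3, 10))  # 5
```

The epoch 4 checkpoint fits and leaves about 28 GB free, so the epoch 5 checkpoint is the first one that cannot be written.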

“Compare GPU utilization across all my running instances.”

The agent checks every instance:

list_instances — gets all running instances
ssh_execute on each — nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader

Instance	GPU	Utilization	Memory	Status
training-run	A100 80GB	96%	74.2 / 80 GB	training actively
inference-prod	H100 80GB	28%	72.0 / 80 GB	serving (low traffic)
dev-box	RTX 4090	0%	0.4 / 24 GB	idle
training-run is fully utilized — healthy. inference-prod is underutilized at 28% — could downgrade to A100 if traffic stays low. dev-box is completely idle — consider terminating to save $0.35/hr.
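To make the comparison concrete, here is a rough sketch of how the raw nvidia-smi output could be turned into the labels above. It assumes each instance has a single GPU, so the command returns one line, and that every field is numeric (the parser would fail on values such as `[N/A]`). The function names, the thresholds in `classify`, and the sample lines are illustrative assumptions, not the agent's actual logic or real measurements.

```python
def parse_gpu_line(line):
    """Parse one line of output from
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
    for example "96 %, 74200 MiB, 81920 MiB"."""
    # Each field looks like "<number> <unit>", so keep only the number.
    util, used, total = (field.strip().split()[0] for field in line.split(","))
    return {"util_pct": int(util), "mem_used_mib": int(used), "mem_total_mib": int(total)}


def classify(util_pct, idle_below=5, low_below=50):
    """Label an instance by GPU utilization. The thresholds are illustrative assumptions."""
    if util_pct < idle_below:
        return "idle"
    if util_pct < low_below:
        return "underutilized"
    return "busy"


# Made-up sample lines that mirror the table above; in practice each line would
# come from running the nvidia-smi command on one instance.
samples = {
    "training-run": "96 %, 74200 MiB, 81920 MiB",
    "inference-prod": "28 %, 73728 MiB, 81920 MiB",
    "dev-box": "0 %, 410 MiB, 24564 MiB",
}
for name, line in samples.items():
    stats = parse_gpu_line(line)
    print(name, stats["util_pct"], classify(stats["util_pct"]))
```

A single utilization reading is only a snapshot. Before downgrading or terminating an instance, it is worth sampling over a longer window.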

Tools used in this workflow

Tool	Purpose
`list_instances`	Find all running instances
`ssh_execute`	Check GPU stats, read logs, inspect disk usage
