Building a Self-Running ML Training Pipeline

01 Overview

Background

Running experiments on a GPU cluster means a lot of waiting. Submit a job, wait hours, check results, tweak the config, submit again. The loop is entirely manual, and it gets old fast — especially when you're staring at squeue at 3am.

This post covers how I automated the full training loop using OpenClaw, including the part that's hardest to automate: analyzing the results and deciding what to try next.

Stack

PSC Bridges2 (H100) — actual training
AWS EC2 (always-on) — orchestration hub
OpenClaw / Hachimi — monitoring, decision-making, Telegram notifications
GitHub — archive every run under runs/v{n}/
Notion — live dashboard

02 Architecture Selection First

Before writing any training code, I needed to know what architecture to actually build. Reading papers myself is slow and easy to get wrong. My approach was to split this into two steps.

First, Hachimi put together a research prompt based on my specific constraints — parameter budget, dataset size, task type. Then I fed that prompt into ChatGPT Deep Research.

The split makes sense: OpenClaw is good at executing tasks and managing workflow, but ChatGPT Deep Research is noticeably better at synthesizing recent literature — it can scan papers from last month and distill them into a ranked recommendation. Within 20 minutes I had a concrete direction backed by 2024–2025 papers, which saved a lot of aimless experimentation.

03 Pipeline Design

Architecture

You (Telegram)
    ↕  Hachimi notifies you / waits for approval
AWS EC2 (always-on orchestration hub)
    ├── analyze_and_trigger.py   parse log, call LLM, submit next job
    ├── pull_and_commit.sh       git commit + competition submission
    └── OpenClaw / Hachimi       monitoring + judgment layer
         ↕  SSH callback on job completion
PSC Bridges2 (GPU cluster)
    └── train.py → results.json → triggers callback

Flow

1. Submit the first job (sbatch submit_v1.sh)
2. Training finishes; the script SSH-calls back to AWS with the log path
3. AWS parses the training curve, calls an LLM for analysis, generates the next config
4. Next job submits automatically (sbatch submit_v2.sh)
5. Competition submission, Notion update, and git commit all happen automatically
6. Telegram receives a 3-line summary: runtime, final metric, what changed

Once this is set up, the full "run → analyze → next run" cycle requires no manual steps.

04 Implementation Details

HEARTBEAT.md Instead of Cron

OpenClaw's heartbeat system reads HEARTBEAT.md in your workspace on every tick and executes whatever it says. You write monitoring logic in plain English — no cron syntax, no daemon to manage:

## Training Monitor
Check `squeue -u username` for running jobs.
- RUNNING: tail the latest log, report epoch/metric, update Notion
- COMPLETED: read results.json, analyze the curve, ask me about the next version
- FAILED: read the .err log, report the cause, wait for instructions — do not auto-resubmit

Update the file anytime; it takes effect on the next heartbeat.

SSH Callback

At the end of your SBATCH script:

ssh -i ~/.ssh/id_ed25519 user@your-aws-server \
  "python3 /home/ubuntu/pipeline/analyze_and_trigger.py \
    --run v1 --log '$LOG_PATH' --next-run v2" \
  || echo "callback failed (non-fatal)"

Two things to set up before this works:

Generate the SSH key on the cluster, not your laptop. Compute nodes don't have your local key — add the cluster's public key to AWS authorized_keys.
On PSC Bridges2, compute nodes only have /jet/home/username/. The /home/username symlink exists only on login nodes. Hardcode full paths everywhere, including --output and --error in SBATCH headers.

Analysis Script

analyze_and_trigger.py does four things:

1. Parse the training log and extract the loss and metric curves
2. Call an LLM with a structured prompt: current curve + current config → what should change?
3. Parse the response into config fields
4. Submit the next job

The LLM doesn't need to be powerful. It just needs to reason about convergence: is the metric still improving? Is val loss diverging from train loss? Is the learning rate too high? A concise prompt and a fast model are enough.
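Steps 1 to 3 can be sketched in a few functions. This is a minimal illustration, not the actual analyze_and_trigger.py: the log line format, the function names, and the config keys are all assumptions, and the LLM call itself is left out because it depends on the provider.

```python
import json
import re

# Assumed (hypothetical) log line format:
#   "epoch 3 train_loss 0.412 val_loss 0.455"
LINE_RE = re.compile(
    r"epoch\s+(\d+)\s+train_loss\s+([\d.]+)\s+val_loss\s+([\d.]+)"
)


def parse_log(text):
    """Extract (epoch, train_loss, val_loss) tuples from raw log text."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(text)
    ]


def build_prompt(curve, config):
    """Combine the curve and the current config into one structured prompt."""
    rows = "\n".join(f"{e},{tr:.4f},{va:.4f}" for e, tr, va in curve)
    return (
        "Training curve (epoch,train_loss,val_loss):\n"
        f"{rows}\n\n"
        f"Current config:\n{json.dumps(config, indent=2)}\n\n"
        "Reply with a JSON object containing only the config fields to change."
    )


def apply_response(config, response_text):
    """Merge the LLM's JSON reply into the config, dropping unknown fields."""
    changes = json.loads(response_text)
    allowed = {k: v for k, v in changes.items() if k in config}
    return {**config, **allowed}
```

Dropping fields that are not already in the config is a deliberate guard: if the model invents a setting, the next job still starts from a config that train.py understands.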

05 Things That Went Wrong

Job dies in 1 second

The job exited with code 0:53 (signal 53) and no error output, and this happened to multiple jobs in a row. The cause was a bad cluster node, not a code bug. Fix: Always run a short smoke test first (1 epoch, 5% data) before committing to a full job.
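One way to make the smoke test cheap to run is to derive it from the full config instead of maintaining a second file. A minimal sketch, assuming the config is a plain dict with hypothetical keys `epochs` and `data_fraction`:

```python
def smoke_config(config, data_fraction=0.05):
    """Return a copy of the config trimmed down to a quick sanity run."""
    smoke = dict(config)                    # leave the full config untouched
    smoke["epochs"] = 1                     # 1 epoch
    smoke["data_fraction"] = data_fraction  # 5% of the data by default
    return smoke
```

Everything else (model size, optimizer, paths) stays identical, so a smoke run that survives on a node exercises the same code path as the full job.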

SBATCH output path doesn't exist

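Slurm opens the output file before the script runs. If the directory doesn't exist, the job fails immediately with no log to debug from. Fix: mkdir -p the log directory before calling sbatch, for example in whatever wrapper submits the job. A mkdir inside the SBATCH script itself runs too late, because the output file has already failed to open by then.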
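Because the directory has to exist before Slurm opens the output file, one option is to submit through a small wrapper that creates it first. A sketch under assumptions: the function names are hypothetical, and the output path is passed on the command line rather than set in the SBATCH header.

```python
import subprocess
from pathlib import Path


def prepare_log_dir(output_pattern):
    """Create the parent directory of an --output/--error path."""
    log_dir = Path(output_pattern).parent
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir


def submit(script, output_pattern):
    """Create the log directory, then hand the script to sbatch."""
    prepare_log_dir(output_pattern)
    # --output is a standard sbatch option; %j in the pattern expands
    # to the job ID.
    result = subprocess.run(
        ["sbatch", f"--output={output_pattern}", script],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A call might look like `submit("submit_v2.sh", "/jet/home/username/logs/v2/train_%j.out")`, with the path adjusted to your own layout.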

Callback fails silently

If the SSH callback fails, nothing happens: training finishes normally, the next job never submits, and no notification goes out. You won't know until you check manually. Fix: Add || echo "callback failed" so a failed callback doesn't mark an otherwise successful training job as failed, and so the failure at least shows up in the job log. More importantly, test the SSH path before you need it during a real run.
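Testing the SSH path can be a one-off check run from the cluster before the first real job. The sketch below is an assumption about how such a check might look, not part of the original pipeline; the function names are hypothetical, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess


def ssh_check_cmd(host, key_path, timeout_s=10):
    """Build an ssh command that fails fast instead of prompting or hanging."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",                # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        host,
        "true",                               # no-op remote command
    ]


def callback_reachable(host, key_path):
    """Return True if the no-op command succeeds over SSH."""
    result = subprocess.run(ssh_check_cmd(host, key_path), capture_output=True)
    return result.returncode == 0
```

Running this inside a short job on a compute node, rather than on a login node, checks the same key and network path the real callback will use.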

API key expires silently

Hardcoded API key expires, analysis script crashes, no next job, nothing. Fix: Store credentials in AWS SSM Parameter Store and fetch at runtime. Keys expire; scripts end up in git.
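Fetching the key at runtime might look like the sketch below. It assumes boto3 is installed on the EC2 host and that the instance role is allowed to read the parameter; the parameter name in the usage line is hypothetical.

```python
def fetch_secret(name, client=None):
    """Read one SecureString parameter from AWS SSM Parameter Store."""
    if client is None:
        import boto3  # imported lazily so the module loads without boto3
        client = boto3.client("ssm")
    # WithDecryption=True returns the plaintext value of a SecureString.
    response = client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]
```

Usage would be something like `api_key = fetch_secret("/pipeline/llm_api_key")` at the top of the analysis script. Rotating the key then means updating one parameter, with no code change and nothing secret in git.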

06 What Hachimi Actually Did

The Python scripts are deterministic pipeline glue — they do the same thing every run. Hachimi handles the cases where judgment is required:

When I manually cancelled the wrong auto-submitted job, Hachimi caught the version conflict
When the LLM API key expired mid-experiment, Hachimi read the logs directly, analyzed them, and proposed the next configuration without being asked
Throughout the experiment, all I needed to read was the Telegram summary — I never had to keep a terminal open

Scripts handle the happy path. Hachimi handles everything else.

07 Recommendations

Disable automation first. Get the pipeline working end-to-end manually. Add the callback, then the auto-submit. One layer at a time.
Write results.json at the end of every run. At minimum: final metric, best metric + epoch, parameter count. Makes automated analysis much cleaner.
Test the callback chain early. Verify PSC → AWS SSH works before your first real job, not after a 4-hour wait.
Don't hardcode credentials. Keys expire; scripts end up in git.
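For the results.json recommendation, a minimal writer could look like this. It is a sketch with assumed field names, and it assumes that higher metric values are better and that epochs are numbered from 1.

```python
import json
from pathlib import Path


def write_results(path, metric_history, param_count):
    """Write final metric, best metric + epoch, and parameter count as JSON.

    metric_history holds one validation metric per epoch, in order.
    """
    best_index = max(range(len(metric_history)), key=metric_history.__getitem__)
    results = {
        "final_metric": metric_history[-1],
        "best_metric": metric_history[best_index],
        "best_epoch": best_index + 1,  # epochs counted from 1
        "param_count": param_count,
    }
    Path(path).write_text(json.dumps(results, indent=2))
    return results
```

Calling this as the last step of train.py gives the analysis side a fixed set of fields to read, instead of scraping them out of the log.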