I built a workflow where agents create sprint plans, build features, and test them without me in the loop, all starting from one design conversation. Two skills carry most of the load: phase-planner turns chat into a structured phase plan, and phase-runner executes that plan sprint by sprint until there's something shippable.
What I Was Doing Before
Before I had this workflow, I was the workflow.
I'd design a feature in chat, then manually turn that conversation into a task list, or just start prompting implementation and hope the agent kept context. I'd run npm run check myself after every chunk of work. I'd open the browser to confirm UI changes. I'd update my notes on what was done, what was next, and what had broken. Every new session meant re-explaining the project, re-stating dependencies, re-deciding what "done" meant.
Chat was useful for thinking. It was a lousy execution layer. Conversations don't have task statuses, acceptance criteria, or a record of what shipped yesterday. I was copying context between sessions, sequencing work in my head, and catching regressions because I happened to think of running a test, not because anything in the system required it.
I built phase-planner and phase-runner to automate the parts I'd been doing by hand: translating design into structure, running sprints against that structure, verifying the output independently, and keeping the plan current as work completed.
Step 1: Design in Chat
Everything still starts with a conversation.
I talk through the product or feature the way I would with a sharp engineer: what's the goal, what's in scope, what are the data models, what are the edge cases, what already exists in the codebase that this has to fit into. The AI pushes back. I refine. We argue about complexity.
Scope, stack, and technical approach get decided here, before any structured plan exists. The output of this step isn't code. It's understanding: a shared picture of what we're building and why.
Step 2: Phase-Planner: Chat Becomes a Phase Plan
When the design feels solid, I say something like: use phase-planner to turn what we discussed into a phase doc.
Thinking stops. Structure starts.
Phase-planner is a skill that knows how to write phase plan documents: markdown files with sprints, task tables, acceptance criteria, dependencies, and verification instructions. It takes the messy output of a design conversation and translates it into something an execution agent can follow without asking twenty clarifying questions.
Phase Plan Structure
────────────────────────────────────────────
Phases    Group related work (foundation, feature vertical, polish)
Sprints   Small completable chunks (a few hours to a day each)
Tasks     Concrete implementation steps with module hints
Criteria  What "done" means: testable, not vague
Deps      What must exist before this sprint can run
Verify    How to prove the sprint worked (CLI checks, UI routes)
Phase-planner is not a code-writing skill. It doesn't touch the app. It produces the artifact everything downstream depends on: a plan that survives across sessions, that I can read and edit, and that phase-runner will treat as authoritative.
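To make the shape of a phase plan concrete, here is a minimal sketch of the same structure as Python data classes. It is an illustration only: the real plan is a markdown file, and every name here (Status, Task, Sprint, next_runnable) is hypothetical rather than part of phase-planner.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    TODO = "todo"
    DONE = "done"
    BLOCKED = "blocked"


@dataclass
class Task:
    number: int
    description: str
    module_hint: str = ""  # where in the codebase the work likely lands
    status: Status = Status.TODO


@dataclass
class Sprint:
    number: int
    focus: str
    tasks: list[Task] = field(default_factory=list)
    criteria: list[str] = field(default_factory=list)  # testable "done" statements
    deps: list[int] = field(default_factory=list)      # sprint numbers that must finish first
    verify: list[str] = field(default_factory=list)    # CLI checks or UI routes to inspect

    @property
    def complete(self) -> bool:
        # A sprint with no tasks is treated as not complete.
        return bool(self.tasks) and all(t.status is Status.DONE for t in self.tasks)


def next_runnable(sprints: list[Sprint]) -> list[Sprint]:
    """Return incomplete sprints whose dependencies are all complete."""
    done = {s.number for s in sprints if s.complete}
    return [s for s in sprints if not s.complete and set(s.deps) <= done]
```

The point of the sketch is that "what's next" is derived from the plan itself, not from anyone's memory of the last session.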
Step 3: Review: Where My Attention Actually Goes
I spend the most time here, and no agent does this step for me.
I read the phase plan carefully. Does it match what I actually want, or did the translation from chat introduce features I didn't ask for? Is there unnecessary complexity, tasks that could be cut, abstractions that don't earn their keep? Does the dependency order hold: schema before services, APIs before UI?
I edit the plan directly. I cut tasks. I reorder sprints. I tighten acceptance criteria until they're testable, not vague. I add verification instructions where the agent would otherwise have to guess what to check.
A bad plan executed perfectly still ships the wrong thing. I'd rather spend an hour here than ten hours watching agents build something I didn't want.
Once the phase doc reflects what I want, I'm ready to execute.
Step 4: Phase-Runner: Autonomous Execution
I say: implement phase 6 with phase-runner.
From here, the orchestration skill takes over. Phase-runner reads the phase file, finds incomplete sprints, and runs them in order, or in parallel waves when dependencies and file overlap make that safe. I'm out of the relay loop. The plan decides what runs next.
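One way to picture the wave logic is the sketch below. It assumes each sprint declares its dependencies and the files it expects to touch; SprintInfo and plan_waves are hypothetical names, and this is a guess at the scheduling rule described above, not phase-runner's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class SprintInfo:
    deps: set[int] = field(default_factory=set)   # sprint numbers that must finish first
    files: set[str] = field(default_factory=set)  # files the sprint expects to touch
    done: bool = False


def plan_waves(sprints: dict[int, SprintInfo]) -> list[list[int]]:
    """Group incomplete sprints into waves that are safe to run in parallel."""
    finished = {n for n, s in sprints.items() if s.done}
    pending = sorted(n for n in sprints if n not in finished)
    waves: list[list[int]] = []
    while pending:
        wave: list[int] = []
        touched: set[str] = set()
        for n in pending:
            s = sprints[n]
            # Eligible only if every dependency finished in an earlier wave
            # and no file overlaps with a sprint already in this wave.
            if s.deps <= finished and not (s.files & touched):
                wave.append(n)
                touched |= s.files
        if not wave:
            raise ValueError(f"unsatisfiable dependencies among sprints {pending}")
        waves.append(wave)
        finished |= set(wave)
        pending = [n for n in pending if n not in wave]
    return waves
```

With a strict dependency chain, every wave contains exactly one sprint, which is the sequential case.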
Phase-runner is an orchestrator, not an implementer. It doesn't write application code. It doesn't run tests. It doesn't open a browser. It manages a pipeline of specialized sub-agents, each with a narrow job, and enforces sequencing between them.
The Sub-Agent Roster
Each role has a boundary. Implementation agents write code. Verify agents run CLI checks. Wave-test agents run browser QA. Doc-sync agents update the plan. No role bleeds into another.
Role               Job                                          Returns
─────────────────────────────────────────────────────────────────────────
Implementation     Write code for one sprint's tasks            Structured result block
CLI Verify         Run typecheck, lint, unit tests              Pass / partial / fail
Wave Test          Start dev server, screenshot target routes   Pass / warn / fail
Doc Sync           Update task statuses in the phase plan       Success per sprint
UI Implementation  Same as impl, but reads design system first  Structured result block
The structured result format is the glue. Every implementation agent ends with a machine-parseable block: completed task numbers, blocked tasks with reasons, notes for the next sprint. The orchestrator reads that block. Doc-sync reads that block. Nothing relies on prose interpretation.
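The exact block format isn't shown here, so the following is a sketch under an assumed format: a JSON object between BEGIN_RESULT and END_RESULT marker lines, with completed, blocked, and notes keys. The markers, key names, and the parse_result helper are all hypothetical.

```python
import json
import re
from dataclasses import dataclass

# Assumed format: a JSON object wrapped in BEGIN_RESULT / END_RESULT markers.
RESULT_RE = re.compile(r"BEGIN_RESULT\s*(\{.*\})\s*END_RESULT", re.DOTALL)


@dataclass
class SprintResult:
    completed: list[int]      # task numbers the agent finished
    blocked: dict[int, str]   # task number -> reason it is blocked
    notes: str = ""           # notes for the next sprint


def parse_result(agent_output: str) -> SprintResult:
    """Extract the result block from an agent's final message."""
    match = RESULT_RE.search(agent_output)
    if match is None:
        raise ValueError("no result block found in agent output")
    data = json.loads(match.group(1))
    return SprintResult(
        completed=[int(n) for n in data["completed"]],
        # JSON object keys are strings, so convert them back to task numbers.
        blocked={int(n): reason for n, reason in data.get("blocked", {}).items()},
        notes=data.get("notes", ""),
    )
```

A missing or malformed block raises instead of being guessed at, which keeps the orchestrator from interpreting prose.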
Verification Gates: Agents Don't Self-Grade
Implementation agents do not run their own checks. That separation is the whole point.
An implementation sub-agent is told explicitly: do not run npm run check. Do not open the browser. Do not edit the phase plan. Write the code, return the result, and stop.
A separate verify sub-agent runs CLI checks after implementation returns. On UI sprints, a separate wave-test sub-agent runs browser QA after CLI verify passes. Only when both gates pass does doc-sync update the plan.
If verify fails, phase-runner re-spawns the same sprint's implementation agent with the failure details injected. Not a generic "fix" task, not a separate debugging agent. The sprint scope stays intact. The agent fixes what broke and returns again. Verify runs again. The loop continues until pass or until a retry limit triggers a human checkpoint.
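The gate-and-retry loop can be sketched roughly as follows. The sub-agents are stood in for by plain callables, and run_sprint, the "done" / "needs_human" return values, and the default limit of three attempts are assumptions for illustration, not phase-runner's real interface.

```python
from typing import Callable, Optional

# A gate returns None on pass, or a string with failure details.
Gate = Callable[[], Optional[str]]


def run_sprint(
    implement: Callable[[Optional[str]], None],
    gates: list[Gate],
    doc_sync: Callable[[], None],
    max_attempts: int = 3,  # assumed retry limit
) -> str:
    """Run one sprint: implement, pass every gate in order, then sync the plan."""
    failure: Optional[str] = None
    for _ in range(max_attempts):
        # Same sprint scope every time; a retry just adds the failure details.
        implement(failure)
        failure = None
        for gate in gates:
            failure = gate()
            if failure is not None:
                break  # later gates do not run after a failure
        if failure is None:
            doc_sync()  # the plan is updated only after all gates pass
            return "done"
    return "needs_human"  # retry limit reached: stop and ask for a human checkpoint
```

A data or logic sprint would pass a single CLI verify gate, while a UI sprint would pass CLI verify followed by wave-test.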
Without verification gates, you get agents that mark their own homework and move on while tests are red. I won't trust a workflow that lets implementers self-report success.
What a Phase Run Looks Like in Practice
Recent example, anonymized but real in structure and scale.
I was building a feature for a medical tracking app: replacing a legacy JSON dose-history field with titration periods derived from daily medication logs. The data model had to change, backfill scripts had to run against real patient-shaped fixtures, read and write paths had to migrate, and two new UI tabs had to ship, all without breaking adherence tracking or interval medications.
Sprint  Focus
──────────────────────────────────────────────────────────────
1       Schema: add provenance column to daily logs
2       Pure derivation functions with unit tests
3       Archive legacy data and backfill scripts
4       Run backfill on dev database, reconcile mismatches
5       Write path: titration edits write logs, not JSON
6       Read path: derived titration everywhere in services
7       UI: editable titration history tab
8       UI: activity log tab with source badges
9       QA regression across tracker, history, and event log
Phase-runner ran them sequentially. The dependency chain was strict. You can't build the UI on derived titration until the read path exists, and the read path can't exist until backfill populates the logs.
Sprints 1 through 5 were data and logic work. Each followed the shorter gate path: implement, CLI verify, doc-sync. No browser testing until there was UI to look at.
Sprint 6 was the pivot point. Services started returning derived titration instead of parsed JSON. CLI verify passed, then wave-test opened the medication tracker in a browser and confirmed that historical dates showed the correct dose slots and that backfill rows didn't inflate adherence counts.
Sprints 7 and 8 were UI-primary. Implementation agents read the project design system before touching components. Wave-test checked the titration history chart, modal edit flows, and responsive layout at mobile and desktop widths.
Sprint 9 was explicit regression: confirm the daily tracker still writes logs correctly, confirm interval medications like a quarterly injection still titrate properly, confirm the global event log still merges medication entries.
I intervened twice: once at a phase boundary to review the completed plan status, and once when wave-test flagged a layout issue on a narrow viewport that needed a sprint retry. I did not sit in the loop for every task. I did not copy code between agents. I did not manually run tests.
Why This Works
The phase plan is where state lives.
AI assistants are stateless across sessions. Every new conversation starts from zero. A phase plan file is persistent, versioned, and precise. It answers the questions a new agent would ask: what's done, what's next, what does "done" mean, what depends on what. Phase-planner creates that memory from chat. Phase-runner consumes it. I curate it in between.
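As a small illustration of the plan file acting as state, here is a sketch of what a doc-sync step might do to a task table. It assumes a three-column markdown table of the form "| # | Task | Status |"; that layout, the "done" label, and the mark_done helper are hypothetical.

```python
def mark_done(plan_text: str, completed: set[int], done_label: str = "done") -> str:
    """Set the status cell to done_label for the given task numbers."""
    out = []
    for line in plan_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        is_task_row = (
            line.lstrip().startswith("|")
            and len(cells) == 3
            and cells[0].isdigit()  # skips the header and separator rows
        )
        if is_task_row and int(cells[0]) in completed:
            line = f"| {cells[0]} | {cells[1]} | {done_label} |"
        out.append(line)
    # splitlines() drops a trailing newline, so restore it if it was there.
    return "\n".join(out) + ("\n" if plan_text.endswith("\n") else "")
```

Because the status lives in the file, a fresh session can read the table and pick up where the last one stopped. The sketch ignores task descriptions that contain a pipe character.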
Narrow jobs beat one agent doing everything.
One agent trying to plan, implement, test, update docs, and manage sequencing will drop threads. Sub-agents with narrow jobs and explicit return contracts don't. The orchestrator's only job is sequencing and parsing, work that benefits from a clean context window, not a crowded one.
Separate verify steps are why I trust the run.
I can walk away during a phase run because I know failed tests will loop back to implementation, not get waved through. CLI verify catches type errors and broken unit tests. Wave-test catches layout regressions and flows that compile but don't work in a browser. Doc-sync runs only after every gate that applies to the sprint has passed: CLI verify for every sprint, plus wave-test when there is UI to check.
I want review to be the slow part.
The workflow pushes my attention to the highest-leverage step: making sure the plan is right before execution starts. Everything after that is mechanical, and that's intentional. An hour spent tightening acceptance criteria saves ten hours of agents building the wrong thing confidently.
Agentic Engineering, Not Vibe Coding
I design a feature in conversation, translate it into a structured sprint plan, review that plan until it's right, and hand execution to phase-runner. The orchestration layer runs implementation and verification on its own, sprint by sprint, gate by gate, while the phase plan stays current.
I run this primarily in Cursor: skills, sub-agents, @ invocation. The same pattern works in Claude, or anywhere you can hand an agent a structured plan and enforce contracts between sub-tasks.
What scales is the orchestration: a plan that persists, agents with narrow jobs, and verification that doesn't trust the implementer to self-report success.