The Less I Type, the Better the Output: A Context Framework I Had Claude Design for AI Coding

Fable 5, a Claude model, became available to me for a limited window. I could have spent that window shipping features. Instead I had it build something that would outlast the model itself: a context framework for running AI coding across several models.

The reason is a thing I keep circling back to when I work with coding agents. The less a human does by hand, the better the quality and the speed. Not because people are bad at the work, but because every manual step is a place to forget a constraint or approve something on autopilot.

So this is a design write-up, not a benchmark post. I have not run this on a long project for long enough to put numbers on it. What I can tell you is that using it felt more efficient, and I can show you the exact failure modes it was built to kill. If you want a graph, this isn’t that article. If you keep watching your agent walk off in the wrong direction and you keep rewriting the same prompt, keep reading.

The pain I kept hitting

Before any of this, my problem was simple and expensive. The agent would drift. It would go off on something that wasn’t the point, burn tokens on it, and I would spend real time pulling it back on course.

A lot of that traced back to me. When I write a task by hand, I leave out the parts I think are obvious. The agent doesn’t share my “obviously.” It fills the gap with its own assumption, does something I didn’t ask for, and now there’s rework. Sometimes the thing runs away entirely.

The workaround I found almost by accident: don’t write the prompt myself. Have a model write the prompt, then read it and check it before it runs. That was steadier. This framework is that habit turned into a system.

Watching the same thing happen over and over, I could name four failure modes I wanted gone:

  • the agent going off the rails
  • too many pull requests and re-reviews that never end
  • my own bias pulling the work along, so course-correction comes late
  • approvals turning into rubber stamps once things get automated

One thing up front: the framework’s own docs treat these four as design targets, and they label their effect numbers as hypotheses, not measurements. I’m holding to the same distinction here.

The core idea: the human doesn’t write long prompts

The center of the whole thing is one sentence. The human does not write long prompts. A model writes the prompt for the next step into a file, and the human copies and pastes it.

Day to day, the human does three things:

  1. Open NEXT_ACTION.md to see what’s next.
  2. Copy NEXT_PROMPT.md and paste it into the model it names.
  3. Read only the “needs human confirmation” part, then approve or send it back.

Every model, when it finishes, is required to update NEXT_ACTION.md and NEXT_PROMPT.md. So the loop keeps going on its own. I’m not composing instructions each turn.
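
The "every model must update both files" rule is the kind of thing a small script can check. Here is a minimal sketch, assuming the two files sit directly under _ai_management/ and that modification time is a good enough signal. The function name and the mtime rule are illustrative assumptions, not something the framework's docs specify.

```python
from pathlib import Path

# The two handoff files named in the article.
HANDOFF_FILES = ("NEXT_ACTION.md", "NEXT_PROMPT.md")


def stale_handoff_files(management_dir: Path, turn_started_at: float) -> list[str]:
    """Return handoff files that are missing, empty, or older than the turn.

    `turn_started_at` is a POSIX timestamp taken before the model ran
    (hypothetical bookkeeping; the article does not describe how turns are timed).
    """
    stale = []
    for name in HANDOFF_FILES:
        path = management_dir / name
        if not path.is_file():
            stale.append(name)
            continue
        stat = path.stat()
        if stat.st_size == 0 or stat.st_mtime < turn_started_at:
            stale.append(name)
    return stale
```

An empty list means the loop can continue. Anything else means the last model skipped its handoff and the next prompt should not be pasted yet.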

The files sit next to the repo, not inside it:

/
  AGENTS.md              # short pointer for whichever CLI starts here
  _ai_management/        # the framework files (kept out of git)
  repo/                  # the actual code you git init and push

That split matters: the code you publish stays clean, and the coordination files stay out of your commit history.

The four failure modes, and the design response

1. Going off the rails

This was my worst one. The agent drifts, and course-correction eats the afternoon.

The design response is boring on purpose. The spec is the single source of truth, and a model is not allowed to implement anything that depends on an open, undecided question. That’s a hard stop, not a suggestion. Acceptance criteria can’t be written as “it works correctly”; they have to name the input, the output, and the command that checks them. And each task is routed by how much damage a wrong move would cost, not by which model I happen to like. There are defined routes from light fixes up to research work, with a rule to pick the heavier route when you’re unsure.
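
To make those rules concrete, here is a minimal sketch of the three checks: a criterion must name input, output, and check command; open questions are a hard stop; and the heavier route wins when unsure. The class names, the STANDARD middle tier, and the "at least one criterion" requirement are assumptions for illustration. The article only says routes run from light fixes up to research work.

```python
from dataclasses import dataclass
from enum import IntEnum


class Route(IntEnum):
    """Routes ordered by weight. Only the two ends come from the article."""
    LIGHT_FIX = 1
    STANDARD = 2  # hypothetical middle tier
    RESEARCH = 3


@dataclass(frozen=True)
class AcceptanceCriterion:
    input: str
    expected_output: str
    check_command: str

    def is_checkable(self) -> bool:
        # "It works correctly" fails here: all three parts must be filled in.
        parts = (self.input, self.expected_output, self.check_command)
        return all(part.strip() for part in parts)


def pick_route(candidates: list[Route]) -> Route:
    """When unsure between routes, take the heavier one."""
    if not candidates:
        raise ValueError("at least one candidate route is required")
    return max(candidates)


def may_implement(open_questions: list[str],
                  criteria: list[AcceptanceCriterion]) -> bool:
    """Hard stop if the task depends on any open question or a vague criterion."""
    return (not open_questions
            and bool(criteria)
            and all(c.is_checkable() for c in criteria))
```

For example, `AcceptanceCriterion("empty cart", "HTTP 400", "pytest tests/test_cart.py")` passes, while a criterion with only "it works correctly" as its expected output does not.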

2. Too many PRs, reviews that never end

Push “be critical” too hard and review turns into a game. The count of comments starts to look like the output, so a reviewer manufactures harmless nitpicks and sends the work back again and again.

The response is a review discipline with a few hard rules. Every comment has to carry a concrete failure scenario: which input, in which state, produces which wrong result. If you can’t write that, it goes in an “observations” box and does not block the decision. Only high and medium severity can send work back; low-severity-only means approve with notes. A re-review after a bounce is scoped to “did the previous points get fixed,” plus new high-severity only, so a bounce can’t be stretched into three. Two bounces, then the route changes instead of a third. A request to send work back also has to state repair cost against the damage of leaving it in; if the damage is smaller than the fix, it doesn’t bounce. And the model that wrote the code doesn’t get to be its final reviewer. The last review crosses model families.
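
The review rules above read like a decision procedure, so here is a minimal sketch of one. The field names, the numeric cost and damage values, and the returned strings are assumptions for illustration. The framework's docs may express these rules differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    severity: str                    # "high", "medium", or "low"
    failure_scenario: Optional[str]  # which input, in which state, gives which wrong result
    repair_cost: float               # hypothetical unit, e.g. hours
    damage_if_left: float            # same hypothetical unit
    is_new: bool = True              # False if it restates an unfixed earlier point


def blocks(finding: Finding, is_rereview: bool) -> bool:
    """Decide whether one finding is allowed to send the work back."""
    if not finding.failure_scenario:
        return False  # no concrete scenario: it is only an observation
    if finding.severity not in ("high", "medium"):
        return False  # low severity never bounces
    if finding.damage_if_left < finding.repair_cost:
        return False  # the fix costs more than the damage
    if is_rereview and finding.is_new and finding.severity != "high":
        return False  # a re-review admits new findings only at high severity
    return True


def review_decision(findings: list[Finding], bounces_so_far: int) -> str:
    is_rereview = bounces_so_far > 0
    blocking = [f for f in findings if blocks(f, is_rereview)]
    if not blocking:
        return "approve with notes" if findings else "approve"
    if bounces_so_far >= 2:
        return "change route"  # two bounces already: no third
    return "send back"
```

The point of writing it this way is that comment count never enters the decision. Only findings that survive every filter can block.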

3. My own bias pulling the work along

My input is a mix of two things: decisions I’ve actually made, and opinions I’m still thinking through. If the model treats a half-formed opinion as a decision, my under-baked idea gets built, and quality drops.

The response is a prefix convention. Anything that starts with 決定: (decision:) is a decision. It’s executed without debate and logged. Anything else is an opinion, and the model has to answer with a technical judgment (agree, agree with conditions, or disagree) and give its grounds before it acts. Flattering agreement is banned; it can’t open with “you’re absolutely right.” Disagreement isn’t required either. If my idea is the best one, it says so in a line and takes it. When it does disagree and I override with 決定:, it runs the thing but logs “human decision over model objection” with the reason, so later we can check who was right. Purpose and priorities stay my call; facts and implementation correctness get checked against evidence, whoever said them.
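
The prefix convention is simple enough to sketch in a few lines. This is a minimal illustration: the function names and error strings are assumptions, and only the 決定: prefix and the three verdicts come from the description above.

```python
DECISION_PREFIX = "決定:"
VERDICTS = ("agree", "agree with conditions", "disagree")


def classify(message: str) -> tuple[str, str]:
    """Split human input into ("decision" | "opinion", body)."""
    text = message.strip()
    if text.startswith(DECISION_PREFIX):
        return "decision", text[len(DECISION_PREFIX):].strip()
    return "opinion", text


def validate_reply(verdict: str, grounds: str) -> list[str]:
    """Check a model's reply to an opinion: a known verdict plus stated grounds."""
    problems = []
    if verdict not in VERDICTS:
        problems.append("verdict must be one of: " + ", ".join(VERDICTS))
    if not grounds.strip():
        problems.append("missing grounds")
    return problems
```

Under this sketch, `classify("決定: use SQLite")` is executed and logged, while the same sentence without the prefix has to come back with a verdict and grounds first.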

4. Rubber-stamp approvals

When a person approves machine output, a bias creeps in: “it was generated, it’s probably fine.” Approval goes hollow. A bare “please confirm” makes it worse.

The response is an approval gate format. No free-form “looks good?” A request has to be filled into a fixed block:

## Approval gate
- Approving: <one line>
- Risk: high / medium / low
- Human checks (max 3): <file / where to look / what "OK" looks like>
- Unapproval conditions (don't approve if any one is true):
  - <condition in verifiable form; at least one checkable by a command>
- Unverified: <what the model couldn't verify; writing "none" needs grounds>
- Recommendation: approve / approve after scrutiny / don't approve
- Recommendation: approve / approve after scrutiny / don't approve

(The framework’s labels are in Japanese; this is my English rendering of the same shape.) The unapproval conditions are the load-bearing part: the specific things that, if any one is true, mean do not approve, with at least one written so a command can verify it. If a model can’t write those conditions, the task isn’t ready, and it gets sent back on the spot. The final gate is written by the last model that reviewed the work, not the one that built it, because you go easy on your own output. And every so often, a low-risk “light approval” gets pulled at random for a full check, to see whether the light path has gone hollow.
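
A gate in this shape can be linted before it ever reaches the human. Here is a minimal sketch, assuming the block has already been parsed into fields. The class and field names and the problem strings are illustrative, not the framework's own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnapprovalCondition:
    description: str
    check_command: Optional[str] = None  # set when a command can verify it


@dataclass
class ApprovalGate:
    approving: str
    risk: str
    human_checks: list[str]
    unapproval_conditions: list[UnapprovalCondition]
    unverified: str
    unverified_grounds: str
    recommendation: str
    written_by: str  # model that wrote this gate
    built_by: str    # model that built the work


def gate_problems(gate: ApprovalGate) -> list[str]:
    """Return reasons the gate is not ready for a human; empty means ready."""
    problems = []
    if gate.risk not in ("high", "medium", "low"):
        problems.append("risk must be high, medium, or low")
    if len(gate.human_checks) > 3:
        problems.append("more than 3 human checks")
    if not gate.unapproval_conditions:
        problems.append("no unapproval conditions: task is not ready")
    elif not any(c.check_command for c in gate.unapproval_conditions):
        problems.append("no unapproval condition is checkable by a command")
    if gate.unverified.strip().lower() == "none" and not gate.unverified_grounds.strip():
        problems.append('"none" under Unverified needs grounds')
    if gate.recommendation not in ("approve", "approve after scrutiny", "don't approve"):
        problems.append("unknown recommendation")
    if gate.written_by == gate.built_by:
        problems.append("gate written by the model that built the work")
    return problems
```

Any non-empty result is grounds to send the task back before a human spends attention on it.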

If you read the last three closely (endless reviews, my own bias, rubber-stamp approvals), they’re one problem wearing three faces. The docs put it this way: evaluation gets bent by relationship instead of evidence. The single rule under all of it is the same shape each time. Agreement needs grounds, objection needs a failure scenario, approval needs unapproval conditions.

What surprised me, and where this stops

The honest part. What surprised me wasn’t a clever trick but the scale: I didn’t expect a model to build something this large. It came back as a full folder of specs, logs, and templates.

I also left one thing out on purpose. There’s a benchmark stage in the design, and I chose not to run it. The project keeps qualitative notes only, no metrics. That’s a real gap, and it’s why I keep writing “felt” instead of “measured.” I haven’t run it long enough to claim a number.

So read the four responses above as design intent that lined up with my experience, not as proven results. The failure modes are real, and I’ve hit every one of them. Whether this framework cuts them by some percentage, I can’t tell you yet.

Takeaway

If I compress the whole thing to one line, it’s what I’d say first if a friend asked: cut the amount the human does by hand, and both quality and speed go up. Fewer manual steps means fewer places to slip.

The other half is about what you keep for the human. Don’t pull the human out of approval. Make approval cheap and honest instead: outputs the human can check quickly, a small amount to check, and an explanation attached. In this framework the human’s approval concentrates at just three points: locking the spec, the final sign-off on a diff, and anything irreversible like a push.

Disclosure: the framework described here was designed by Claude (Fable 5). This article was written with AI assistance, then edited and fact-checked by me; the design choices and the experience are mine. Also tagged #abotwrotethis.
