How I accidentally started SDD by failing at prompts for six months

The confession

I spent the first six months of serious AI pair programming producing what I now call vibe architecture.

You know the pattern. You open a chat with a strong model. You explain what you want. It produces clean code fast. You feel productive. Three weeks later the repo looks like it was designed by five different people, on five different days, with five different mental models.

Each file is locally correct. The system is globally confused.

I would plan with the model in one session. I would implement in another. By step five the implementation had drifted far enough that the plan was basically historical fiction. Then I would come back after a weekend and lose the thread. Not because the model did something wrong. It did exactly what I asked at each moment. The issue was continuity. Nobody was holding the bar across moments.

That loop repeated across multiple projects, including the first months of building OrKa largely solo. I learned something obvious in hindsight. The problem was not output quality. The problem was the absence of a development system that keeps output coherent over time.

That is when I stopped chasing better prompts and started building better constraints.

Out of that shift, I ended up with a working methodology. People have been calling it Specs Driven Development, or SDD. I do not care much about the name. I care about the behavior it enforces. The constraints do not live in prompts. They live in the architecture around prompts. The AI becomes useful at scale because the process becomes reliable at scale.

The prompt delusion

Prompts are ephemeral. Codebases are permanent.

You can craft a beautiful system prompt. You can say “follow the plan” and “do not add features” and “write tests” and “document decisions”. It will comply. Then context changes. A new chat starts. You switch tools. You paste fewer files. You forget to include one assumption. The model drifts. Not maliciously. Just naturally. Because prompts are not governance. They are conversation.

I call this the prompt delusion. It is the belief that the right wording can produce consistent behavior across time, across sessions, across different tasks, and across different tools.

Humans solved this problem for humans with process and gates. We use linters. We use CI. We use review. We use typed interfaces and invariants. We do not rely on people remembering a paragraph from a handbook.

So I stopped trying to discipline the model with paragraphs. I started to discipline the workflow with structure.

The key idea is simple. Constraints that live in prompts are suggestions. Constraints that live in systems are guarantees.

A lint rule does not drift. A CI gate does not “feel” like doing something else. A review checklist does not forget what you agreed last Tuesday. If you want AI output to stay aligned, you need the same kind of enforcement. You need a development system that makes the correct path the easiest path, and makes the wrong path expensive.

The real 80/20 split

I still work roughly 80/20. About 80 percent of the code that lands in my repo is AI generated in some form. About 20 percent is the part that only I can own.

But the critical nuance is that the 20 percent is not “some code and some tests.” It is not evenly spread. It is concentrated in a few responsibilities that define the quality of the whole.

The human part is architecture decisions. It is domain and business logic validation. It is edge case reasoning when the system meets reality. It is plan approval. It is saying “this is the bar” and keeping it there.

The AI part is scaffolding, boilerplate, repetition, test writing, glue code, refactors that follow explicit constraints, documentation drafts, and implementation of well specified changes.

If you let the AI own the bar, you get speed and drift. If you keep the bar human, and make the AI operate inside a strict process, you get speed and consistency.

That is the stance that shaped everything that follows. AI is not the decision maker. AI is an assistant that plans with you, executes inside scope, and reviews before you ship. You remain accountable. You remain the one holding the bar.

The breakthrough was not “ask for a solution”

Most people use a planner model as a solution vending machine.

They say “design me the architecture” or “give me the best approach” and they accept it because it sounds coherent. That is exactly how vibe architecture happens. The model is skilled at producing plausible plans. It is not responsible for the long term maintenance of your repo. You are.

The shift that fixed my outcomes was this.

I stopped asking the planner for the solution. I started using the planner as a debate partner while I proposed my solution.

That changes the power dynamic. The planning phase becomes a structured argument about trade offs. The plan becomes a negotiated artifact. The human remains the owner of the direction. The model becomes the adversarial collaborator that tries to break your assumptions.

So I now enter planning with a draft approach in my head. Not a fully detailed design. But a real proposal. I state it clearly. Then I ask the planner to attack it. I ask it to propose alternatives. I ask it to enumerate costs I will pay later. I ask it to tell me what I will regret in six months.

Then we iterate until the plan is something I can sign with my name.

This is the part I want to highlight because it is the core of why the method works. You do not outsource judgment. You formalize judgment. The AI assists. The human decides.

The three roles that made it stable

A single AI assistant that plans, codes, and reviews is a liability. It is like letting one person design the system, implement the system, and approve the system. You get blind spots. You get rationalization. You get self confirmation.

What worked for me was splitting the workflow into three roles with hard constraints. Planner. Executor. Reviewer.

The important part is not the labels. The important part is that each role has restricted powers and a strict handoff protocol.

The planner reads and thinks and writes plans. The planner does not write code. Not because you asked nicely. Because it cannot. Tool permissions are restricted.

The executor implements. The executor does not invent new scope. The executor is forced to read the approved plan, list touched files, and execute step by step. If reality requires deviation, the executor stops and escalates. The human decides whether to update the plan or to abort.

The reviewer reviews. The reviewer does not “rubber stamp.” It is forced to ask questions first. What was the goal. What constraints were in place. How was it tested. What is the rollback. Then it reviews against those answers.
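How the restriction is enforced depends entirely on your tooling, so treat the following as a hypothetical sketch rather than a recipe. The point it illustrates is that role permissions live in data the orchestration layer checks before any tool call, not in a paragraph the model is asked to remember. All names are illustrative.

```python
# Hypothetical sketch: role permissions as a table the orchestration layer
# consults before dispatching a tool call. Names are illustrative only.
ROLE_PERMISSIONS = {
    "planner":  {"read_files", "write_plan"},              # thinks and writes plans, never code
    "executor": {"read_files", "edit_code", "run_tests"},  # implements the approved plan only
    "reviewer": {"read_files", "read_diff", "comment"},    # asks questions, never edits
}

def allowed(role: str, tool: str) -> bool:
    """A role may use a tool only if it appears in that role's restricted set."""
    return tool in ROLE_PERMISSIONS.get(role, set())

assert not allowed("planner", "edit_code")  # the planner cannot write code by construction
```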

This separation is not a fancy trick. It is the same principle we use in engineering organizations because it works. It reduces drift. It forces explicit decisions. It keeps a record.

And crucially, it keeps me in the loop where it matters. I do not need to be the typist. I need to be the governor.

The client planning method

Planning works best when you treat it like a client entering a shop with a need, not a solution.

Bad planning starts with premature commitment. “Build me a scraper with browser automation.” You have already picked tooling and complexity before you validated the problem framing.

Good planning starts with intent. “I need structured data for this downstream use. The scope is X. The constraints are Y. The risks are Z.”

Then you debate solutions. You ask why. You cut complexity. You choose what to postpone. You decide what not to build.

This is where I now bring my own proposed approach early.

I will say something like this. I think we can implement a direct HTTP export instead of browser automation. I think we can store the raw payload and defer normalization. I think we can keep one canonical schema and derive views later. I think we should avoid introducing a new dependency unless we can justify it.

Then the planner attacks. It will say what breaks if you defer normalization. It will say what you lose if you store raw blobs. It will point out hidden coupling. It will propose a more robust approach. It will also point out when my instinct is over engineering.

This is not “AI gives me a plan.” This is “I bring a plan and we stress test it.”

One real example locked this in for me.

I was about to implement a data extraction pipeline. The initial AI proposal was browser automation. Headless browser, navigate pages, click export, download per page, retry logic, throttling, session persistence. It was well designed and also absurdly heavy.

I asked one question. Is there a direct export endpoint.

There was. One request. One download. No browser. No per page logic. No category of failure modes that come with automation.
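For contrast, here is a minimal sketch of what the simpler path looks like in Python, with the endpoint, token, and file name as placeholders rather than a real API:

```python
# Hypothetical sketch: one authenticated request to a direct export endpoint,
# streamed to disk. No browser, no per page navigation, no session replay.
import requests

EXPORT_URL = "https://example.com/api/export?format=csv"  # placeholder endpoint

with requests.get(
    EXPORT_URL,
    headers={"Authorization": "Bearer <token>"},  # placeholder credentials
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()  # fail loudly instead of retrying blindly
    with open("export.csv", "wb") as fh:
        for chunk in resp.iter_content(chunk_size=8192):
            fh.write(chunk)
```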

That discovery did not happen because the model is dumb. It happened because planning without a human hypothesis tends to follow the first plausible path. When you present your own approach and force argument, you surface simpler solutions faster.

So the rule became clear. Brainstorming is loose and creative. Execution is strict and disciplined. You iterate freely until you are confident. Then you lock it down.

The .ai folder is the memory that actually works

Prompts vanish. Chats disappear into history. Context windows compress. Tooling changes. You need persistent memory that you can diff, review, and ship with the repo.

So every plan, every changelog, and every decision note lives in a .ai/ folder at the root of the service being worked on.
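The exact layout is up to you. Mine looks roughly like this, with file names shown purely as illustration:

```
.ai/
  plans/
    2025-03-01-export-pipeline.md     # one plan per non trivial change
  changelogs/
    2025-03-01-export-pipeline.md     # written after every execution session
  decisions/
    001-store-raw-payloads.md         # short notes on accepted trade offs
```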

This solves multiple problems at once.

It makes the reasoning traceable. Not in an abstract way. In a concrete way where you can answer “why did we do it like this” with a file path.

It makes onboarding real. A new teammate can read the plans and changelogs and see what the system was supposed to be, what it became, and which trade offs were accepted.

It makes recovery faster. When something breaks, you can inspect the delta between sessions. Not just the git diff, but the intent behind the diff.

It improves the next planning session because the planner can read the past. It stops re-proposing already rejected choices. It stops re-discovering old constraints. It becomes less repetitive and more useful.

If you build agent systems, you will recognize the pattern. This is persistent memory, but in a human readable format. No embeddings. No magical vector store. Just version controlled text that creates institutional memory.

The changelog mandate

The single most valuable practice in this method is the mandatory changelog after each execution session.

Not optional. Not “if you have time.” Mandatory.

Because the changelog is the bridge between plan and reality. Plans are aspirational. Changelogs are factual. The difference between them is where learning lives.

A proper changelog captures what was done, what files changed, what decisions were made during implementation, how it was tested, what remains, and what risks were discovered.
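The format matters less than the fields. A minimal template that covers exactly those points:

```
# Changelog: <date>, <plan name>

## What was done
## Files changed
## Decisions made during implementation, and why
## How it was tested
## What remains
## Risks discovered
```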

The most important part is decisions. Not every decision belongs in the original plan. Reality introduces surprises. You will discover an input you did not anticipate. You will find a dependency conflict. You will learn the data is messier than expected. The executor will make micro decisions. Without a changelog, those decisions evaporate. Later, you will argue about them again. Or worse, you will reverse them without remembering why they existed.

With changelogs, the project stays coherent across weeks. That is what stopped me from losing the thread in solo work. It is also what let AI generated work become safe. Because I had a written record that I could review like an engineer, not like a chat participant.

System prompts as version controlled standards

In this workflow, the repo has a single source of truth for behavioral constraints. A system prompt file at the root.

Think of it as the equivalent of lint and format config, but for AI interaction.

It contains non negotiable architecture constraints, naming conventions, testing requirements, patterns to follow, anti patterns to avoid, and examples of correct usage in this codebase.
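A sketch of what the skeleton of such a file can look like, with the concrete rules left as placeholders for whatever your repo actually enforces:

```
# AI interaction standards (version controlled, changes via PR)

## Non negotiable architecture constraints
- <boundaries that must not be crossed>

## Naming conventions
- <how modules, tests, and config keys are named here>

## Testing requirements
- <what must exist before a change is considered done>

## Patterns to follow, anti patterns to avoid
- <short, repo specific examples of correct usage>
```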

The key point is that it is version controlled. It changes via PR. When standards evolve, you do not rely on people remembering a new convention. The tooling loads the file. The AI sees it. The behavior becomes consistent.

This is not about writing a perfect prompt. It is about writing a living standard that evolves with the codebase.

The plan lifecycle

Plans have states. Draft. In review. Approved. Implemented.

Draft is where debate happens. This is where I push my solution. This is where the planner attacks it. This is where we document trade offs. This is where we choose long term costs consciously, instead of paying them accidentally.

Approved is the gate. Once approved, execution is not creative anymore. It is disciplined. The executor follows the plan. If something is missing, the executor escalates. Either we update the plan, or we stop.

Implemented is not just “code merged.” It is plan satisfied. It is also “what changed from the plan and why” captured in changelogs.

This lifecycle is what stops drift. The plan is not a vague Jira ticket. It is a contract.
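One way to make that contract visible is a small header at the top of each plan file that carries the state and the scope. A sketch, with field names chosen for illustration:

```
status: approved            # draft | in-review | approved | implemented
owner: <the human who signs the plan>
scope: <files and interfaces this change is allowed to touch>
out-of-scope: <what the executor must escalate instead of deciding>
```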

Long term planning without illusion

Here is the tension. You want long term planning. You also want to avoid pretending you can foresee everything.

The way I handle it is to make trade offs explicit, and to separate what must be stable from what can be flexible.

Stable things include public interfaces, data models, invariants, naming systems, dependency boundaries, and failure behavior. If those are wrong, the system rots fast.

Flexible things include internal module structure, some implementation strategies, and performance tuning. Those can iterate.

The planner is useful here, but only if you treat it like a critic. If you let it author the plan alone, it will often over specify. It will propose infrastructure that is impressive and expensive. It will try to be robust everywhere. That is a trap.

When I bring my own approach, I can force a different conversation. I can say I want the minimal stable core now, and extension points later. I can say I want to defer optimization until measurements exist. I can say I want fewer dependencies to reduce future maintenance. Then the planner helps me evaluate the cost of those choices. It does not override them.

This is where I keep the bar human. I decide what “good enough” means for this iteration, and what “must not break” means for the system.

A day in the life

A real session looks like this.

I start with planning. I state the problem. I state my proposed solution. I state constraints. Then I ask the planner to critique and to propose alternatives. We go back and forth until the plan reads like something I would sign.

Then I approve the plan. I switch to execution. The executor reads the approved plan, enumerates touched files, and implements step by step. When reality deviates, it stops. I decide. If needed, we update the plan and continue.

Then we review. The reviewer asks questions first. It checks testing. It checks interface consistency. It checks whether the changes match the plan and the repo standards. It returns actionable feedback.

Then a changelog is written. Then I merge.

The result is that AI contributes heavily to throughput, but it does not own direction. The system stays coherent. The record stays durable. Future me suffers less.

When not to use it

This process has overhead. It is not for typos. It is not for trivial one line fixes. It is not for a quick experiment you might throw away.

But if the work touches multiple files, introduces new concepts, changes data flow, or will need explanation later, the overhead pays back fast.

The heuristic I use is simple. If I would sketch it on a whiteboard before coding, it deserves a plan. If I would just open the file and type, it does not.

Cognitive infrastructure beats prompt engineering

This methodology is the same philosophy I apply when building agent systems.

You do not treat the model as an oracle. You treat it as a component inside a process you can inspect and reproduce.

In agent graphs, relying on a single model decision produces random walks in complex spaces. The fix is structure, scoring, constraints, and observability.

In development, relying on a single prompt produces random walk codebases. The fix is plans, gates, changelogs, and role separation.

In both cases, the win is infrastructure, not magic words.

Getting started without turning it into theater

You can adopt this gradually.

Start by writing one version controlled standards file. Keep it short and specific to your repo.

Then add the .ai/ folder and write one plan for one non trivial change.

Then require a changelog after the session.

Then split roles if your tooling supports it. Remove code writing capability from the planner. Make the executor stop when scope changes. Make the reviewer ask questions first.

The biggest change is not technical. It is psychological.

Stop asking AI to deliver the solution. Bring your solution. Use AI to test it, improve it, and implement it inside constraints. Keep the bar human.

If you do that, the AI becomes what it should have been from the start. A force multiplier that does not erode your architecture.
