Long-running agents need more than memory

May 18, 2026

Anthropic’s managed-agent harness solves one hard problem: continuity. Progress logs, feature lists, git checkpoints, and startup scripts give each new session a map of what happened. But continuity is not governance. As agents work across more sessions, the question changes from “did the agent remember?” to “did the agent stay within its architectural constraints?”

In May 2026, Anthropic published a detailed look at how their internal engineering teams use Claude Code as a long-running managed agent. The infrastructure pattern they describe is worth reading carefully: initializer agents that prepare the workspace, feature lists that define remaining work, progress files that record what happened, git commits that preserve recoverable state, startup checks that orient each new session, and end-to-end tests that stop agents from declaring victory prematurely.

This is not prompt engineering. It is operational infrastructure for agents working across many sessions on the same codebase. The problems it solves are real, and the solutions are well-reasoned.

But the pattern solves continuity. It does not solve governance. Those are two different problems, and conflating them is an expensive mistake when designing long-running agent workflows.

Agents as shift workers

The framing that makes the managed-agent pattern click is the shift-worker metaphor. A long-running agent workflow looks less like one developer with a prompt and more like a team of engineers handing work across shifts.

Each shift worker arrives, reads the handoff notes, picks up where the last person stopped, makes progress, and leaves a record for the next person. The work continues across interruptions. The codebase evolves across sessions. No single session owns the full context.

That framing makes the continuity infrastructure obvious. You need handoff notes that are authoritative (progress files), a work queue that persists across shifts (feature lists), recoverable state at every checkpoint (git commits), orientation scripts so each shift starts correctly (startup checks), and pass/fail criteria that the work must satisfy (E2E tests).

Anthropic’s harness provides all five. What it does not provide is the architectural contract that defines what kind of work each shift is allowed to do.

In a real engineering team, that contract exists in ADRs, architecture review boards, code review standards, and the accumulated institutional knowledge of senior engineers. In a long-running agent loop, none of that is automatically present. The harness tells the agent what happened. It does not tell the agent what must remain true.

What the harness gets right

Before addressing the gap, it is worth being precise about what the harness actually solves:

Initializer agent — prepares the workspace before the main agent session begins.
Feature list — a durable queue of remaining work, written as discrete, completable items.
Progress file — a running record of what each session changed, decided, and left incomplete.
Git commits as checkpoints — every meaningful unit of work lands as a recoverable commit.
E2E tests as the victory condition — agents cannot declare a feature complete until the tests pass.

The pattern is good engineering. Each piece of infrastructure corresponds to a real failure mode that long-running agents encounter in practice.

The remaining gap: continuity is not governance

A progress file can tell the next agent: “Here is what I changed.”

It cannot reliably tell the agent: “This architecture boundary must not be crossed. This dependency is forbidden. This ADR supersedes that older decision. This pattern is allowed only in this scope.”

That distinction matters in practice because the questions a progress file answers and the questions a governance layer answers are different in kind, not just degree:

Layer	Question it answers
Progress log	What happened?
Feature list	What remains?
Git history	What changed?
Test harness	Does it work?
Governance layer	Is this allowed?

The managed-agent harness answers the first four questions. It does not answer the fifth. A test suite can verify that the output is functionally correct. It cannot verify that the output is architecturally compliant. Those are different properties, and a codebase can pass every test while accumulating architectural violations.
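A minimal sketch of that difference, in Python. Everything in it is hypothetical: the billing source, the FORBIDDEN_IMPORTS set, and both helper functions are illustrations, not part of Anthropic's harness or of any Mneme API. It assumes an architecture decision that forbids direct network access from a billing module.

```python
import ast

# Hypothetical module source that a session might produce.
BILLING_SOURCE = """
import socket

def total_cents(prices):
    return sum(prices)
"""

# Assumption for this sketch: an ADR forbids direct network access in billing.
FORBIDDEN_IMPORTS = {"socket"}


def functional_test_passes(source: str) -> bool:
    """Stand-in for a test suite: run the module and check its behavior."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["total_cents"]([199, 301]) == 500


def find_forbidden_imports(source: str, forbidden: set[str]) -> list[str]:
    """Return the forbidden top-level modules that the source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            root = module.split(".")[0]
            if root in forbidden:
                found.add(root)
    return sorted(found)


if __name__ == "__main__":
    print("tests pass:", functional_test_passes(BILLING_SOURCE))
    print("forbidden imports:", find_forbidden_imports(BILLING_SOURCE, FORBIDDEN_IMPORTS))
```

The behavioral check passes while the structural check still reports the socket import. The two checks read the same source and answer different questions.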

Agent harnesses preserve continuity. Governance preserves intent.

Why this gets harder as agents run longer

Over many sessions, a long-running agent loop may:

Infer outdated patterns from old code. If earlier sessions used a deprecated pattern, the new session infers that pattern is correct and continues it.
Reintroduce forbidden dependencies. A dependency was removed for a documented architectural reason. A later session adds it back because it solves the immediate problem and the prohibition is not in any artifact the agent reads.
Bypass undocumented conventions. Architecture that exists in institutional memory but not in enforceable documents is invisible to the agent.
Optimize locally while violating system-level constraints. Each session makes a locally reasonable change. The cumulative effect crosses an architectural boundary that no single session was responsible for maintaining.

None of these failures show up in a progress file. None of them cause a test suite to fail. They accumulate silently across sessions and become visible only when the codebase is far enough from its architectural intent that the cost of correction is high.
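The last failure mode, locally reasonable changes that add up to a system-level violation, can be sketched with one possible constraint: the module dependency graph must stay acyclic. The constraint, the module names, and the session edges below are assumptions chosen for illustration. Each session adds a single dependency that is harmless on its own, and only the accumulated graph breaks the rule.

```python
from typing import Optional

Edge = tuple[str, str]  # (importing module, imported module)

# Hypothetical dependencies added by three consecutive sessions.
SESSION_EDGES: list[Edge] = [
    ("orders", "pricing"),     # session 1
    ("pricing", "inventory"),  # session 2
    ("inventory", "orders"),   # session 3 closes the loop
]


def has_cycle(edges: set[Edge]) -> bool:
    """Depth-first search for a cycle in a directed dependency graph."""
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node: str) -> bool:
        state[node] = in_progress
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                return True  # back edge: a cycle exists
            if state[nxt] == unvisited and visit(nxt):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in graph)


def first_violating_session(session_edges: list[Edge]) -> Optional[int]:
    """Return the 1-based session whose edge first makes the graph cyclic."""
    edges: set[Edge] = set()
    for number, edge in enumerate(session_edges, start=1):
        edges.add(edge)
        if has_cycle(edges):
            return number
    return None


if __name__ == "__main__":
    print("first violating session:", first_violating_session(SESSION_EDGES))
```

No single edge is a violation when checked alone, which is why a per-session review of "what changed" does not catch it. The check has to run against the accumulated state.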

The role of governance

Governance sits beside the harness. It does not replace progress logs, tests, or git. It gives the agent a deterministic way to check architectural compatibility at each session boundary and at each commit boundary.

The managed-agent startup sequence, extended with governance:

pwd
git log --oneline -20
cat claude-progress.txt
cat feature_list.json
mneme check --mode warn

Before commit or PR:

mneme check --mode strict

In CI, on every push:

mneme check --mode strict --ci

The framing is important: the harness tells the agent where it is. Governance tells it what boundaries it must respect. Both are necessary. Neither substitutes for the other.
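The warn and strict modes in the commands above differ in whether a violation blocks work. A minimal Python sketch of that distinction follows. The gate function and its exit codes are assumptions for illustration, and the article does not document how mneme itself maps modes to exit codes.

```python
import sys


def gate(mode: str, violations: list[str], out=sys.stderr) -> int:
    """Return a process exit code for a governance check.

    Assumed semantics for this sketch:
      warn   - report violations, never block (exit code 0)
      strict - report violations, block if any exist (exit code 1)
    """
    if mode not in ("warn", "strict"):
        raise ValueError(f"unknown mode: {mode!r}")
    for violation in violations:
        print(f"[{mode}] {violation}", file=out)
    if mode == "strict" and violations:
        return 1
    return 0


if __name__ == "__main__":
    # Hypothetical violation found at session start.
    found = ["services/billing/api.py imports a forbidden module"]
    print("session start:", gate("warn", found))
    print("pre-commit:", gate("strict", found))
```

Under these assumptions the same violation lets a session start proceed but stops a commit, which matches the split between orientation and enforcement described above.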

ADRs as durable intent, not documentation

The governance layer requires a source of architectural authority. In well-run engineering teams, that source is the ADR corpus: Architecture Decision Records that capture not just what was decided, but why, what alternatives were rejected, and what constraints the decision implies.

For most teams, ADRs sit in /docs/adr and are read only when someone thinks to look. They are documentation, not enforcement. A long-running agent will not read them at session start. A commit hook will not check against them.

A governance layer changes this. Rather than reading the ADR folder as a documentation corpus, it compiles the ADR corpus into a decision graph with declared properties:

Which decisions are active, superseded, or deprecated?
Which decision applies to which file, service, or scope?
Which decision is newer and overrides an older one?
Which dependencies or patterns does each decision forbid or require?
When two decisions conflict on the same scope, which one wins?

A long-running agent operating under that system can answer: “Which decision applies to this change, and am I compliant with it?” That is a different question from “What did the progress file say?”, and it requires different infrastructure to answer.
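One way to picture such a decision graph is a small Python sketch. The Decision fields, the ADR identifiers, the glob-style scopes, and the module names are all hypothetical, and this is not a description of how Mneme models ADRs. The sketch assumes that supersession is declared explicitly and that every active, non-superseded decision in scope applies.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Decision:
    adr_id: str
    status: str                       # "active", "superseded", or "deprecated"
    scope: str                        # glob over file paths
    supersedes: tuple[str, ...] = ()  # ids of older decisions this one overrides
    forbids: frozenset[str] = frozenset()


# Hypothetical ADR corpus.
DECISIONS = [
    Decision("ADR-002", "superseded", "services/*", forbids=frozenset({"raw_sql"})),
    Decision("ADR-009", "active", "services/*", supersedes=("ADR-002",),
             forbids=frozenset({"legacy_orm"})),
    Decision("ADR-011", "active", "services/billing/*",
             forbids=frozenset({"http_client"})),
]


def applicable(decisions: list[Decision], path: str) -> list[Decision]:
    """Active decisions in scope for path that no other decision supersedes."""
    overridden = {old for d in decisions for old in d.supersedes}
    return [
        d for d in decisions
        if d.status == "active"
        and d.adr_id not in overridden
        and fnmatchcase(path, d.scope)
    ]


def violations(decisions: list[Decision], path: str,
               imports: set[str]) -> list[tuple[str, str]]:
    """Return sorted (adr_id, forbidden module) pairs for a changed file."""
    return sorted(
        (d.adr_id, module)
        for d in applicable(decisions, path)
        for module in d.forbids & imports
    )


if __name__ == "__main__":
    print(violations(DECISIONS, "services/billing/api.py", {"http_client", "raw_sql"}))
```

In this sketch the raw_sql import is not flagged because ADR-002 has been superseded, while http_client is flagged by the narrower billing decision. The answer comes from declared status and scope rather than from whatever pattern the surrounding code happens to show.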

Where governance checkpoints belong

Governance is not a single check at a single moment. The right enforcement points correspond to the moments of highest leverage:

Session start (warn mode) — before any code is written, load constraints and surface existing violations without blocking work.
Pre-tool execution — block actions that are obviously forbidden before they happen.
Pre-commit (strict mode) — the primary enforcement gate, catching architectural drift before it becomes branch history.
Pre-PR — produces an explainable report of which rules applied, which passed, which failed, and why.
CI — the backstop that enforces team-level architectural contracts on every push.
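The pre-PR report in the list above can be sketched as a function that records, for every rule, whether it applied, whether it passed, and why. This is a hedged Python illustration: the Rule and Finding types, the rule identifiers, and the report format are assumptions, not the output of any real tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    rule_id: str
    scope: str             # glob over file paths
    forbidden_import: str
    reason: str


@dataclass(frozen=True)
class Finding:
    rule_id: str
    status: str            # "pass", "fail", or "not-applicable"
    why: str


# Hypothetical rules derived from architecture decisions.
RULES = [
    Rule("ADR-011", "services/billing/*", "http_client",
         "billing calls go through the gateway"),
    Rule("ADR-014", "web/*", "legacy_orm",
         "web handlers use the repository layer"),
]


def explain(path: str, imports: set[str], rules: list[Rule]) -> list[Finding]:
    """Evaluate every rule and keep the reason for each outcome."""
    findings = []
    for rule in rules:
        if not fnmatchcase(path, rule.scope):
            findings.append(Finding(rule.rule_id, "not-applicable",
                                    f"{path} is outside scope {rule.scope}"))
        elif rule.forbidden_import in imports:
            findings.append(Finding(rule.rule_id, "fail",
                                    f"imports {rule.forbidden_import}: {rule.reason}"))
        else:
            findings.append(Finding(rule.rule_id, "pass",
                                    f"does not import {rule.forbidden_import}"))
    return findings


def format_report(findings: list[Finding]) -> str:
    """Render one line per rule so a reviewer can see what was checked."""
    return "\n".join(f"{f.rule_id}  {f.status:<14}  {f.why}" for f in findings)


if __name__ == "__main__":
    print(format_report(explain("services/billing/api.py", {"http_client"}, RULES)))
```

Recording rules that did not apply, alongside passes and failures, is a design choice in this sketch: it lets a reviewer confirm that a rule was considered and skipped rather than silently missed.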

The harness ensures the agent knows where it is. Governance ensures the agent knows where it must not go. Both are infrastructure. Neither is a nice-to-have for long-running loops.

Conclusion: memory is not enough

Anthropic’s managed-agent harness is well-designed infrastructure for a real problem. Teams building on Claude Code or similar agent systems should study and adopt this pattern.

But a progress file is descriptive, not prescriptive. It records what happened. It does not enforce what must remain true. And as agent loops grow longer, the gap between those two things grows wider.

The next phase of agent infrastructure needs a governance layer — one that resolves competing ADRs deterministically, produces explainable audit traces, and enforces architectural contracts at the boundaries where agents make consequential changes.

Long-running agents need memory to continue work. They need governance to continue work safely. The next generation of agent infrastructure will not just preserve context. It will preserve intent.

That is the layer Mneme is built for.

Originally published at https://mnemehq.com/insights/long-running-agents-need-governance/
