Code Review When Half the Diffs Are From Agents

Code review was invented for a world in which a human wrote each diff, slowly, and another human read it, also slowly. The ratio held. Reviewers could plausibly read every line of every change without becoming the bottleneck. The practice scaled because both sides scaled at roughly the same rate.

Agents break the ratio. The author writes faster than the reviewer can read. If the reviewer’s practice does not change, one of two things happens: the reviewer becomes the bottleneck and the agent’s throughput is wasted, or the reviewer keeps the cadence and starts rubber-stamping. Most teams quietly choose option two and pretend they chose option one.

This is fixable. But it requires admitting that the review practice itself has to change.

What code review is actually for

Strip code review down and it serves a small number of real purposes: catching bugs the author missed, verifying the change does what the PR description claims, sharing knowledge across the team, enforcing standards the tools cannot, and holding contributors to a baseline of care.

When the contributor was a human, all of these blurred together because the same act (reading the diff line by line) addressed most of them. The reviewer would catch a typo, notice the missing edge case, learn how the new module worked, and check that the naming was consistent, all in the same pass.

This worked because human authorship is slow and uneven. The cost of writing the next line was high enough that the author was thinking, which meant the diff carried signal about intent on every line. Reading every line was reading every decision.

Agents do not author that way. They produce a lot of correct-looking code very fast, and the cost of any individual line is low. Reading every line of an agent-authored PR is not reading every decision; it is reading a lot of plausible boilerplate looking for the few places where a decision actually happened.

The shift in review practice has to follow the shift in authorship.

Higher-level questions

The questions that matter on an agent-authored PR are not “does line 47 do the right thing?” They are:

Did the change touch the files it should have touched, and only those? Agents often produce edits in surprising places: files they did not need to modify, configs they decided to “while I’m here” update. The diff scope itself is signal.

Does the test prove the thing it claims to prove? Agents are great at producing tests that pass. They are less great at producing tests that would fail if the code were wrong. A test that asserts the function returned, without asserting what it returned, is a test that adds confidence without adding evidence.

Does the change follow the patterns the rest of the codebase uses? Or does it introduce a new pattern, in a way that suggests the agent did not see, or did not follow, the existing one? New patterns introduced without discussion are the source of most agent-induced drift.

Is the abstraction the change introduces actually needed? Agents have a strong bias toward extracting helpers, adding flags, and parameterizing things. Most of those abstractions are pre-emptive and wrong. The reviewer’s job is to push back on speculative generality.

Does the PR description match the diff? A surprisingly cheap check that catches a real category of problem: the agent solved a different problem than the one stated, and the description was written to match the solution rather than the original intent.

None of these questions require reading every line. All of them require understanding the shape of the change. The shift is from line-by-line to shape-first.

When to read every line anyway

Some changes still warrant the old practice. The categories that earn line-level review:

Changes to security-sensitive code. Auth, permissions, anything touching user data, anything that talks to the outside world. The cost of a missed bug is too high to leave to shape-first review.

Changes to code the team is unfamiliar with. If the diff is in a module nobody on the team has touched in a year, the act of reading it line by line is the knowledge-sharing. Skipping it loses the only chance to update the team’s mental model.

Changes that the harness sensors flag with anything other than green. Linter warnings, test flakes, type errors in adjacent files, and so on. All of these are reasons to slow down and read.

Changes that look too simple. If a one-line diff makes a hard problem disappear, the most likely explanation is that the problem is still there and the diff just stopped reporting it.

For everything else, shape-first is the rational allocation of attention. Pretending you can read line by line at agent throughput is the way you end up rubber-stamping the cases that matter.

Don’t let the agent review its own work

A particular anti-pattern worth naming: using an agent to review an agent’s PR. This is appealing (the throughput finally matches) and it produces almost zero signal. The reviewing agent has the same priors as the authoring agent, the same blind spots, and the same training. It will find typos and miss the architectural mistake. It will praise the test coverage and not notice that the tests do not actually test the thing.

Agent-assisted review is fine. The agent can summarize the diff, flag suspicious patterns, run extra checks, draft a first-pass comment. But the final review must come from a human who reads the shape of the change and asks the questions above. Outsourcing that to another model is outsourcing the only step that was load-bearing.

What the team has to give up

The hardest part of changing review practice is admitting that the old practice (read every line, comment on what you see) was always partly theater. Reviewers caught some bugs, missed many more, and got most of their value from forcing the author to think harder before posting. Agents are not embarrassed by review. They will happily produce another version. The forcing function that human-on-human review provided is gone.

What replaces it is a review practice that does not pretend. Read the shape. Ask the architectural questions. Trust the sensors for the line-level stuff. Slow down for the categories that warrant it. Stop measuring reviewer effectiveness in comments per PR; start measuring it in problems caught before merge.

The point of review is not to read every line. It never was. With agents, that finally becomes obvious.

Total
0
Shares
Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Post

Scaling safe enterprise AI with OpenAI governance frameworks

Related Posts