
Stop Measuring Fast. Start Measuring Better

AI is moving PR throughput. But throughput isn't the whole story. When the bottleneck moves, the pressure shows up somewhere else. Measure better, not faster.


Pull requests are quality control

Engineering teams often treat pull requests (a popular peer-review mechanism) as quality control. It's where senior judgment gets applied to code before it reaches production; where context gets checked, trade-offs get surfaced, risk gets noticed, and weaker work gets caught before it escapes.

When the pull request queue becomes a bottleneck, leadership takes notice because it introduces delays into the delivery process.

This bottleneck exists because judgment is scarce. Senior engineers are expensive, and pull requests are one of the places where that scarcity becomes visible.

AI appears to solve the throughput problem

AI is a natural fit for the work around judgment: first-pass review, summarising the diff, pattern matching, test suggestions, boilerplate, and cleanup. So it is not surprising that teams start seeing throughput move.

Liz Fong-Jones describes Honeycomb's peak weekday merges moving from roughly 30 to roughly 74 a day, with important caveats: it is a peak weekday figure, it is entangled with other organisational changes, and it does not cleanly separate net new capacity from substitution. That is exactly why it is useful here. It is a concrete example of local throughput changing fast without pretending the causal story is simple.

From an engineering point of view, this looks like good news. The PR queue gets shorter. More work gets merged. The bottleneck appears to ease.

But it does not disappear. It moves.

The system is not getting worse. It is destabilising

If escape rate stays roughly the same while throughput rises, that does not mean nothing has changed. It means the system is now processing more change while the same proportion of defects gets through.

More change reaches production. Change is the leading driver of incidents, so more change means more incidents in absolute terms. More people have to pick up the slack.

The quality percentage can look stable while the human cost rises.
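To make that concrete, here is a back-of-the-envelope sketch in Python. The 2% escape rate is an illustrative assumption, not a measured figure; the merge counts are the peak-weekday numbers from the Honeycomb example above.

```python
# Minimal sketch of "stable percentage, rising absolute load".
# The 2% escape rate is an assumed, illustrative figure; the merge
# counts echo the peak-weekday example quoted earlier in the article.

def escaped_defects_per_day(merges_per_day: float, escape_rate: float) -> float:
    """Expected defects reaching production per day at a given escape rate."""
    return merges_per_day * escape_rate

ESCAPE_RATE = 0.02  # assumed constant: review quality has not changed

before = escaped_defects_per_day(30, ESCAPE_RATE)  # ~0.6 escaped defects/day
after = escaped_defects_per_day(74, ESCAPE_RATE)   # ~1.5 escaped defects/day

print(f"Before: {before:.1f}/day, after: {after:.1f}/day "
      f"({after / before:.1f}x the absolute load at the same escape rate)")
```

The percentage on the dashboard never moves, but the number of defects landing in production each day roughly two-and-a-half-fold increases, and someone has to absorb that.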

That is not the same as saying AI-assisted PRs are reducing quality. Teams may well be maintaining quality at the point of review. The issue is that the system as a whole is now carrying more production change, and the rest of the chain has to absorb it.

Liz Fong-Jones makes a related point: defects don't have to stay escaped for long. With strong observability, the time between a defect reaching production and the team catching it shrinks. That doesn't change the escape rate, but it changes the cost of an escape. It is one of the ways a system can absorb more change without the human cost rising in lockstep.

Operational load is one example of where pressure shows up. Incident response is another. The pressure does not vanish. It shows up somewhere else — and whether the system can absorb it depends on what else is in place.

This is not a story about a bad system becoming worse. It's about a stable system, bottlenecks and all, being destabilised by a new source of local acceleration and then trying to stabilise again.

What the dashboard misses

When PR throughput rises, the obvious reading is that engineering productivity is improving. The more important question is whether the rest of the system is absorbing that throughput cleanly.

If the dashboard shows PRs climbing, here is what may also be true:

  • Escape rate stays flat
  • More incidents happen in absolute terms
  • Operational load rises
  • Rework climbs as downstream problems surface
  • Senior engineer attention shifts from review to downstream recovery

None of this means the AI-assisted PRs are bad. It means throughput is no longer the whole story.

That is the trap. Throughput can look healthier while the wider system is carrying more strain.

Better matters more than faster

This is why I think the leadership move is better, not faster.

Faster will come anyway. That is what AI does. It reduces the cost of producing change whether we ask for speed or not.

The more interesting question is whether it helps people do better.

In the PR example, the focus should not be on how to push even more work through the queue. It should be on how to make the PR itself better: clearer, better tested, easier to review, more explicit about risk, and more likely to hold up downstream.

If we get better, faster will come with it.

If we chase faster first, we may simply destabilise the system harder.

Capability uplift is what better looks like

Most people agree in principle with building quality in. Split stories well. Make quality shared. Expect software engineers to own at least some of the test automation.

The blocker is bandwidth.

Systems tend to defend their current equilibrium. Not because it is good, but because it is familiar. The status quo is usually the safer bet, even when it is visibly flawed. Under pressure, that tendency gets stronger. So when we ask engineers to build more quality in, the system does not hear "better." It hears "more."

Leadership rarely makes room for that shift. "Slow down now so we can speed up later" is not a persuasive line in a system already being pushed to ship more. Leadership are reluctant to accept that removing waste from a system speeds work up, because the cost of adopting that idea feels too high.

And even if the time existed, testing well is hard. It takes technical skill. It takes product understanding. It takes judgment. It takes holding several mental models in your head at once while still writing code. We are asking a lot. The fair question is whether we are giving people what they need to do it well.

My hypothesis is that AI makes "doing better" easier.

If agents can provide skills, guardrails, heuristics, and examples at the point of work, then people can produce better PRs, think more clearly about risk, and test more effectively without needing thirty years of quality expertise first. That is capability uplift. Not asking more from a stressed system, but lowering the cost of better practice inside it.

Take a staff engineer, principal, tech lead, or quality coach. Someone whose judgment is genuinely load-bearing. Normally, they can only support a small number of teams before context-switching strips the work of depth. AI cuts the cost of context pickup, so their experience can spread further at the same depth.

That is the real productivity gain. Senior judgment spreads further.

What to measure instead

If the goal is better, not faster, then these are the questions senior leadership could be asking.

Not: how many PRs did we merge this sprint? But: are PRs getting better, or just moving faster?

Not: did throughput go up? But: what happened to rework and operational load as throughput went up?

Not: is escape rate stable? But: what is the absolute incident load now that more change is landing in production?

Not: how fast is review turnaround? But: is good judgment being applied in review, or are we mostly accelerating work through the system?

Not: how many engineers do we need? But: are our humans feeling less burnt out, not more?

Not: how many teams can this senior person touch? But: can they support more teams at the same depth because context pickup is easier?

These are the measures that tell us whether the system is getting better, not just busier.

Better before faster

My hypothesis is that we get the productivity and quality gains when we focus our agents on helping people do better, not merely faster.

Faster will come naturally. Better will not. Better requires intent.

If we use agents to make PRs better, make testing easier, make context pickup faster, and spread senior expertise further through the system, then the system has a chance to restabilise at a higher level of capability.

If we use AI mainly to increase throughput, we should not be surprised when the pressure shows up further down the chain.

That is the shift in measurement.

Stop measuring fast.

Start measuring better.


Further reading:

  • 2025 DORA Report: State of AI-Assisted Software Development — Google Cloud / DORA
  • Faros AI: Rework Rate as the 5th DORA Metric — for downstream cost patterns
  • Liz Fong-Jones' talk at Sydney Tech Leaders, May 2026 — for the 30-to-74 peak weekday merge example and its caveats

Disclaimer: No animals were tested in the writing of this article. Tokens? That's a different matter.

From last week

Product owners ask questions. Quality professionals provide information on those questions. Somewhere along the way, we split the work into roles.

It made sense for managing labour. It makes no sense for delivering customer value.

The Split We Agreed to Pay For →
