
Emilio Carrión

Discipline Doesn't Scale. Verification Needs Infrastructure.

Individual discipline as a quality system is a fragile design. Tests scaled because they became infrastructure. Verification needs to do the same.

Tags: ai engineering, verification, leadership, architecture

Last week, a non-technical person in my circle built a working tool using AI. Not a prototype or a demo: something that did what it was supposed to do, with passing tests and a clean interface.

I took a look and my first reaction was "this is surprisingly good." The second was "I have no idea if this should go to production." Because "it works" and "it's correct" are not the same thing. And the distance between the two is exactly where the problems live.

So I did something different from what I would have done a year ago. Instead of reading the code line by line looking for whatever "felt off" (which is how we've verified things our entire careers), I wrote a verification contract before reviewing anything. Three questions, one per dimension: does it do what it should? Is it well made? Does it fit in the real system?

What caught me off guard was what I would have missed without the contract. On the functional dimension everything was green: the AI had done a clean job. On craft, two design decisions were inconsistent with the project's conventions. Nothing serious, but the kind of thing that accumulates. And on the contextual dimension (the one that always worries me most) I discovered a dependency that under real load would behave differently from what the tests showed. Without the contract, I would have reviewed with my intuition, probably would have caught the craft issue, and almost certainly would have missed the contextual one.

Intuition doesn't scale. Discipline doesn't scale. It never has.

The pattern that keeps repeating

We know we should document decisions. We don't because the sprint closes on Friday. We know we should do pair programming. We don't because it's faster to do it alone. We know code reviews should verify concrete criteria. In practice, we look at the diff, see the tests pass, and hit LGTM.

Individual discipline as a quality system is a fragile design. It depends on someone remembering, having time, having context, not being tired. In a world where generation moved at human pace, it was enough. In a world where AI generates 10x faster, it doesn't cut it.

You know what has scaled? Tests. Not because developers are more disciplined than before, but because tests became infrastructure. CI/CD doesn't scale because people remember to deploy correctly, it scales because the process doesn't depend on anyone remembering. Linters don't scale through discipline, they scale because they're embedded in the flow.


In my previous article I argued that verification is the new core work of engineering and described three dimensions: functional (does it do what it should?), craft (is it well made?), and contextual (does it fit in the real system?). What I didn't say is how to scale that beyond individual willpower.

Now I believe the answer is the same as with tests: verification needs to become infrastructure.

What I built

So I built it. It's called Plumbline. It's verification infrastructure for AI-assisted work. Two phases: one before building, one after.

In the first phase you generate a verification contract. Plumbline analyzes your project (documentation, code, conventions) and produces a document with concrete criteria for what "done" means for that task. Not code. Criteria. Each one classified across the three dimensions and tagged as [auto] (executable by agents) or [manual] (requires human judgment).

A real contract looks like this:

Functional:
- "POST /auth/login returns 200 with valid credentials" [auto]
- "Rate limiting activates after 5 attempts" [auto]

Craft:
- "Auth logic lives in the service layer, not in the route handler" [auto]
- "Naming follows codebase conventions" [manual]

Contextual:
- "All existing tests still pass" [auto]
- "Latency impact of the auth middleware on the hot path has been evaluated" [manual]

In the second phase, after building, Plumbline executes the automated checks, guides you through the manual ones, and produces a report with evidence for each criterion.

All state is Markdown. No database, no server, no configuration. If verification isn't transparent, it's not verification, it's another black box.
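To make the "Markdown as state" idea concrete, here is a minimal sketch of how a tool could parse a contract file into criteria and separate the [auto] checks from the [manual] ones. The file format and function names below are assumptions for illustration, not Plumbline's actual format or implementation.

```python
# Hypothetical sketch: parse a Markdown verification contract into criteria
# and split automated checks from those needing human judgment.
# The contract format below is an assumption, not Plumbline's actual format.
import re

CONTRACT = """\
## Functional
- [auto] POST /auth/login returns 200 with valid credentials
- [auto] Rate limiting activates after 5 attempts

## Craft
- [auto] Auth logic lives in the service layer, not in the route handler
- [manual] Naming follows codebase conventions

## Contextual
- [auto] All existing tests still pass
- [manual] Latency impact of the auth middleware has been evaluated
"""

def parse_contract(text):
    """Return a list of (dimension, mode, criterion) tuples."""
    criteria, dimension = [], None
    for line in text.splitlines():
        if line.startswith("## "):
            # A heading names the current dimension.
            dimension = line[3:].strip().lower()
            continue
        m = re.match(r"- \[(auto|manual)\] (.+)", line)
        if m and dimension:
            criteria.append((dimension, m.group(1), m.group(2).strip()))
    return criteria

criteria = parse_contract(CONTRACT)
auto = [c for c in criteria if c[1] == "auto"]
manual = [c for c in criteria if c[1] == "manual"]
print(f"{len(auto)} automated checks, {len(manual)} manual checks")
```

Because the contract is plain text, the automated half can be handed to an agent or a CI step while the manual half becomes a guided checklist, and the whole thing stays readable in a diff.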

Why code review isn't enough

Code review, as we practice it, wasn't designed for the world that's coming.

Code review assumes the reviewer understands the code being reviewed. Assumes the reviewer has context about the system. Assumes the reviewer has time to review in depth. And assumes the code was written by a human mind with recognizable intentions.

None of those assumptions hold when AI generates the code. The reviewer doesn't always understand what the AI generated. System context lives in the heads of three seniors who aren't always available. Review time hasn't scaled while the volume of generated code has multiplied. And the "intentions" behind AI-generated code are statistical patterns, not design decisions.

I'm not saying code review dies. I'm saying it needs to evolve from "I review what I see" to "I verify against agreed criteria." From subjective opinion to verifiable contract. From a practice that depends on the reviewer's discipline to infrastructure that works even when the reviewer is tired, in a rush, or unfamiliar with that corner of the system.

That's what a verification contract does: it separates "what to verify" from "who verifies." You define the criteria when you have time and context. You execute the verification later, against those criteria, with discipline embedded in the process and not in the person.

What I didn't expect

Back to the story from the beginning. The contract made me a better verifier. I didn't expect that.

Without the contract, I would have reviewed with my intuition. I would have looked at what my experience tells me to look at. And I would have left unreviewed what my experience doesn't cover, which is more than I like to admit.

With the contract, I verified all three dimensions systematically. The functional one was fast and automated. Craft forced me to compare against conventions I've internalized but never written down. And contextual made me think about the complete system before approving a piece.

The invisible heuristics of seniors are valuable. But relying exclusively on them is the same as relying on discipline: it works until it doesn't. What we need is to convert the ones we can into explicit criteria, codify them as verifiable contracts, and reserve human judgment for the third dimension, the one that requires context no system can capture yet.

Craft needs infrastructure

I've spent five articles arguing that in a world where AI generates code, what matters is the judgment of whoever verifies it. That seniors' heuristics are the most valuable infrastructure a team has. That generation is commodity and verification is craft.

Plumbline is an attempt to turn part of that craft into infrastructure. Every criterion you pull from a senior's head and turn into a verifiable contract is one fewer single point of failure.

It doesn't solve everything. The third dimension still needs humans with context. The deepest heuristics remain tacit. But if your quality criteria live only in the heads of three seniors, you don't have a verification system. You have three single points of failure. And in a world where generation accelerates every quarter, those three points are the bottleneck that determines whether your team scales or drowns.

Generation is commodity. Verification is craft. And craft needs infrastructure.

Question for you: If you had to write a verification contract for your team's next feature, what would go in the contextual dimension? The thing only you know, that isn't in any test, and that no agent can discover.

This content was first sent to my newsletter.

About the author

Emilio Carrión

Staff Engineer at Mercadona Tech. I help engineers think about product and build systems that scale. Obsessed with evolutionary architecture and high-performance teams.