How can one person safely change a codebase too large to read?

Through continuous verification, not human memory. Orbyt runs over 400,000 lines, and over 11,000 automated tests run on every commit. No person holds the whole system, so the suite holds the contract instead. It catches any change, human or agent, that breaks something the system must never get wrong before it ships.

Are automated tests the new code review in AI-native development?

Yes. When agents generate code faster than any human can read, human review stops scaling. The test suite becomes the trust layer that verifies every change continuously, at machine speed. It is not a gate you pass once but a steering wheel held the whole time, enforcing what must never break and keeping multiple agents honest.

What is the hardest part of testing AI-generated code?

Deciding what must never break. Agents can write unlimited tests but cannot judge which ones keep you out of court or out of a regulatory audit. That requires knowing the business and the cost of being wrong. At Taxa, one wrong field could trigger an audit, so choosing the catastrophic cases to protect was the entire job.

I Can't Read My Own Codebase. I Ship Daily.

I cannot read my own codebase. Orbyt is over 400,000 lines now. No human can hold that in their head. I cannot, the four agents that write it cannot, and no engineer who joined tomorrow could read it in a year. I ship into it every single day.

Sit with that for a second. A system too large for any person to comprehend, changed daily by one operator and four agents moving faster than any human could review. For most of software history that would have been malpractice.

It is not. What makes it safe is not my memory. It is not my discipline. It is over 11,000 tests that run on every commit before a single line lands. You do not have to take my word for the count. The test suite, the commit log, and the line count are all live on Orbyt’s building in public page.

This is the part almost no one has internalized yet. When execution becomes free, the thing that keeps you alive is not who can write the code. It is who can verify it.

This morning I shipped a change I never read

This morning I changed how Orbyt ranks job matches. Four Claude Code agents were running across four terminal windows. One wrote the new scoring logic. One updated the tests. One swept the call sites that touched the old function. One ran the suite. I read none of the 400,000 lines my change moved through.

I shipped it anyway. By lunch it was in production.

The instinct every engineer is trained on says you read the code, you understand the blast radius, then you change it. I did none of that. And the change was safe, because the suite ran the moment the agents stopped typing and told me, in seconds, that nothing I was not allowed to break had broken.

Read it, then review it. That model is already dead.

For thirty years the safety model in software was human comprehension. You read the code. You reviewed the diff. A senior engineer held the architecture in their head and caught the mistake in the pull request. Review was the gate, and a human eye was the lock.

That model assumed two things. That a person could read the change. And that the change arrived slowly enough to be read.

Both assumptions are gone. Orbyt launched on 243,000 lines on day 32. It is past 400,000 now, and it grows every day I work. No human reviewer holds that. And the changes do not arrive at human speed. Four agents in four terminal windows generate more diff in an hour than a team used to ship in a week.

Scale the old model to that and it snaps. You cannot read what you cannot finish reading. You cannot review, at the speed of one slow human, a stream of changes produced by four fast machines. The gate becomes the bottleneck, and then the gate gets skipped.

Human review was never the safety. It was a proxy for verification that happened to run on a brain. Remove the brain’s ability to keep up and you are left with the thing it stood in for: does the system still do what it must?

The constraint moved. It is no longer authorship.

When I built Orbyt solo in 32 days for roughly $400 with Claude Code across four terminal windows, the scarce thing was never the code. The agents wrote 243,000 lines and 4,124 tests across 852 commits. Writing was free. Writing is always free now.

So the binding constraint shifted. If anyone can produce code, the question stops being can you write it. It becomes can you trust it. Can you know, the moment a change lands, that the system still does what it is supposed to and does nothing it must never do.

That is verification. And verification, not authorship, is now the limit on how fast and how large one operator can safely go.

Most teams still optimize the wrong half. They buy faster code generation and keep verifying by hand. That is like installing a jet engine and steering with a tiller. The output outruns the one thing that was keeping it honest.

Tests are not a gate. They are the steering wheel.

Here is the inversion that took me a while to say out loud. The test suite is not QA. It is not a box you check at the end before release. It is the trust layer the entire operation runs on, and it is the steering wheel I use to drive a system I cannot see all of.

A gate is something you pass once. A steering wheel is something you hold the whole time. A car does not have a steering wheel so it can pass inspection. It has one so you can drive fast without leaving the road. Every commit, over 11,000 tests run. They are not asking permission to ship. They are telling me, continuously, whether the change I just made moved the system somewhere it was allowed to go.

This is what lets one person change a codebase too large to comprehend. I do not need to remember how the matching logic interacts with the outreach pipeline. The tests remember. I do not need to hold 400,000 lines in my head. I need to hold the contract: here is what must always be true. The suite enforces the contract while I work on the part in front of me.

It is also what keeps four agents honest. An agent does not know what it does not know. It will confidently rewrite a function and break an invariant three modules away, with no idea it did. Worse, it produces a plausible wrong answer, which is the most dangerous output an agent can give you, because it looks like a right one. A passing assertion does not get fooled by plausible. It fails the instant the contract breaks, before the change compounds into three more built on top of it.

Tests are the new code review. Not a replacement for judgment. The mechanism through which judgment scales past the size of one human head and the speed of one human pair of hands.

What do you actually test? That is the whole job.

Here the work stops being automatable. A test suite is only as good as the decision about what it protects. You cannot test everything, and testing everything equally is the same as testing nothing. Coverage is not the goal. A codebase can have ten thousand tests and still ship the one failure that ends the company.

My rule, written into the CLAUDE.md that governs every project: every piece of logic that could silently break must be protected by a test. The word that matters is silently. Loud failures announce themselves. The ones that end companies are quiet. The field that saves wrong. The total that rounds the wrong way. The permission that opens by accident.

An agent can write a thousand tests. It cannot tell you which thousand keep you out of court. That requires knowing the business, the stakes, and the cost of being wrong. It is not in the training corpus. That decision is the human’s, and it is the part that does not get cheaper.

Once execution is free, the scarce inputs are judgment, taste, and clarity. Knowing what to reject is the job. In a codebase, rejection has a precise form. It is the test you write to say this behavior is not allowed to change, no matter who or what edits the code next. The agent writes the test. The human decides which test is load-bearing.

In a regulated room, verification is not optional

If this sounds like a luxury for a side project, look at where the stakes are real. When we built Taxa, the AI-native enterprise tax platform, one wrong field could trigger an audit. Not a bad user experience. An audit. There is no cosmetic failure in tax. The system either produces the legally correct number or it exposes a professional’s license and a client’s filing to the IRS.

You do not ship into that on memory. You do not ship into that on a reviewer’s good day. We took Taxa from prototype to production in 5 months and it enabled $113M from KKR and Bessemer, and the only reason that pace was survivable is that verification was the foundation, not the afterthought. “The agent wrote it and it looked right” is not an answer in that room. It is negligence.

Norhart was the same shape. A $70M SEC-registered investment platform with a regulator effectively in the room. When a regulator can ask why a number is what it is, you had better be able to prove the system produces that number every time, not most of the time. A test that runs on every commit is that proof. A human who reviewed it once is not.

Regulated software just makes the universal truth impossible to ignore. Every system has fields that must never be wrong. Most teams have never bothered to name them. The regulated ones are forced to, and once you have, the test suite is simply where that enforcement lives.

Tests do not slow you down. They are the only reason you move this fast.

The objection I hear from engineering leaders is reflexive. Tests are overhead. They slow the team down. We will add them later when things stabilize.

That is backwards, and at scale it is fatal. Over 11,000 tests are not a tax on my speed. They are the entire reason one person can move at this speed across 400,000 lines. Without them, every change would be a guess. I would have to slow to the pace of caution, touch nothing I did not personally understand, and freeze the moment the system outgrew my memory. That is the real brake. Fear is the brake. Uncertainty is the brake.

The suite removes both. I touch any part of the system and it tells me in seconds whether I broke something I cannot see. Four agents generate at full speed because their output hits a wall that does not get tired and does not get impressed. The tests run on every commit through a pre-commit hook. Not when I remember. Not before a release. Every commit, automatically, before anything lands.

Speed without verification is not speed. It is the appearance of speed followed by a recall. The team that ships fast and verifies by hand is borrowing velocity from a future incident, at interest. The bill always comes. Real velocity is verified velocity. By that measure the test suite is not the brake. It is the engine.

Verification is the moat now

When everyone can generate code, the code is not the advantage. The model is the same model your competitor licenses. The features are copied in a weekend. The thing that does not copy is the ability to change a large system at speed without breaking it.

That ability does not arrive in a Git clone. It is the accumulated judgment about which invariants you chose to protect, encoded into a suite that runs every commit and holds four agents to a standard no individual prompt could enforce. A competitor can clone your product. They cannot clone the years of judgment about your specific system. It is governance written as code, and it is the quiet reason one person can outrun a funded team.

So the honest summary of how I operate is this. The agents own execution. I own judgment. The tests are where judgment lives when I am not looking. They let one operator safely run a system far too large for any human to hold, and they keep the agents honest while doing it.

I cannot read my own codebase. I ship to it daily. Both are true, and they are not in tension, because the safety was never my memory. It was over 11,000 tests deciding, on every commit, whether the things that must never break are still unbroken. The code was the easy part. Deciding what was not allowed to break was the job.

That is the new shape of engineering leadership. Not who writes the most code. Who decides, with the most clarity, what the system is never allowed to get wrong.

I Cannot Read My Own Codebase. I Ship to It Daily.

This morning I shipped a change I never read

Read it, then review it. That model is already dead.

The constraint moved. It is no longer authorship.

Tests are not a gate. They are the steering wheel.

What do you actually test? That is the whole job.

In a regulated room, verification is not optional

Tests do not slow you down. They are the only reason you move this fast.

Verification is the moat now

More Articles

Verification Is the New Literacy

Claude Code Is Like No One Saying No to You

The AI-Native Stack You Pick Decides Whether AI Compounds or Stalls