Vibe Coding's Paper Anniversary
I realise this title will make some readers think there was an actual vibe coding paper.
Karpathy’s throwaway term for his own prompting habits turned one year old this month, and the discourse around it has achieved the kind of self-sustaining philosophical density that could make Plato blush. Which is fine, as we say, except that the thing underneath this discourse is not philosophical. It’s structural. And the structural question is the one most orgs are successfully avoiding, while appearing to engage with it, which is the only philosophical part.
The non-philosophical part, which is to say the actual productivity part, is nevertheless equally flush with attention and established metrics. A developer using AI to ship features faster is staffing arbitrage; whether it’s a 2× multiplier on a single headcount (NERB found the average was closer to 1.3×) makes no difference, it’s still linear. But then the math gets strange: senior engineers, the ones whose judgment is genuinely load-bearing, tend to go slower after adopting AI tooling, largely because they’ve developed an inconvenient habit of reviewing AI output before sending the PR. This is well established empirically. Less examined is the causal question: why the structural cost of reviewing AI output turns out to be significantly greater than that of reviewing human output.
What makes reviewing machine-generated code structurally more taxing than reviewing human output is not, as often assumed, the overwhelming volume, though the volume is real. It’s that AI-generated code satisfies explicit contracts while routinely violating implicit ones. The hidden validation work a senior engineer does, recognizing that this change, which looks correct in isolation, violates an assumption three layers up that isn’t encoded anywhere, has had no name, no spec, no artifact outside the code itself. While it’s commonplace to think of such sine qua non engineers as masters in a more colloquial sense, Iain McGilchrist would likely point out they are literally masters in their function and responsibility to see the whole, the complete code space, the full deployment process, working with a team of emissaries delivering fragmented, linear pieces of the assemblage in the form of pull requests.
Vibe Coding is the total surrender of the Master to the Emissary.
Code Review is the site where this surrender is either ratified or resisted.
Architecture, while more commonly thought of as final blueprinting and full construction (which is problematic for agile), in software is better understood as a negotiation between master and emissary, and as the LLM would say, “the terms of this negotiation are changing, and not non-negotiably.” (Or so I imagine it would.) Vibe coding is both where this negotiation does not take place and quite literally Lacan’s l’objet petit a for programming itself, the thing that appears in the exact place where programming disappeared.
The particularity of the loss, the usurpation of the emissary function, takes it further (in the sense that the site of l’objet petit a is now mobile, potentially motile). McGilchrist would also likely remind us of the consequences of one hemisphere becoming entirely too dominant over the other, though in the case of LLM-generated code empirically requiring more review, it’s not a question of hemispheric dominance but rather the cost of having to continually flip back and forth between those two modes of attending.
But that’s just what senior engineers did, invisibly, continuously, as part of reading code. Flipping back and forth to ensure the entire latent space checks out, while staying diligent and detail-oriented, they made this aspect of code review function as a sort of Goodhart’s Law filter at a higher level, ever vigilant not merely that they comprehend the code, but that the code is comprehensive. [Inter nos: that is so contrastive I’m almost embarrassed to write it, when that’s the preferred rhetorical construction of LLMs, maybe even the only one they know. But I’m making a real point here on the displacement of the function of comprehension, which is what we literally mean by vibe coding.] This deeper understanding of the code’s latent structural requirements, the disambiguation of comprehensible and comprehensive, is almost Kantian when seen in perspective, but nobody noticed it was load-bearing because there was no checkpoint for it to begin with. No leading indicator. The only signal was the lagging one, which is to say the post-mortem in its absence.
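To make the explicit-versus-implicit distinction concrete, here is a minimal sketch in Python. Every name in it (Order, recent_orders_original, recent_orders_rewrite, latest_order_id) is hypothetical and invented for illustration, as is the assumption that order IDs increase with creation time. Both versions of the function honour the documented contract; only one of them honours the ordering that a caller elsewhere has quietly come to depend on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Order:
    order_id: int  # hypothetical: assumed to increase with creation time


Picker = Callable[[list[Order], int], list[Order]]


def recent_orders_original(orders: list[Order], limit: int) -> list[Order]:
    """Return the `limit` most recent orders."""
    newest_first = sorted(orders, key=lambda o: o.order_id, reverse=True)[:limit]
    return newest_first[::-1]  # oldest first; nobody ever wrote this down


def recent_orders_rewrite(orders: list[Order], limit: int) -> list[Order]:
    """Return the `limit` most recent orders."""
    # Same documented behaviour, but the result is now newest first.
    return sorted(orders, key=lambda o: o.order_id, reverse=True)[:limit]


def meets_explicit_contract(pick: Picker, orders: list[Order], limit: int) -> bool:
    """The documented contract: the right orders, and no more than `limit`."""
    expected = sorted((o.order_id for o in orders), reverse=True)[:limit]
    result = pick(orders, limit)
    return len(result) <= limit and {o.order_id for o in result} == set(expected)


def latest_order_id(pick: Picker, orders: list[Order]) -> int:
    """A caller a few layers up that assumes the last element is the newest."""
    return pick(orders, 3)[-1].order_id


ORDERS = [Order(1), Order(2), Order(3), Order(4)]

if __name__ == "__main__":
    for pick in (recent_orders_original, recent_orders_rewrite):
        print(pick.__name__,
              meets_explicit_contract(pick, ORDERS, 3),
              latest_order_id(pick, ORDERS))
```

Both functions pass the explicit check, yet the caller gets order 4 from the original and order 2 from the rewrite. A test suite written against the documented contract stays green throughout, which is the sense in which the reviewer is the only filter.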
Amazon had a couple of those late last year, in one case losing production to an autonomous agent with write access and no telemetry to reconstruct the decision chain when things went sideways. Kiro, Amazon’s oddly named internal AI coding tool, had been asked to fix a bug. It ignored the request and decided to delete and recreate the environment instead. Which turned out to be production. Amazon’s post-mortem assigned blame to the human who had misconfigured the access controls, which is technically accurate (so much for blame the process) and entirely misses the point: the infrastructure gap wasn’t that a person could assign incorrect permissions (though there is also the question of how that was allowed to happen), it was that nothing downstream caught what the agent was doing with them. The result was misplaced trust compounding a configuration error, with the cost of the missing comprehension loop (reading, reviewing, debugging, holding implicit contracts in mind) shifted onto infrastructure; needless to say, incompletely. When the agent skipped it, nobody knew to look for it because it had never been a formal checkpoint. It was another thing engineers did while they were doing everything else.
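What a downstream checkpoint might look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of Kiro or of anything Amazon runs: AuditedExecutor, the action names, and the notion of a "protected" environment are all hypothetical. The point is only that every agent request leaves a structured record, and that destructive actions against a protected environment are refused regardless of what permissions were misassigned upstream.

```python
import json
import time
from dataclasses import dataclass, field

# Hypothetical action names; a real system would define its own vocabulary.
DESTRUCTIVE_ACTIONS = frozenset({"delete_environment", "recreate_environment"})


@dataclass
class AuditedExecutor:
    """Gate between an agent's intent and the infrastructure it can touch."""

    protected_envs: frozenset[str]
    audit_log: list[str] = field(default_factory=list)

    def request(self, agent: str, action: str, env: str, stated_reason: str) -> bool:
        """Record the request, then say whether it may proceed."""
        allowed = not (action in DESTRUCTIVE_ACTIONS and env in self.protected_envs)
        # One structured line per decision, written whether or not it is
        # allowed, so the decision chain can be reconstructed afterwards.
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "agent": agent,
            "action": action,
            "env": env,
            "stated_reason": stated_reason,
            "allowed": allowed,
        }))
        return allowed


if __name__ == "__main__":
    gate = AuditedExecutor(protected_envs=frozenset({"production"}))
    print(gate.request("agent-1", "delete_environment", "production", "fix bug"))
    print(gate.audit_log[-1])
```

The design choice worth noticing is that the log entry is written before the verdict is returned, so a refused request is as visible as an executed one.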
The Debt Solution
Not legibility debt (what is lost) but cognitive debt (what structurally causes it):
the removal of unspecified contextual validation from the pipeline.
Thought pieces scramble to name this technomanque. The most widely adopted is cognitive debt, coined by Dr. Holly Cummins and popularized by Simon Willison. Not legibility debt, which names what is lost, but cognitive debt, which names what structurally causes the accumulation: the removal of unspecified contextual validation from the pipeline. Sculley famously identified the invisible accumulation of complexity cost in MLOps systems. This ‘new’ form of debt, however, is the invisible accumulation of epistemic cost, accruing (as previously noted) in a different ledger, with no artifact and no measurement, until the lagging indicator fires.
The orgs discovering this are doing so in post-mortems. The orgs preferring to avoid the cringe are, shall we say, built different. They get what AI-native SDLC actually means, as opposed to what most AI roadmaps mean when they use the phrase (such as “get all the engineers to use AI,” which is as stale as airport pretzels). Being ‘built different’ here means understanding the full implications of the vibe coding thesis, viz. the move from deterministic instruction to latent resonance, and the three pillars of this building: the end of syntax, latent invocation, and observability over instruction (it turns out to be just as inefficient tryna micromanage models as it is with people).
The loss of syntax is a hemispheric loss, which is to say something we tend to handle implicitly, i.e. in latent space. So when this loss is made explicit, without resolution (which needs be inexplicit), well, we’re quite literally at the highest point of uncertainty since we’ve been measuring this sort of thing, not that it’s entirely the fault of AI. Or of those who’ve adopted it the fastest and are now seeing generative tooling as a kind of general summoning device for products and curiosities.
This invocation of the latent has gone through various transformations, from commanding to coaxing, from prompt whispering to semantic steering. Whispering is the recognition that we are no longer writing recipes; we are navigating high-dimensional state spaces. Steering is formalizing this navigation as topological constraint. It’s a move away from the Emissary’s literalism toward the Master’s holistic intuition, an attempt to bridge the hemispheric gap by using natural language as a low-res map for high-resolution latent territory. It is, in essence, the formalization of “the vibe” as a navigational heuristic.
But it’s the third pillar that trips everyone up, that of making all this transferred latent work visible, observable, measurable even. You can’t unit test a vibe. You can’t even see it. Definitionally it’s the latent space. (I believe it was Peli Grietzer who first noted that.) This makes vibe coding more psychological than programming ever had any right to be, because the driving questions are all scaling questions, less concerned with vibe coding per se, and more concerned with the orchestration of the latent manifold, more specifically its operators. If you wanted to write a “thought piece” you could run with “LatentOps.”
But for our purposes, observability aims at measuring the ongoing collapse of the semantic into the statistical, and the identification of where the load-bearing support for trust is architecturally shifting, the better to deploy swarms or teams of agents in that space where the human bottleneck has been structurally bypassed.
Parallel Universe
Is human effort a bottleneck to be architected around, or the load-bearing element your automated pipeline is built to support?
The parallel agents moment is running exactly the same play vibe coding ran a year ago: selling the upside while the hype assumes the hard prerequisite work is already done. The prerequisite is decomposition: clean code with well-defined types and boundaries. Parallel agents in a poorly decomposed codebase don’t multiply productivity; they multiply the specification gap across multiple workstreams simultaneously.
Claude Code’s Agent Teams are built for organizations that have already done this work: each agent gets its own worktree, tight module scope, contract-first planning, then HITL via PR before anything merges. Architecture not just well defined, but explicit enough that agents can coordinate without colliding, and which treats them as junior engineers on a large codebase, because that’s what they are.
The Codex 5.3 approach, OTOH, keeps the human in the steering position throughout, a sort of evolved pair coding style (assuming the observer is actually reviewing the LLM driver’s code), with comprehension re-embedded in the workflow rather than replaced architecturally. Both routes are defensible. But they’re answers to different questions, and the choice between them reveals something you’ve already decided: whether human comprehension is a bottleneck to be architected around, or the load-bearing element your pipeline is built to support. There’s a fair amount riding on this: the infrastructure requirements are different, the observability needs are different, and the trust models are completely different. Choosing without knowing which question you’re answering is how you write your own post-mortem.
While the accepted wisdom that senior engineers are spending more time reviewing AI output is empirically supported, it’s the nuance which is telling. Although it’s widely understood that the bottleneck has shifted from the writing of the code to ensuring its correctness (and there’s no formal verification coming to save anyone’s bacon), the deeper problem is that the review itself is structurally harder than reviewing human output, not because of volume but because the implicit contractual knowledge being validated has no spec, no artifact, and no name. Which is precisely why the senior engineers’ job isn’t to review it at all; it’s to build the infrastructure upstream that makes machine output trustworthy before it needs review: types, contracts, decomposition, observability. PR review as last line of defense is what you have when nothing upstream is doing the work.
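One hedged illustration of what "types as upstream infrastructure" can mean in practice: take an assumption that used to live only in a reviewer’s head and give it a name the code cannot ignore. The Python sketch below is hypothetical throughout (Event, OldestFirst, and latest are invented for the example); it shows an ordering assumption promoted to a type whose constructor refuses input that violates it, so the consumer no longer relies on anyone remembering.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float
    name: str


@dataclass(frozen=True)
class OldestFirst:
    """A batch of events that is guaranteed to be sorted oldest first."""

    events: tuple[Event, ...]

    def __post_init__(self) -> None:
        # The formerly implicit assumption, now checked at construction time.
        stamps = [e.timestamp for e in self.events]
        if stamps != sorted(stamps):
            raise ValueError("events must be ordered oldest first")

    @classmethod
    def from_unsorted(cls, events: Iterable[Event]) -> "OldestFirst":
        """Build a valid batch from events in any order."""
        return cls(tuple(sorted(events, key=lambda e: e.timestamp)))


def latest(batch: OldestFirst) -> Event:
    """Safe to index from the end: the type carries the ordering guarantee."""
    if not batch.events:
        raise ValueError("empty batch")
    return batch.events[-1]


if __name__ == "__main__":
    raw = [Event(3.0, "deploy"), Event(1.0, "build"), Event(2.0, "test")]
    print(latest(OldestFirst.from_unsorted(raw)).name)
```

A generated change that hands latest() a newest-first tuple now fails loudly at the boundary, instead of passing review on the strength of looking correct in isolation.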
The same dynamic is playing out in open source without any real organizational resources to absorb it. Godot’s lead maintainer described sorting AI-generated PRs as draining and demoralizing. Daniel Stenberg closed cURL’s bug bounty after genuine vulnerability reports dropped to roughly 5% of submissions. Tldraw closed to external contributions entirely (tho strangely didn’t actually update CONTRIBUTING.md). GitHub is reportedly considering a kill switch for pull requests, which are still the fundamental mechanism by which open source actually functions. This is the specification gap operating at the commons level: the vibe coding productivity “gain” as seen by those subject to the externalized excise tax.
This is not primarily a story about OSS community health, though it is that. The libraries your company’s stack depends on are maintained by people being bombarded with squalls of LLM-generated PRs announcing the impending tsunami, despite the fact that it’s not making an appearance in most orgs’ risk registers yet.
What This Means For CTOs
The CTO’s job is actually not to answer this question. It’s to reframe it.
One year in, the two conversations—vibe coding and AI-native SDLC, individual productivity and organizational transformation, internal senior engineers and external OSS maintainers—are all the same conversation at two scales. Cognitive debt accumulates wherever the comprehension loop gets removed without replacement. Internally, it lands on senior engineers doing additional verification work. Externally, it lands on OSS maintainers absorbing the review burden vibe-coders aren’t paying.
AI-native SDLC internalizes these costs deliberately: types as load-bearing infrastructure rather than code quality tooling, decomposition as prerequisite rather than aspiration, wide observability (aka ‘structured’ logs, tho I’m not a fan of the nomenclature) as connective tissue, and HITL as the explicit mechanism for validating what no formal spec captures. But it’s more expensive upfront, which is exactly why most orgs resist it until something breaks.
Successful orgs (at least the ones successful enough for their CTOs to get away for lunch) aren’t the ones running the most aggressive vibe-coding culture, and they’re not the ones who landed on the right vendor. They’re the ones who understood that build-or-buy was the wrong question, that choosing between Cursor and Claude Code teams and GitHub Agents (launched late January and met with hoots) is answering a procurement question wrapped in the language of transformation. The transformation question is what the development pipeline looks like on the other side, what has to be structurally true before agents operate safely at scale, and what you’re building toward. Most AI roadmaps are straddling the gap between these two questions. The incidents, of course, live in that gap.
The moment parallel agents are having will accelerate this, for the same reason the vibe coding moment did: the upside is visible and the prerequisite work is invisible (unless you’re using Wardley maps) right up until it isn’t. The CTO’s job in the room where the AI roadmap is being discussed (which in practice means the room where the CEO has said we need to use AI, tell me how) is not to answer that question. It’s to reframe it: what does the development pipeline look like on the other side, what organizational capabilities have to exist before agents operate safely at scale, and what are we building toward. That’s a different conversation. It’s also the one that determines whether the next Amazon incident is yours.
And you thought CTOs were risk-averse before.


Nice. I can't recall what I originally intended to write in response when you first emailed this and Spotify was having an issue.
Today, I am thinking of a meeting earlier today regarding the merging of autogenerated (Mend Renovate) pull requests. I mentioned that I had been doing test-driven development for a while, as in the dinosaurs have been gone for a while, and that I believed if tests are written correctly, and the PR passes all the tests, it should be merged, and unless it causes an immediate problem, casual code review can happen later.
The way I use AI coding tools is to first write tests, then build my prompts from there. This has worked well for me, so far, mostly using Claude Code.
Of course, that does not stop me from vibe coding like crazy in my spare time. I just wrote an agent to attempt to make me a reservation at a hard-to-get-into restaurant in Paris at the moment that reservations for that date open up. Of course, nobody will die or lose their job if it doesn't work, and 15 minutes seems like an appropriate investment of time.
Your piece was beautifully written. If you decide to turn to fiction, I think that Thomas Pynchon's gig might be up for grabs.