Mad Skills
Where Irreducible Complexity Lives, and What To Do About It
Part Three of Twilight of the MCP Idols
The Gartner hype cycle doesn’t have a dedicated phase or point of “experimentation,” which seems odd at first when you consider that pretty much every technology goes through an experimentation phase. Superimposed on the Gartner model, experimentation likely spans the arc between the peak of inflated expectations, when experiments produce exciting results, and the trough of disillusionment, when those experiments are seen not to return value in excess of their costs, beyond the hint of new possibilities.
Like so many technologies before it, Model Context Protocol has surpassed the point of being an interesting experiment and has become, depending on who you talk to, either integration architecture, enterprise tooling, or dead but just not buried yet.
The Linux Foundation formalized it.
Gateways emerged with real access control.
A .well-known discovery standard appeared.
SEP-1442 replaced session-bound stdio transport with stateless, streamable HTTP.
And at some point, the boundary between what a model does and what a system permits stopped being a prompt engineering problem and became a protocol problem.
I’ve spent a good while now thinking about how we as practitioners are responding to this. Specifically, I’ve been noticing what various teams and companies are doing with skills, those folders of Markdown files and operational guidance intended to shape agent behavior I wrote about in part two. And in the aftermath of the considerable and justifiable excitement of MCP Dev Summit earlier this month in NYC, I think we’re in a genuinely confused moment. The skills ecosystem has grown very fast. But I don’t think anyone would seriously disagree that the thinking about what skills should actually be doing has not kept pace. The hand is, as always, quicker than the eye.
And that gap is where most of the current architectural trouble lives. Though I think the architecture is only *half* the problem.
Running alongside the architectural failure, which is mainly a question of where governance belongs in the system (though it doesn’t completely reduce to that), is an epistemic one: a question of what skills encode, what they silently assume, and how long either can be trusted. The ecosystem is not clearly distinguishing between the two, and I’ve come to think that confusion is its own failure mode, a specific kind of madness that software ecosystems are prone to. It’s not the madness of building the wrong thing, or of building the right thing too slowly. It’s the madness of building something that cannot tell you when it is wrong. A system confident in its own outputs, legible in its own operations, apparently functioning but structurally incapable of recognizing drift. One failure is epistemic: a knowledge problem, a matter of what skills encode and how. The other is architectural: a placement problem, a matter of where computation lives and who is responsible for its limits. Until the ecosystem distinguishes between them, the fixes being proposed will fall short.
At the Threshold
Is this an epistemic problem or an architectural concern?
Let me start with SEP-1442, because I don’t think enough people have absorbed what it actually implies.
The shift from stdio to stateless, streamable HTTP isn’t so much an ergonomics improvement as it is a hard architectural constraint. Statelessness means there are no persistent session guarantees. There is no stable accumulated context at the transport layer. Every action the agent takes must be reconstructible from the request, its metadata, and whatever external systems the agent can reach.
This has an immediate consequence that most skill authors are not reckoning with. Skills, as they are currently built, assume continuity of context. They front-load reasoning into prompts. They behave as if the agent reading them will have stable accumulated state to reason against. SEP-1442 makes that assumption structurally false. The protocol is now thinner than the skill ecosystem assumes.
The response the ecosystem is developing—task primitives, expiry semantics, gateway-mediated long-running tasks—is transparently an attempt to externalize state management entirely. State lives in external systems that can be reliably queried, not in the transport layer. That’s the right instinct, I think. But it means skills cannot be the primary carrier of stateful reasoning. They never should have been.
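To make the constraint concrete, here is a minimal sketch of a handler that keeps no memory between requests and rebuilds everything it needs from the request plus an external store. The names (`TaskRecord`, `TASK_STORE`, `handle_request`) are hypothetical and are not part of MCP or SEP-1442; the in-memory dictionary merely stands in for whatever external system of record a real deployment would query.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    status: str
    expires_at: float  # Unix timestamp after which the record must not be trusted


# Stand-in for an external system of record (database, queue, task service).
# Assumption for illustration only: a real deployment would query a service.
TASK_STORE: dict[str, TaskRecord] = {}


def handle_request(request: dict, now: Optional[float] = None) -> dict:
    """Handle one request using only the request and the external store."""
    now = time.time() if now is None else now
    record = TASK_STORE.get(request["task_id"])
    if record is None:
        return {"ok": False, "reason": "unknown task"}
    if now >= record.expires_at:
        # Expiry is checked on every call because no session remembers it.
        return {"ok": False, "reason": "task expired"}
    return {"ok": True, "status": record.status}
```

Nothing in the handler survives between calls, so any continuity the agent appears to have comes from the store, not from the transport.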
There’s a provocation worth dwelling on here. A trouble worth staying with, to use my favourite phrase. As MCP aligns more tightly with scalable HTTP conventions, the protocol becomes thinner: more transport, less semantics. And if the protocol is thin enough, where does structure live? My answer is that plumbing is what enables everything else, and MCP’s plumbing is not neutral. Gateways, RBAC, audit trails: these are what determine what agents are allowed to do, under what conditions, and with whose authority. That’s governance. And it should be non-controversial, or at least I would hope so, that governance cannot be delegated to a model.⁰
I recognize some people read “thin protocol” as implied criticism, but I don’t mean it that way. Thin protocols are often the ones that survive long enough to matter. HTTP is thinner than CORBA was. That turned out well. A bit too well but let’s not go there. The point is that thinness at the protocol layer concentrates structural weight somewhere else, and in MCP’s case that somewhere else is the gateway and the skill. Understanding which weight belongs where is the architectural question the ecosystem has not yet answered cleanly.
The Control Plane
The single most underappreciated architectural move of the past year is what the gateway actually does.
Consider the GitHub MCP server. It exposes somewhere north of a hundred tools, designed for human-centric workflows. Hand that to an agent without filtering and the agent’s reasoning collapses. The model falls over trying to attend to the full tool surface. The industry’s response to this was not to reduce the number of tools available globally; it was to filter the tools the agent sees before reasoning begins. The gateway scopes the context; the agent reasons well within that scope.
I’ve heard people describe this as a workaround. But it’s not a workaround. It’s architecture. The decision about which tools reach the model is a governance decision, not a reasoning decision. Putting that decision in the gateway is precisely correct.
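As a rough illustration of scoping before reasoning, the sketch below filters a tool catalog by role so the model only ever sees the permitted subset. The role names, scope strings, and tool names are invented for the example and do not describe the GitHub MCP server or any particular gateway product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    required_scope: str


# Hypothetical policy table: which scopes each role holds.
ROLE_SCOPES: dict[str, set[str]] = {
    "triage-bot": {"issues:read", "issues:write"},
    "release-bot": {"repos:read", "releases:write"},
}


def visible_tools(role: str, catalog: list[Tool]) -> list[Tool]:
    """Return only the tools the role may see; unknown roles see nothing."""
    scopes = ROLE_SCOPES.get(role, set())
    return [tool for tool in catalog if tool.required_scope in scopes]
```

The filter runs outside the model, which is the point: the decision about what reaches the context window is made by policy, not by the agent's own reasoning.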
This gives us a clean way to think about what MCP’s maturation actually is. There is a control plane and a data plane. The control plane is the gateway, the RBAC policies, the audit trail, the expiry semantics, the workload identity. The data plane is what the model reasons about, synthesizes, and executes against. MCP’s maturation is the story of these two planes being correctly separated. The architectural error the ecosystem keeps making is conflating them: asking skill files to carry governance weight they cannot reliably bear.
The move away from stdio is also a security move, not an ergonomics move. Workload identity, rather than just API keys. The gateway is where runtime identity lives. A skill file can suggest permissions. It cannot enforce them. That distinction matters enormously in any environment where you actually care about what the agent did, who authorized it, and how you’d know if something went wrong.
.well-known discovery is worth a closer look as well. The idea is that an agent should be able to reason its way into the right tool call from server metadata alone, without needing a persistent connection to understand what a server can do. That’s a genuinely interesting capability if it works. The question I keep coming back to is how much faith we should place in static metadata when what we often need is a live source of truth. Progressive disclosure of capabilities through metadata may be good enough for the common case. For the hard cases (dynamic environments, rapidly changing APIs, high-stakes operations), I’m not convinced.
What the control plane / data plane separation really gives you, beyond operational correctness, is a place (tbh: the place) to reason about the system as a whole. When something goes wrong in a system where the gateway enforces the operational envelope, you can ask: was this action within the authorized scope? Was it audited? Can I reconstruct the chain of authority? Those questions have answers. When the governance lives in the skill file, when you’re relying on a Markdown document to keep the agent from doing something it shouldn’t, those questions don’t have answers, certainly not reliable ones. The skill is a suggestion. The gateway is a constraint. As an industry we’ve not always been clear about which we’re building.
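The difference between a suggestion and a constraint can be sketched in a few lines. In this hypothetical `Gateway` class (an illustration, not any real gateway's API), every call is checked against policy and every decision is written to an audit log, so the three questions above can be answered from records rather than from a Markdown file.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    # Hypothetical policy: principal -> set of tool names it may invoke.
    allowed: dict[str, set[str]]
    audit_log: list[dict] = field(default_factory=list)

    def call(self, principal: str, tool: str, authorized_by: str) -> bool:
        """Decide whether the call may proceed and record the decision either way."""
        permitted = tool in self.allowed.get(principal, set())
        self.audit_log.append(
            {
                "principal": principal,
                "tool": tool,
                "authorized_by": authorized_by,
                "permitted": permitted,
            }
        )
        return permitted
```

Denied calls are logged as well as permitted ones, since reconstructing the chain of authority usually matters most when something was refused or should have been.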
The Skills Explosion Is a GIGO Problem
Writing a SKILL.md is easier than building an MCP server.
I’ve been tracking the numbers in the wild skill ecosystem, and they are extraordinary, to say the least. By the most recent counts, there are upwards of half a million skills in the wild. A single command can install 1,234 of them (the Antigravity library, sitting at 22,000 stars at last check). The SDK has racked up 97 million downloads.
The skills explosion happened because there was a real problem. Agents given undifferentiated access to large tool surfaces fall over. The engineering response to context bloat was to externalize judgment into folders, as I wrote about extensively in part two. Write down how to use these tools, in what sequence, under what conditions, and let the agent read the instructions, rather than reconstructing them from scratch each time. Which was not unreasonable; we do something structurally similar with runbooks, with onboarding materials, with documentation generally. And the format’s winning wasn’t accidental; a Markdown file is reviewable, versionable, diffable, and, unlike logic embedded in an execution environment, it can be handed to a regulator, an auditor, a new team member.
The result is that people are dumping unorganized information, half-formed logic, and undocumented assumptions into Markdown files, because it’s the path of least resistance, because writing a SKILL.md is easier than building an MCP server. The GitHub server story repeats itself here, one abstraction layer up. Every new skill added to a registry is a new token in a sequence the model must attend to. The skill folder was supposed to solve tool sprawl. In practice, it is tool sprawl, wrapped in Markdown.
Most skills, like so many engineers, tend to be overconfident: they assert things rather than encoding uncertainty. They have fake completeness: the reasoning looks finished, but the actual judgment is not inside the artifact. They treat perishable knowledge as permanent: APIs change, authentication flows evolve, field names shift, best practices are revised, and the skill file tracks none of this. They are opaque: no metadata standards, no documented scope, no expiry semantics. The skill looks like a finished product. It is often a compressed assumption with no forwarding address.
I find the “no forwarding address” framing useful. When an engineer leaves an organization and takes their understanding of a system with them, the organization loses something real. Skills were partly conceived as a way to capture that understanding. But institutional knowledge written down without being bounded, documented, and kept current is not a knowledge base. It is a time capsule. Time capsules are only useful if you know when they were buried. And crucially, if the agent reading it can tell whether the capsule has expired.
The failure is not unique to skills. We went through a version of this with documentation generally, and before that with UML models, and before that with requirements documents. Each time, the promise was that we could capture enough of the reasoning to make the artifact reliably useful. Each time, the artifact drifted from the reality it was supposed to describe, and nobody had a good mechanism for surfacing the drift. Skills are repeating this pattern, at higher speed and in a context where the consumer of the artifact is an automated system rather than a human reader. A human reader notices when documentation smells stale. An agent reasons from it with full confidence regardless.
Missing Expiration Dates
The right mental model for a skill is not a document. It is a cached query result.
Not all knowledge has the same half-life. This seems obvious when stated plainly. The surprising thing is that the entire skill ecosystem currently behaves as if in denial.
Consider two categories of knowledge that end up in skill files. On one side: architectural decisions, deployment patterns, organizational conventions, how to interpret a class of results, pathological failure modes to avoid. These have long half-lives. They may stay relevant for years. The judgment they encode is genuinely durable.
On the other side: API versions, endpoint schemas, field names, rate limit behaviors, authentication flows, specific rule implementations in domain-specific systems. These have short half-lives. Some change monthly. Some change faster. An agent reasoning from a six-month-old skill file’s description of an API endpoint is not reasoning from knowledge. It is reasoning from a guess at an answer that’s most likely moved on already.
Consider a concrete case from the real estate domain that illustrates this well. Flexmls and Marketo are both systems with rapidly changing factual content: MLS rules, field definitions, authentication flows. Grounding agents in live data sources for that kind of knowledge keeps them current in a way a written file simply cannot. The factual, rapidly changing layer should be retrieved at runtime, not encoded in a skill. But the operational judgment about how to use that data (how to structure a query, how to interpret ambiguous results, which failure modes to watch for) has a longer half-life. That judgment belongs in a skill, in principle. The problem is that a skill currently has no way to say which of its contents falls into which category.
What I’d call a Skill-Expiry model would make this explicit. Imagine YAML frontmatter on a skill that declares the knowledge half-life per section: “this was accurate as of version X; verify the authentication flow before acting; the query structure here is likely stable for 18 months.” Explicit uncertainty boundaries. Something like the .well-known discovery standard, applied to skills rather than MCP servers. A skill that encodes not just instructions but limits.
Like many of my ideas, this is less radical than it sounds. We already do something like this with certificates. A certificate says not just “I am who I say I am” but also “and this claim is valid until this date.” We do it with cache headers. We do it with package version locks. The idea that a piece of knowledge should carry information about its own expected durability is not exotic. It is just not something the skill ecosystem has thought to apply yet.
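Here is one way the Skill-Expiry idea might look once the frontmatter has been parsed, sketched in Python. No such standard exists today, so the field names (`section`, `verified_on`, `half_life_days`) are assumptions, and for simplicity the half-life is treated as a hard trust window rather than a gradual decay.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SectionClaim:
    section: str         # which part of the skill this claim covers
    verified_on: date    # when the content was last checked against reality
    half_life_days: int  # assumed durability, treated here as a hard window

    def is_stale(self, today: date) -> bool:
        return today > self.verified_on + timedelta(days=self.half_life_days)


def sections_to_verify(claims: list[SectionClaim], today: date) -> list[str]:
    """List the sections an agent should re-verify before acting on them."""
    return [claim.section for claim in claims if claim.is_stale(today)]
```

With an authentication-flow section given 30 days and a query-structure section given roughly 18 months (540 days), an agent reading the skill two months after verification would be told to re-check the first and could still rely on the second.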
The closest thing to this instinct I’ve seen in practice is what Cloudflare has done with Code Mode: a crude but directionally correct attempt to inject runtime humility into a static layer. The skill should know what it doesn’t know. Almost none of the skills I’ve reviewed do. Overconfidence combined with no expiry semantics produces silent failure at scale. The agent behaves correctly until the underlying reality shifts, and there is nothing in the system to flag the discrepancy.
The Cloudflare instinct points at something deeper. The right mental model for a skill is not a document. It is a cached query result. Cache invalidation is, despite Phil Karlton’s famous bromide,¹ a solved problem, at least conceptually. We know how to think about staleness, about TTLs, about the conditions under which a cached result should be considered untrustworthy. Applying that mental model to skills would be more useful than continuing to treat them as authoritative sources of ground truth.
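The cache analogy can be made literal. The sketch below wraps a fact in a TTL and goes back to a live source once the TTL has elapsed; `CachedFact` and its parameters are hypothetical names chosen for illustration, and the `now` argument exists only so the behavior can be exercised without waiting.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class CachedFact(Generic[T]):
    """A value that is trusted for ttl_seconds and then refetched from its source."""

    def __init__(self, fetch: Callable[[], T], ttl_seconds: float) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    def get(self, now: Optional[float] = None) -> T:
        now = time.monotonic() if now is None else now
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Go back to the live source instead of trusting the stored copy.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A skill treated this way would never be read as ground truth after its window closes; the worst case becomes an extra lookup rather than a confident action on stale data.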
Where the Brain Should Live
The 98% context reduction that comes with a PortOfContext model is not an optimization. It’s evidence of a different architecture.
In part one I elaborated a tripartite architecture composed of brain, passport, and hands, which naturally invites two questions: where skill files sit in that picture, and whether the problem with current skills is epistemic (a matter of knowledge and judgment) or architectural (a matter of where computation lives and who is responsible for its limits). In practice it is often both, in different proportions for different situations. But people are not making the distinction deliberately. They are defaulting to more skill, because more skill is easy to produce. Like, scary easy.
The architectural answer is seductive, and I want to take it seriously before I take its problems seriously. A system that synthesizes its operating logic on demand is adaptive in a way a static skill file cannot be. It responds to API changes, shifting domain requirements, the particular context of each operation rather than the generalized context assumed when the skill was written. The 98% context reduction that comes with a PortOfContext model is not an optimization. It’s evidence of a different architecture, and a better one in the ways that matter most for performance.
But here is what the architectural answer cannot do: it cannot be read. If the contract governing agent behavior lives in the execution environment, in gateway policies, in RBAC rules, in dynamic synthesis, then the contract is, in a meaningful sense, invisible. You can observe what the agent did. You can instrument the execution. What you cannot do is point to the document that authorized it. In regulated industries, that distinction is hardly academic.
So the question is not which answer is correct in the abstract, but which failure mode you are more willing to live with. The epistemic failure of skills that decay without knowing they have decayed? Or the architectural failure of execution layers that govern without leaving an inspectable record of how? While both failures are real, the ecosystem is currently behaving as if only one of them is.
What genuinely belongs in skills, then? Operational judgment with a long half-life and no better runtime home. Coordination contracts: this is how we do things here, the shared signal every agent in the system needs in order to behave coherently with every other. Institutional knowledge that would otherwise leave the building with the engineer who held it. But only if that judgment is bounded, documented, and expiry-aware. Much of what currently lives in skill files shouldn’t be stored at all, but should be generated at runtime and discarded after use. I’d like to think the ecosystem is slowly arriving at this principle.
I find the PortOfContext model useful here: the idea that context is something you construct for each operation, rather than something you accumulate and carry. Work on very large agent systems has demonstrated 98% context reduction when moving to this model. That’s not an optimization, that’s a different architecture: thin skills, rich execution environments.
There is an underused observation here that I think the field will eventually take seriously. Models are already reasonably good at recognizing when a skill’s abstraction is wrong for the task at hand. That capacity is currently wasted. If skills encoded their own uncertainty, agents could flag skill staleness, giving us a closed-loop system where claims are continuously checked against reality and policy. That would be genuinely useful. It requires skills to know what they don’t know.
It also requires a willingness to treat agent feedback as a signal about skill quality, not just task completion. When an agent consistently reasons around a skill rather than through it, that is information. When an agent requests clarification that the skill should have provided, that is information. Currently, most teams are not capturing this signal systematically. The skill authors and the agents are not in conversation. They are not even in the same room. (Briefly, some of them were, at the MCP Dev Summit.) That’s an organizational failure as much as a technical one.
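Capturing that signal does not need much machinery. The sketch below counts two kinds of event per skill and surfaces the skills that cross a review threshold; the signal names, the `SkillFeedback` class, and the threshold value are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical signal names for the two behaviors described above.
SIGNALS = {"reasoned_around", "clarification_requested"}


class SkillFeedback:
    """Count agent feedback per skill and list skills that need a human look."""

    def __init__(self, review_threshold: int = 3) -> None:
        self._counts: Counter = Counter()
        self._threshold = review_threshold

    def record(self, skill: str, signal: str) -> None:
        if signal not in SIGNALS:
            raise ValueError(f"unknown signal: {signal}")
        self._counts[skill] += 1

    def needs_review(self) -> list[str]:
        return sorted(s for s, n in self._counts.items() if n >= self._threshold)
```

The output is a review queue for skill authors, which is one modest way of putting them back in the same room as the agents.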
The binary that practitioners often reach for, just-in-time synthesis versus just-in-case documentation, is not quite the right frame. The real question is per-artifact: which mode does this particular piece of knowledge call for? Engineers are not currently asking that question. They’re defaulting to documentation because documentation is easier to produce and easier to review. That default is producing a lot of time capsules.
Skills Are Contracts, Not Artifacts
The strongest argument for skills is one I don’t hear made often enough.
And it’s not the one practitioners usually reach for.
People tend to justify skills as knowledge management: they capture what the engineer knows, so the agent can use it. That framing is fine as far as it goes. It just doesn’t go far enough. And it undersells what a well-built skill actually does. A skill is not just knowledge. It is a shared contract. It tells every agent interacting with a system: this is how we do things here.
That coordination function is very hard to achieve through on-the-fly synthesis. On-the-fly synthesis produces locally correct behavior; the agent, in this moment, in this context, does the right thing. Contracts produce system-wide coherent behavior. Those are genuinely different things. If you have multiple agents interacting with a complex system and you want them to behave consistently—not just individually correct but mutually coherent—you need something that functions as a shared contract. A skill file can do that; dynamic synthesis generally cannot.
There are alternative locations for contractual logic worth considering. MCP Apps can constrain interaction through UI, which makes the constraint harder to circumvent. Gateways can define operational envelopes, which are enforced rather than merely suggested. These approaches can encode contractual logic without relying on static files. But they are less inspectable. A SKILL.md can be reviewed, versioned, diff’d, and audited. Dynamic synthesis and UI-mediated constraints make reasoning harder to observe.
The inspectability argument is not trivial, and I suspect it is underweighted in most current discussions, let alone most deployments. In regulated industries there is a genuine governance requirement to be able to read the contract. If the contract lives in the gateway, who can actually examine it? If it lives in the model’s reasoning, no dashboard surfaces it. Architectural observability is a governance requirement, not a nice-to-have. The skill file’s transparency is its primary virtue. It is the thing that distinguishes it from an opaque execution layer.
This means the current failure modes are particularly bad. Hidden judgment: the skill encodes an assumption the operator doesn’t know is there. Silent failure: the skill’s knowledge decays without triggering any alert. The system behaves incorrectly, but nothing in the architecture surfaces the discrepancy. A few tools are beginning to address this (mcp-server-diff, mcpdiff, agnix’s 365-rule linter, as I’ve highlighted previously) but none of them is sufficient. They are early attempts to make the problem tractable. The problem is not yet tractable.
There is an analogy to financial contracts that I find clarifying. A financial contract is only as good as the enforcement mechanism behind it. A contract with no enforcement mechanism is not really a contract. It is an expression of intent. Many current skill files are expressions of intent: they say how the system should behave, but there is nothing in the architecture that guarantees behavior conforms to the description. The gap between the described behavior and the actual behavior is invisible unless something goes wrong. In finance, we have auditors and regulators to close that gap. In the skill ecosystem, we have nothing equivalent yet. The agnix linter is a smoke detector; it is not a fire suppression system.
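One small step toward closing the gap between described and actual behavior is to compare what a skill says exists with what the live system reports. The function below is a generic set comparison written for this article; it does not describe how mcpdiff, mcp-server-diff, or agnix work, and the field names in the usage note are made up.

```python
def schema_drift(declared: set[str], live: set[str]) -> dict[str, list[str]]:
    """Compare field names a skill describes with field names the live system reports."""
    return {
        # The skill mentions these, but the live system no longer has them.
        "stale_in_skill": sorted(declared - live),
        # The live system has these, but the skill never mentions them.
        "unknown_to_skill": sorted(live - declared),
    }
```

Calling `schema_drift({"ListPrice", "MlsStatus"}, {"ListPrice", "StandardStatus"})` reports `MlsStatus` as stale and `StandardStatus` as unknown to the skill. Like the linter, this is detection rather than enforcement: it tells you the description has drifted, not what to do about it.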
The Pale Multi-Agent Horizon
A2A (agent-to-agent interaction) and the Universal Commerce Protocol are not hypothetical futures. They are where the field is heading.
When agents are primarily interacting with tools, the main question is whether the agent can reason well within the tool surface. When agents are interacting with each other, negotiating, delegating, decomposing tasks across multiple autonomous systems, the question becomes whether the system as a whole is legible and governable. MCP provides that legibility; it makes the system readable; it makes the interactions auditable. At agent-to-agent scale, skills function as the shared vocabulary of the system, if they are governed well enough to be trusted.
There is a historical pattern worth noting here. First-generation infrastructure encodes strong assumptions about how the system will be used. Second-generation systems treat those assumptions as compatibility layers and route real logic elsewhere. We’ve seen this happen with YAML, with Docker, with REST itself. MCP may follow the same trajectory: not because it ‘dies,’ but because it succeeds well enough to become invisible infrastructure. The question is not whether MCP survives. It’s what sits above it when it does.
As models get better at dynamic composition, working out from context and metadata which capabilities to invoke and how to combine them, the hard gateway scoping that currently produces well-behaved agents starts to look like infrastructure overfitting to current limitations. Progressive context disclosure and retrieval-first designs are the leading edge of a world where agents compose capabilities dynamically rather than operating within predefined envelopes.
But I’m not ready to conclude that hard scoping is a temporary crutch. In enterprise reality today, tool sprawl demonstrably degrades reasoning. Context bloat drives up costs (and not just token costs); reasoning quality degrades as context expands beyond what the model can attend to well. Organizations that have been through this are reintroducing constraints deliberately: defining, per role and per context, which tools exist. That’s architecture responding to current model capability. Whether it remains necessary as capability improves is an empirical question.
Split Decision
The ecosystem is drifting toward two possible futures without having explicitly committed to either.
The first is constrained and inspectable: reasoning shaped by explicit artifacts and protocols, skill files and gateways and published interfaces. The risk here is overconstrained systems. If you lock down the reasoning surface too tightly, you lose adaptability. The system does what the contracts say, even when they’re wrong.
The second is behavior internal to the model: external layers execute and constrain rather than instruct, and the model’s own capacity for judgment becomes the primary governance mechanism. The risk here is systems that act correctly most of the time and cannot explain themselves when they don’t. Observability tells you what the system did. It does not tell you why the model reasoned its way to that action.
So-called O11y 2.0 (or modern observability, if you prefer), the current generation of agent inspection tooling, is useful for certain scenarios, and I don’t want to minimize that. But it doesn’t close the gap between behavioral observation and reasoning transparency, because that gap isn’t an instrumentation problem; it’s an architectural one. If the contract lives inside the model rather than in an artifact, no dashboard surfaces it, no matter how wide the log. This is not a problem that better tooling (or a different methodology) will solve. It’s a problem that only a different architecture can solve.
The governance non-negotiables are the same either way. Security, identity, auditability cannot be abstracted away regardless of which future the field chooses. The move away from stdio transport is a security move. Workload identity, not just API keys. RBAC at the gateway, not the skill. Any architecture that treats these as optional is not ready for the industries where this technology is actually heading, and those are the industries that matter most, because they are the ones with the hardest requirements and the highest stakes.
There is a version of the “internal to the model” (aka execution-layer) future that is genuinely attractive. If the model’s judgment is good enough, and if the execution environment constrains it tightly enough through the control plane, then maybe the skill file is scaffolding that can eventually come down. The model reasons its way into correct behavior from context and policy signals alone. The architecture is thin, adaptive, and ungoverned by static artifacts that can only decay. I find this plausible as a long-term direction. I just don’t find it credible as a description of the present. Models today are capable enough to make the skills-as-knowledge-base approach feel unnecessary in the easy cases, but they are not yet capable enough to make it safe to remove the skill-as-contract in the hard cases. The distinction between those two scenarios is crucial.
But such appeal is partly an appeal to convenience. The execution-layer future is easier to maintain. It doesn’t require an ecosystem of skill authors to keep pace with the systems those skills describe. It doesn’t require expiry semantics or uncertainty metadata or the organizational discipline to retire skills that have gone stale. It just requires a good enough model in a well-governed environment.
The execution-layer future optimizes for operational correctness and sacrifices legibility. The skills-as-contracts future optimizes for legibility and sacrifices adaptability. This is a trade-off that different organizations, in different regulatory environments, with different risk profiles, weigh differently, and neither position is obviously correct across all of them. Both futures exist simultaneously, and the failure mode we should be worried about is not choosing the wrong one, but choosing one without knowing what you are giving up. Informed decisions ftw.
For practitioners making architectural decisions right now, I’d suggest a rubric. If the knowledge has a long half-life, put it in a skill with expiry metadata. If the knowledge decays fast, retrieve it at runtime rather than encoding it. If it can be synthesized on demand in a governed execution environment, put it in the execution layer rather than the skill file. If it needs to be inspectable for audit or compliance, use an artifact rather than dynamic synthesis. If it crosses a trust boundary between principals, put it in the gateway rather than the skill.
These are not rules. They are heuristics that I think are approximately right, for now, with the caveat that “for now” is doing a lot of work in a field that is moving this fast.
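For readers who think better in code, the rubric can be written as a small decision function. The rubric itself does not rank its five conditions, so the order of the checks below (trust boundary first, then audit, then decay, then synthesis) is my assumption, and the attribute names are invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Knowledge:
    crosses_trust_boundary: bool = False
    needs_audit: bool = False
    decays_fast: bool = False
    synthesizable_on_demand: bool = False


def placement(k: Knowledge) -> str:
    """Suggest a home for one piece of knowledge; a heuristic, not a rule."""
    if k.crosses_trust_boundary:
        return "gateway"
    if k.needs_audit:
        return "inspectable artifact"
    if k.decays_fast:
        return "runtime retrieval"
    if k.synthesizable_on_demand:
        return "execution layer"
    # Long half-life and no better runtime home.
    return "skill with expiry metadata"
```

Reordering the checks changes the answers for knowledge that meets more than one condition, which is a fair reflection of how much judgment the rubric still leaves to the practitioner.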
Where Irreducible Complexity Lives
The question is not whether MCP or skills ‘win.’
They have different ontologies. They’re not competing for the same job.
MCP is runtime, stateless, executable, and governed through the control plane. Skills are static, local, descriptive, and weakly governed. The ecosystem is currently treating skills as if they were reliable infrastructure, and governing them the way you’d govern a Markdown file, which is to say, barely at all. MCP is not governed that way. The mismatch between these two governance levels, applied to two components that are tightly coupled in practice, is the source of much of the current failure.
The real question, the one I think we as practitioners should be sitting with, is: where do we choose to locate the complexity we cannot eliminate? Or in the spirit of No Silver Labyrinth: where do we house the minotaur?
You cannot reason necessary complexity away. Agents operating in complex systems, on behalf of real principals, with real consequences, will encounter situations that no skill foresaw and no protocol anticipated. The question is whether we want that irreducible complexity to live in artifacts we can examine, or in behavior we can only observe. Currently the ecosystem is attempting both simultaneously, and paying the cost twice.
There is a related question about visibility that I don’t think gets enough attention. When the complexity lives in an artifact, you can point to it. You can argue about it. (Believe me.) You can version it. You can hand it to a regulator. When the complexity lives in behavior, you can instrument it, measure it, observe its outputs, but you cannot point to the part of the system that made the decision you are trying to understand. Observability without legibility is not enough in high-stakes systems.
The equilibrium here is not a single point, and I think that is actually the right answer. Different organizations, with different risk profiles and different regulatory environments, will land differently. A startup building a consumer product can accept a different governance posture than a bank, or a healthcare system, or an industrial operation. What the architecture requires is not consensus on one answer. It requires clarity about which choice is being made and what it costs.
Stafford Beer’s formulation that the purpose of a system is what it does, not what it says it does (‘POSIWID’), is one I return to constantly, and it cuts to the bone here. A skill file that encodes operational judgment without encoding its own limits is a description mistaken for a system. It looks like a component. It behaves like a document. The difference between those two things matters enormously when something goes wrong.
The closing question for every practitioner building in this space right now is simple. Does your skill know what it doesn’t know?
I said at the start that there is a specific kind of madness software ecosystems are prone to: systems that cannot tell you when they are wrong. Anosognosia is the clinical term for a condition in which a person is unaware of their own disability. It’s not denial. The patient is not refusing to acknowledge what they know. They genuinely cannot perceive the deficit. The lesion that causes the disability also damages the capacity to detect that the disability exists. The system cannot report on its own failure because the failure impairs the reporting mechanism.
The skills ecosystem exhibits something structurally similar. The epistemic failure (skills that encode perishable knowledge as permanent fact, that assert without boundary, that look complete when the reasoning is not actually inside them) is invisible to the agent that relies on them. The architectural failure (governance that cannot be inspected, contracts that live in behavior rather than artifacts) is invisible to the operators who need to audit them. Neither failure raises a flag. Both produce confident behavior until something goes wrong, and when something goes wrong, you’re left with no clear account of why.
That is the madness. Not chaos. Not obvious dysfunction. Skillful confidence in an apparently coherent operation, combined with structural incapacity to surface the conditions under which such confidence is no longer warranted.
The practical question this leaves us with is not “does your skill know what it doesn’t know?”, though that question is real and most skills fail it immediately. The deeper question is whether the system you build can detect its own epistemic state: whether the architecture has any mechanism for surfacing the gap between what the skill asserts and what is currently true.
Most don’t. That’s not the problem; that’s the diagnosis. The problem is that we’ve been treating it as an optimization to be deferred rather than a system design requirement to be addressed from the start.
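As a minimal sketch of what such a mechanism could look like, assume each claim in a skill carries the time it was last verified and a time-to-live. The names `SkillClaim`, `EpistemicState`, and `epistemic_state` are hypothetical and do not correspond to any published skill format or to the expiry semantics work mentioned in the notes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class EpistemicState(Enum):
    FRESH = "fresh"      # verified recently enough to rely on
    STALE = "stale"      # verified once, but past its time-to-live
    UNKNOWN = "unknown"  # no metadata, so nothing can be said either way


@dataclass(frozen=True)
class SkillClaim:
    """Hypothetical: one assertion in a skill plus its expiry metadata."""
    statement: str
    verified_at: Optional[datetime] = None
    ttl: Optional[timedelta] = None


def epistemic_state(claim: SkillClaim, now: datetime) -> EpistemicState:
    # A claim without metadata is not treated as fresh by default.
    if claim.verified_at is None or claim.ttl is None:
        return EpistemicState.UNKNOWN
    if now - claim.verified_at > claim.ttl:
        return EpistemicState.STALE
    return EpistemicState.FRESH


if __name__ == "__main__":
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    claim = SkillClaim(
        statement="Deploys go through the staging gateway first.",
        verified_at=datetime(2025, 12, 1, tzinfo=timezone.utc),
        ttl=timedelta(days=7),
    )
    print(epistemic_state(claim, now).value)
```

Keeping `UNKNOWN` separate from `STALE` is the design choice that matters here: a skill with no expiry metadata is the one that cannot tell you when it is wrong.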
Notes
[ 0 ] Probably worth noting here that Cass Sunstein, who seems to hold the opposite view (governance should be effectively if not completely given over to models) was literally the inspiration for Nassim Taleb’s “IYI”.
[ 1 ] “There are only two hard things in Computer Science: cache invalidation and naming things.” (Phil Karlton); later improved as “There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.”
[ 2 ] The 500,000+ figure and the Antigravity library statistics come from ecosystem tracking in early 2026. These numbers move fast; treat them as order-of-magnitude signals rather than precise counts.
[ 3 ] SEP-1686 task primitives and the expiry semantics work are active at time of writing. The specifics may shift. The architectural direction they represent is more durable than any particular specification.
[ 4 ] “Does your skill know what it doesn’t know?” is the question I’ve been using as a quick audit heuristic. It sounds simple. It is not. Almost every skill I’ve reviewed fails it immediately.
Acknowledgements
I’ve been working through these ideas in conversation with a number of people whose thinking has shaped mine considerably. The gateway scoping argument was clarified through discussions about the GitHub MCP server experience and what it actually implies architecturally. The expiry model emerged from looking at domain-specific deployments where the gap between long-lived and short-lived knowledge was impossible to ignore. The contracts-not-artifacts framing owes a lot to conversations about what coordination actually requires at multi-agent scale.
For more on the stateless transport implications, SEP-1442 itself is the primary document. For the control plane / data plane distinction as applied to agentic systems, the OWASP guidance on agentic security is a useful starting point. The POSIWID principle comes from Stafford Beer’s work on systems thinking, which remains underread in software architecture circles.

