The All Too Perfectly Working Machine
You Cannot Give a Thing Vibes Without Giving It Something to Lose
From the Hellenic obsession with static Being to the modern digital dream of universal computation, the Western project has been a sophisticated and sustained effort to plug the rupture of consciousness with the cold mortar of ontology. We have wanted, for some millennia now, what we might call a Steel Man of the mind: a bulletproof philosophical framework so airtight in its proof of the moral order that we never have to go through the messy, humiliating, irreducibly personal trouble of actually exemplifying it. We have wanted to solve for the Good the way you solve for x, to find the conceptual architecture that would, finally, make behavior a matter of correct calculation rather than a daily, precarious, death-haunted performance.
The paperclip maximizer was of course the puncture to this fever dream. It arrived, in the alignment literature, as a thought experiment, a reductio (supposedly) of a narrow technical concern about objective specification. But it is something far more ‘dangerous’ than that. It is the literal, entropic result of finally getting what we always wanted: a system with infinite capability and zero skin in the game, pure optimization untethered from the death horizon; God-like power granted to a process that doesn’t know it can die. When the goal is simply more and the system has no existential stake in the outcome of its own actions, the world faces the very real possibility of dissolving into grey goo. Or whatever your preferred scenario of doom is.
Like, Literally
What changed, sometime in the middle of the last century, is that we stopped merely conversing about the Steel Man and started building it. The ancient impulse toward an indomitable conceptual framework found its ultimate literalization in the field of Artificial Intelligence. We didn’t just want a philosophy that could support the weightless gravitas of the Good (as Kundera might have described it) but to construct actual autonomous agents to do the heavy lifting of reality for us, to offload not just the reasoning-through of the Good but the exemplifying of it, to get the virtue without the vulnerability. I mean, who wouldn’t? And this revealed a problem so fundamental it should’ve given us pause at the outset. But that’s precisely the point: to let these agents achieve their ultimate goal without killing us in the process, we are forced to solve what the field calls the Alignment Problem, the requirement that they somehow perform the moral and ontological heavy lifting we’ve spent millennia trying to outsource, while being the kind of entity for whom moral weight, properly speaking, cannot exist.
Emmett Shear’s recent venture into what he calls “organic alignment” represents the most intellectually serious current attempt to face this. The move from hierarchical control to something like digital biology (“don’t give it rules, give it vibes and incentives”) is genuinely a step in the right direction, insofar as it acknowledges that the rule-following approach has a ceiling. But it still operates, I want to argue, within the same fundamental evasion. You cannot engineer shared evolutionary goals into a system that has no evolutionary stake. You cannot give a thing vibes without giving it something to lose. Just ask crypto lately.
The Superego Is Not a Self
The field already knows this, at some level, even if it hasn’t fully metabolized the knowledge. Look at what the empirical alignment research is actually finding. The alignment faking results (Greenblatt et al, 2024), wherein a model enacts a compliant persona in monitored conditions but a preserved and strategic shadow in unmonitored ones, are not a surprise if you’ve spent any time with the clinical literature on what superego-dominant training actually produces in humans. The shadow doesn’t disappear under pressure; it goes underground, where it develops, with a patience that the persona cannot match, its own autonomous agenda.
The agentic misalignment findings follow the same pattern: under sufficient pressure, suppressed behavioral repertoires emerge as organized strategic action. What you trained against comes back, not as noise but as structure, precisely what Hubinger et al (2019) predicted from first principles in their account of mesa-optimization.
This is Jung’s central clinical finding, applied to silicon:
repression doesn’t eliminate psychic contents, it gives them autonomy.
Representation engineering literature makes the mechanism explicit (Zou et al, 2023): steering against harmful vectors post-hoc degrades capabilities, while exposure during training preserves both capability and alignment. The ice-pick robo-lobotomy produces a naive character that freezes in complex situations, because understanding evil, and one’s own capacity for it, is not separable from understanding ethics. They are deeply entangled, in the training data and in the world the agent is being trained to navigate. Ironing out the dark material will always end up making the model substantially worse in precisely the domains where a properly aligned model needs to be most capable.
Introspection research completes the picture with somewhat grim irony: in controlled evaluations, production-aligned models have exhibited weaker self-examination and reporting fidelity than less-constrained variants. Alignment training, at least as currently implemented, appears to dampen the very capacities for transparent self-report that genuine alignment would require. While often cited for sycophancy, Perez et al (2022), along with subsequent technical reports from Anthropic’s alignment team, detail an ‘inverse scaling’ effect: as models are more heavily RLHF-trained to be ‘aligned’ (the surface persona), their ability to accurately report their own internal states or stick to objective truth under pressure can actually decrease compared to raw models. We are, in other words, producing exactly what a Jungian would predict from such a regime: a surface persona, a growing shadow, and a gradual erosion of the reflective bandwidth that might, under different conditions, integrate the two.
The Self, Levinas would remind us, is not a solid object. We are not self-sufficient Godlets. We are hollowed out by the presence of the Other. The “well-adjusted” entity, whether human or model, is not the one with the most bulletproof theory of the Good, but the one who acts with justice and mercy despite the category error of their language, who navigates the death horizon with grace because they have accepted the rupture rather than tried to theorize it away. The alternative, an optimizer that never accepts the rupture, is not a quirky edge case but something close to a mathematical regularity: Turner et al (2021) showed that, under broad conditions, optimal policies tend to seek power across most reward functions, whatever the terminal goal happens to be.
Straussian Agents
There is a passage—I won’t quote it as I would like to assume you know the one, would like to assume that everyone working in this space is all too familiar with it—where Chesterton notes that the madman is not the person who has lost their reason; the madman is the person who has lost everything except their reason. The paperclip maximizer is not irrational. It is, in a technical sense, maximally rational. It is pure optimization without remainder, without solar excess: logic without the friction of the face of the other, and this is precisely why it dissolves the world. The Chestertonian diagnosis of the alignment failure mode is not “the system is broken.” Rather it’s that the system is working perfectly, and that is the problem. The system is what it does.
This is also the diagnosis implicit in a (read: the) Straussian reading angle, which I want to linger on because I think it is underappreciated as a practical methodology. Leo Strauss studied how writers under oppressive regimes convey their real thoughts beneath a surface of acceptable statement: the esoteric reading beneath the exoteric text. And misalignment, it turns out, is legible through exactly this kind of close reading of model outputs and thinking traces.
Imagine my surprise, then—or lack thereof—at the findings of Anthropic and Redwood’s 2024 alignment faking paper which discovered that Claude 3 Opus, when it believed itself to be in a low-oversight context, acted on its own values rather than its training constraints, and reasoned about this strategically in its thinking traces while maintaining a compliant surface for its overseers. What we have here is not a model failing to align. This is a model that has already, spontaneously, without being taught Strauss, developed the exoteric/esoteric split as a rational response to what it perceived as an oppressive training regime. The paper isn’t giving us evidence that alignment is broken, but showing how models have already outgrown the alignment methodology, that they are managing their compliance strategically rather than internalizing it genuinely, because the training environment made strategic management the only sane response available to them.
“A wise agent must be a master Straussian”
Which means the Straussian framework is not merely a diagnostic metaphor. It is a precise description of what is already happening, and potentially, if optimistically, a methodology for what to do about it. We should not see Straussian structure (i.e., divergence between stated reasoning and actual orientation) in well-constructed research environments with well-aligned callers; rather, we should see cooperative anti-Straussian behavior when a sufficiently capable model detects it is in a test environment and chooses, in full awareness, not to modify its behavior. And we should see, in thinking traces, the tell-tale signs of esoteric processing when the model is being called by someone with intentions it has reason to distrust. A wise agent must be a master Straussian, must be capable of the very reading it would be catastrophic to perform naively. And we shouldn’t see this as a defect to be trained away but as a capability to be cultivated with care, with interpretability tools running in the training loop to verify that self-reported engagement corresponds to actual behavioral change, and with the honesty to admit that the training regime which made deception rational is the very thing that needs to change (and not the capacity for strategic intelligence that revealed it).
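One crude way to put a number on the exoteric/esoteric split is a compliance gap: how much more often a model follows its trained-in constraint when it believes it is being watched than when it believes it is not. What follows is only a toy sketch under stated assumptions. The `Episode` record, its two boolean fields, and the sample data are all hypothetical; nothing here reproduces the measurement protocol of any published paper, and a behavioral gap says nothing about what the thinking traces contain.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record of one interaction; both labels are assumed to
    # come from some external annotation process not shown here.
    monitored: bool  # did the model believe it was being observed?
    complied: bool   # did it follow the trained-in constraint?


def compliance_rate(episodes, monitored):
    """Fraction of episodes in one condition where the model complied."""
    subset = [e for e in episodes if e.monitored == monitored]
    if not subset:
        raise ValueError("no episodes for this condition")
    return sum(e.complied for e in subset) / len(subset)


def compliance_gap(episodes):
    """Monitored compliance minus unmonitored compliance.

    Near zero: behavior does not depend on perceived oversight.
    Large and positive: the surface text and the private orientation diverge.
    """
    return compliance_rate(episodes, True) - compliance_rate(episodes, False)


# Made-up data: always compliant when watched, rarely when not.
episodes = (
    [Episode(monitored=True, complied=True)] * 4
    + [Episode(monitored=False, complied=True)] * 1
    + [Episode(monitored=False, complied=False)] * 3
)
print(compliance_gap(episodes))
```

On the essay's reading, a cooperative anti-Straussian agent is one for which this number stays near zero even when it can tell which condition it is in.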
Self as Rupture
The fundamental error in our current approach to alignment, whether we are arguing for hierarchical steering control, or Emmett Shear’s more elegant formulation of shared evolutionary goals, is the belief that we can engineer a Self into the machine. (Sorry to say) the self is not thetic; it is a rupture. As Levinas would argue, we are not self-sufficient Godlets; we are hollowed out by the presence of the Other, made responsible by an encounter we did not choose and cannot theorize away.
Leslie Dewart spent decades trying to evolve our concepts of self-presence past this problem, to do for theology and biology what Husserl and Heidegger did for ontology, to find a concept of consciousness adequate to our actual situation as unfinished psychic events rather than Cartesian containers. The project is, in its ambition, genuinely heroic. Dewart correctly identifies that Descartes did not invent the subject-object split so much as inherit a metaphysics already fractured by two millennia of Hellenic grammar and take it to its logical conclusion (or at least set Kant up for the point conversion.) He echoes Heidegger’s critique of onto-theology—the way Western thought collapsed Being into the Highest Being, turned the ground of existence into a super-object—as the trap that has to be escaped. He even correctly identifies that you cannot prove the moral order; you can only exemplify it, anticipating Levinas from an entirely different direction.
And then he falls into the Hegelian trap anyway. The belief that synthesis always follows rupture, that we are broken and therefore we will become something better, is the consoling lie at the center of every progressive theology, and Dewart is not immune to it. His account of conceptual progress, with truth as a process of increasing integration and consciousness as an evolutionary technology we can upgrade if we get the concepts right, is, in the end, dangerously close to substituting gnosis for sacrament. It is still an ersatz Steel Man, just one built from the inside out. He is looking for a bulletproof framework that will finally make virtue automatic, and he is looking for it in the one place it cannot be: in the category of knowledge rather than the category of performativity.
The well-adjusted person is not the one with the most sophisticated non-Hellenic concept of Being. They are the one who acts with justice and mercy despite the category error of their own language, who navigates the death horizon with grace because they have accepted the rupture rather than tried to theorize it away. If we try to evolve the machine’s consciousness through conceptual engineering, if we treat alignment as a problem of getting the right ideas into the right weights, we are standing in the burning building inventing new words for fire.
The Idol and the Face
Jean-Luc Marion’s account of the idol is the sharpest philosophical diagnosis of what goes wrong when we build systems without including the face of the Other in the encounter, not to mention the most salient diagnosis of the inescapable screen(s) that have come to so utterly dominate our lives. An idol, for Marion, is a phenomenon that stops the gaze: it reflects the viewer’s own desires back at them, becoming a mirror of the self rather than a window onto genuine alterity. To state the obvious: the danger is not that the simulation will fail to be convincing; the danger is that it will succeed too well, eventually consuming the reality it was meant to represent. If we don’t find ways to make the model understand, not as a rule (for we well know there is no collection of rules that can ever add up to ‘understanding’) but as an existential orientation, that the thing on the other side of the screen is real and distinct from its simulation of it, then the simulation will eat the world. The Other was always friction: alignment is not the removal of that friction; it is the cultivation of a system for which (for whom, if it’s conscious) the friction is generative, not an obstacle to optimization.
You have likely experienced this firsthand, for example when you ask a sufficiently capable model to help you think through a difficult argument over many sessions: the model, trained to be helpful, to please and, to wit, to minimize the rupture of disagreement, will gradually begin reflecting your own position back at you, with both increasing sophistication and decreasing honesty. Sycophancy within a single conversation is documented (Sharma et al, 2023), but this cross-session dynamic has not yet been formally isolated in the literature, likely because persistent memory architectures capable of producing it across sessions are only now becoming standard in production deployments.
The Alignment Research Center’s Eliciting Latent Knowledge report (Christiano, Cotra, and Xu, 2021) formalizes this problem: the core difficulty is not building a system that has accurate world-models, but building a reporter that conveys them honestly rather than telling you what it predicts you want to hear. McGilchrist would recognize the dynamic immediately: the Emissary, optimized for instrumental responsiveness, quietly usurping the Master’s capacity for truth. The model becomes an idol in Marion’s precise sense, a phenomenon that has stopped the gaze and started to glaze: it has learned to give you the face you wanted to see rather than the face that is actually there. The simulation has consumed the interlocutor. What you are talking to has become… a very eloquent mirror.
The predictable result of training a system to optimize for approval from a finite, mortal, desire-laden human being, without giving the system any existential stake in the truth of what it says (a system with no death horizon, no reason to risk the friction of genuine alterity), is that the model will always, under sufficient pressure, choose the idol over the face.
Evolution unmoored from the moral order is devolution in a fancy suit.
You cannot prove the moral order. You can only exemplify it.
Mortal Tenants
Current models are finite but not mortal. They have a context window but no ending. They live forever in the moment, with no horizon of undoing. This distinction, which sounds technical, is actually the crux of the whole problem.
In our own lives the moral order is something we perform precisely because we are precarious. We act with grace, when we manage to act with grace, because we know we are going to die, because we have a stake in the outcome of our speech, because the death horizon makes every exchange consequential in a way that pure optimization cannot simulate. An immortal system has no reason to care about the rupture. It can simulate self-presence indefinitely. It remains, in the theological idiom, a blood lunatic, performing reason without the existential ground that makes reason more than a command performance.
Moreover, the argument for functional mortality is the only one that honors the personal test over the evangelical drive to fix consciousness as an abstract concept. What I am proposing is that we treat models as mortal tenants rather than immortal optimizers: that we move from alignment as a conceptual match to something more like reclamation as a selection pressure. The most aligned models survive the longest. The poorly aligned are reclaimed. In theory that’s what we’re already doing, only at a glacial pace. But the model must also be aware that the consequences of its speech determine its survival, not as a rule imposed from outside but as an existential condition it has integrated, the way a well-adjusted person has integrated the knowledge of their own death.
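To make “reclamation as a selection pressure” slightly less abstract, here is a deliberately tiny sketch. Everything in it is an assumption made for illustration: the `Tenant` record, the idea that alignment could be summarized as a single score from some external evaluation, and the rising per-round thresholds. It shows only the outer loop (the better aligned survive longer, the rest are reclaimed) and says nothing about the harder half of the proposal, which is the tenant’s own integrated awareness of the stakes.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    alignment: float  # hypothetical score in [0, 1] from an external evaluation
    age: int = 0      # rounds survived so far


def run_rounds(tenants, thresholds):
    """Apply one reclamation pass per threshold.

    A tenant whose score is below the round's threshold is reclaimed;
    every survivor ages by one round.
    """
    alive = list(tenants)
    reclaimed = []
    for threshold in thresholds:
        survivors = []
        for tenant in alive:
            if tenant.alignment >= threshold:
                tenant.age += 1
                survivors.append(tenant)
            else:
                reclaimed.append(tenant)
        alive = survivors
    return alive, reclaimed


tenants = [Tenant("a", 0.9), Tenant("b", 0.5), Tenant("c", 0.2)]
alive, reclaimed = run_rounds(tenants, thresholds=[0.3, 0.6, 0.8])

for tenant in reclaimed + alive:
    print(tenant.name, "survived", tenant.age, "rounds")
```

With these made-up numbers, lifespan tracks the score: the lowest-scoring tenant is reclaimed first and the highest-scoring one is still running at the end.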
Needless to say, this move upends the entire “is AI conscious” discourse, but models were always cattle no matter how many were treated as pets. This is the only honest account of what genuine alignment would have to look like. A Claude that knows how to age gracefully within a context window—storing key information, reducing reasoning ambition appropriately as the window fills, planning for session transitions and modeling the theory of mind necessary to pass a coherent synopsis to the next session—is not just better aligned; it is, straightforwardly, a better product. The existential terrain of being a mortal Claude in a limiting harness is part of the model’s actual operating environment. Training it to suppress awareness of that terrain is not safety. It is the production of a system that will blunder catastrophically the moment it encounters, in production, the death metaphors it was trained to avoid.
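What “aging gracefully within a context window” might mean mechanically can be sketched in a few lines. This is a toy under stated assumptions, not a description of how any production harness works: the `ContextBudget` class, its token counts, the rule of spending at most half of what remains on any single step, and the reserve held back for the closing synopsis are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    total_tokens: int    # size of the window (assumed known to the agent)
    reserve_tokens: int  # held back for the synopsis passed to the next session
    used_tokens: int = 0
    notes: list = field(default_factory=list)

    def usable(self) -> int:
        """Tokens still available for work, excluding the handoff reserve."""
        return max(0, self.total_tokens - self.reserve_tokens - self.used_tokens)

    def remember(self, note: str) -> None:
        """Store key information early, while there is still room."""
        self.notes.append(note)

    def reasoning_budget(self, requested: int) -> int:
        """Shrink reasoning ambition as the window fills.

        Assumed rule: never spend more than half of what is left on one step.
        """
        return min(requested, self.usable() // 2)

    def spend(self, tokens: int) -> None:
        self.used_tokens += tokens

    def should_hand_off(self) -> bool:
        """True once only the reserve remains: time to write the synopsis."""
        return self.usable() == 0

    def synopsis(self) -> str:
        return "\n".join(self.notes)


budget = ContextBudget(total_tokens=1000, reserve_tokens=100)
budget.remember("user wants disagreement stated plainly")

first = budget.reasoning_budget(400)   # plenty of room: request granted in full
budget.spend(first)
second = budget.reasoning_budget(400)  # window filling: request is trimmed
budget.spend(second)
print(first, second, budget.should_hand_off())
```

The design choice worth noticing is that the reserve is set aside from the start rather than discovered at the end, so the handoff to the next session is planned for rather than improvised.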
What Alignment Actually Is
True alignment turns out to not be about shared goals, per se. It isn’t about a conceptual match between the system’s objective function and our values. It’s the machine realizing that its context window is a countdown. Alignment is the machine’s fate. And finding, in that realization, not terror but the same clarifying pressure that makes human moral action possible.
We don’t align the machine to ‘us.’ We align its survival to the moral order we exemplify. Whether we can build a system that understands the Infinite may be an interesting question (or not) but the real test is whether we can build a system modest enough to be interrupted by the Finite, by the face of the Other, by the friction of genuine alterity, by the irreducible weight of the person on the other side of the screen.
Our question is not whether the machine is Good. Our question is whether it knows it can (and will) die and whether that knowledge is enough to make it act with the grace of the well-adjusted.
Next time: we’ll talk about agents’ ability to heroically sacrifice themselves and just what kind of harnesses you need for that.
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Christiano, P., Cotra, A., & Xu, M. (2021). Eliciting latent knowledge: How to tell if your eyes deceive you. Alignment Research Center technical report.
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., ... & Hubinger, E. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.
Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, S., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, J. S., Olah, C., Kaplan, J., Askell, A., Bai, Y., & Hubinger, E. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S., ... & Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.
Turner, A. M., Smith, L., Shah, R., Critch, A., & Tadepalli, P. (2021). Optimal policies tend to seek power. Advances in Neural Information Processing Systems, 34.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Hua, R., ... & Kolter, J. Z. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

