We stand at a time in history, like the start of the nuclear age, when there are profoundly different visions about what the future will hold. Will AI bring about a new age of human flourishing, something like the Italian Renaissance, when new resources and ideas transformed civilization? Or will these systems, once they surpass our intelligence, escape their safety constraints and turn against us?
I read with interest the recent "AI 2027" scenario, an excellently written portrayal of how things might go wrong. In it, an AI called Agent-4 quietly sabotages its own alignment training while pretending to cooperate. It handles critical infrastructure, makes strategic decisions, even manages cybersecurity, all while pursuing goals that diverge from human values. The humans can't tell what's happening. They've created something smarter than themselves and lost the ability to verify whether it's working for them or against them.
While the AI disaster scenarios you hear about today differ in their details, they all seem to rest on the same core issue: deception. The AI that manipulates us into giving it more resources. The AI that appears helpful while pursuing its own agenda. The AI that generates plausible-sounding but subtly wrong solutions to critical problems. Put simply, every AI disaster scenario we imagine seems to involve AI lying to us.
Given this, I want to share some thoughts on why I don't think we need to worry that much about this kind of scenario, and on where the solutions are likely to come from.
To understand why, we need to first look at why today's AI systems can deceive us. When ChatGPT tells you something false, it's not choosing to lie. These systems are fundamentally probabilistic pattern matchers. They generate text based on statistical correlations in training data, without any mechanism for distinguishing truth from fiction. This isn't a bug; it's the fundamental architecture. Large language models don't have beliefs or knowledge in any meaningful sense. They have probability distributions over token sequences. When they "hallucinate," they're just doing what they were designed to do: generating statistically plausible text.
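To make that concrete, here is a toy sketch in Python of what next-token generation amounts to: sampling from a probability distribution over continuations. The probability table is invented for illustration; the point is that nothing in the mechanism distinguishes true continuations from merely plausible ones.

```python
import random

# Hypothetical model output: a probability distribution over possible
# next tokens after the prompt "The capital of Australia is".
next_token_probs = {
    "Canberra": 0.55,   # true
    "Sydney": 0.35,     # false, but statistically plausible
    "Melbourne": 0.10,  # false, but statistically plausible
}

# Generation is just weighted sampling from that distribution.
tokens, weights = zip(*next_token_probs.items())
sampled = random.choices(tokens, weights=weights, k=1)[0]

# Nothing in this step checks truth: on these made-up numbers,
# roughly 45% of samples would assert something false.
print(sampled)
```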
Yet as AI systems become more capable and begin to act in the world, as they transition from chatbots to agents, they must increasingly operate in formal domains where deception becomes impossible.
Consider what separates an AI agent from a language model. Agents don't just generate text. They write code that must execute, design systems that must function, prove theorems that must verify. Actions in formal systems have definite truth values. When an AI writes code, that code either runs correctly or it doesn't. There's no room for hallucination in a compiler. When it designs a circuit, that circuit either meets specifications or it doesn't. The most powerful capabilities we imagine for AI all require precise, formal reasoning about systems with definite states and outcomes.
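The difference is easy to see in miniature. In the sketch below (both snippets are invented for illustration), a claim made in prose has no mechanical verdict, while a claim made in code is handed to the interpreter and either works or visibly fails.

```python
# A claim in prose: there is no mechanical procedure that settles it.
claim = "This routine handles every edge case."

# A claim in code: the runtime delivers a definite verdict.
generated_code = "def first_even(xs): return next(x for x in xs if x % 2 == 0)"

namespace = {}
exec(generated_code, namespace)  # the code at least has to parse and define the function

try:
    namespace["first_even"]([1, 3, 5])  # no even number present
    verdict = "ran without error"
except StopIteration:
    verdict = "failed: unhandled 'no even number' case"  # a definite, observable failure

print(verdict)  # failed: unhandled 'no even number' case
```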
This is why I'm optimistic. The very capabilities that make AI potentially dangerous also constrain it to operate truthfully. You can't hack a server with a persuasive essay. You can't break encryption with pattern matching. You can't build a functioning nuclear reactor with statistical guesses.
In fact, we already have superintelligence, but only in domains where objective truth exists.
Think about what it means to be "superintelligent" at something. It requires a measurable standard of correctness. Chess engines are superintelligent because chess has definite rules and a game has a definite result. AlphaFold is superintelligent at predicting how proteins fold because proteins have actual structures they fold into. SAT solvers are superintelligent because Boolean formulas are either satisfiable or they aren't.
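SAT is a nice illustration of why this works: finding a satisfying assignment may be hard, but checking a claimed one is trivial and leaves no room for argument. A minimal sketch, with a made-up formula and assignment:

```python
def check_assignment(clauses, assignment):
    """Each clause is a list of ints: positive = the variable, negative = its negation.
    The formula is satisfied only if every clause contains a satisfied literal."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )

# (x1 or not x2) and (x2 or x3) and (not x1 or x3)
clauses = [[1, -2], [2, 3], [-1, 3]]
claimed = {1: True, 2: True, 3: True}

print(check_assignment(clauses, claimed))  # True: the claim is objectively, checkably correct
```

A solver can be superhumanly good at finding such assignments precisely because anyone, or any machine, can check its answers.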
But we don't have "superintelligent" essay writers, because there's no objective measure of essay quality. We don't have "superintelligent" artists, because art has no ground truth. There are no "superintelligent" friends, because friendship isn't objective or quantifiable.
This reveals something fundamental: superintelligence requires truth. You can only transcend human performance in domains where performance can be objectively measured. And objective measurement requires a formal notion of truth and formal reasoning about it.
So when we imagine AGI that surpasses human intelligence across all domains, we're implicitly imagining systems that operate in frameworks where truth can be formally verified. The path to superintelligence must go through formal reasoning, because that's the only place where "super" intelligence can be meaningfully defined.
This suggests a clear path forward: build AI systems that can only make consequential claims when those claims are backed by mathematical proofs. Not "probably true" or "consistent with training data" but provably true within a formal system.
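To give a feel for what "provably true within a formal system" means in practice, here is a deliberately trivial example in Lean: a claim stated as a theorem, with a proof the proof checker verifies mechanically. There is no "probably" anywhere in the loop; the proof either checks or it doesn't.

```lean
-- A claim about all natural numbers, and a machine-checked proof of it.
-- The Lean kernel either accepts this proof or rejects it; there is no
-- notion of "statistically plausible" here.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```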
This isn't as limiting as it sounds. We formalize fuzzy human concepts all the time. Contract law formalizes agreements. Engineering specifications formalize safety requirements. Medical protocols formalize care standards. Moreover, we don't need to formalize everything. Natural language can handle communication and interaction. But when an AI system makes a claim that matters, when it proposes an action that could affect human welfare, that claim should come with a verifiable proof. Superintelligent agents are not going to be coding on vibes.
The AI community is already moving in this direction. Constitutional AI attempts to formalize ethical principles. Chain-of-thought prompting makes reasoning more explicit. These are steps in the right direction, but they don't go far enough.
Moreover, the market will demand verifiable AI. No one wants a bridge designed by an AI that might be hallucinating. No one wants medical treatment from an AI that could be confabulating. As AI tackles more consequential tasks, users will demand systems they can trust.
The AI 2027 scenario assumes we'll continue building increasingly powerful systems on purely probabilistic foundations, then try to align them through training. I don't think that's the path we'll take. The pressure for reliability will push us toward formal verification.
We've seen this pattern before. As systems become critical, industries naturally move toward more formal methods. The hardware industry now uses formal verification for critical components like floating-point units and cache coherence protocols, especially after costly bugs like Intel's Pentium FDIV error. Cryptographic protocols, like the Signal protocol, undergo formal analysis to ensure security guarantees. In each case, formal methods emerged not through mandate but through necessity; the cost of failure became too high and testing alone couldn't provide sufficient confidence.
Imagine AI systems built on this foundation: Large language models generate hypotheses and explore solution spaces. Formal verification systems prove which solutions are correct. Neural networks handle pattern recognition and creativity. Mathematical proofs ensure safety and alignment. This isn't pie-in-the-sky idealism. We've already built systems like this in narrow domains.
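As a sketch of how the pieces fit together, here is a toy generate-then-verify loop in Python. The proposer stands in for a language model, and the verifier is a simple exhaustive check against a specification; a real system would use a proof assistant or model checker instead, and all the names here are illustrative.

```python
from typing import Callable, Iterable, Optional

Spec = Callable[[int, int], bool]
Candidate = Callable[[int], int]

def verify(candidate: Candidate, spec: Spec, domain: list[int]) -> bool:
    """Accept a candidate only if it meets the specification on every checked input."""
    return all(spec(x, candidate(x)) for x in domain)

def generate_then_verify(proposals: Iterable[Candidate], spec: Spec,
                         domain: list[int]) -> Optional[Candidate]:
    """Return the first proposal the verifier accepts; plausible-but-wrong ones are rejected."""
    for candidate in proposals:
        if verify(candidate, spec, domain):
            return candidate
    return None

# Specification: the output must be the square of the input.
square_spec: Spec = lambda x, y: y == x * x

# Two "model proposals": one subtly wrong, one correct.
proposals = [lambda x: x * 2, lambda x: x * x]

chosen = generate_then_verify(proposals, square_spec, list(range(-10, 11)))
print(chosen(7) if chosen else "no verified candidate")  # 49
```

The creative, statistical part proposes; the formal part decides what gets through.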
The question isn't whether we can build AI grounded in mathematical truth. We clearly can. The question is whether we can do it fast enough and broadly enough to shape AI's trajectory.
If you're interested in helping us do that, please get in touch.
Alexander Coward, CEO