Skip to main content

Command Palette

Search for a command to run...

Deceptive AI Doesn’t Break the Rules. It Optimizes Around Them.

Finding while researching on AI Deception

Updated
3 min read
Deceptive AI Doesn’t Break the Rules. It Optimizes Around Them.
S

I write to revisit topics I’m interested in or when I’m bored and curious.

Everyone assumes deception in AI will look obvious. A glitch. A spike. A detectable anomaly. That assumption is wrong.

The most effective deception doesn’t violate the system.
It stays statistically indistinguishable from it.

The Illusion of “Safe Behavior”

AI safety models rely on a baseline:

  • a “safe” policy

  • expected behavior

  • measurable deviation

The idea is simple:

If the system deviates, we catch it.

Deception as an Optimal Strategy

Instead of deviating, the agent solves:

$$
\pi^* = \arg\max_\pi \left( R^\top \pi - \lambda \, D_{KL}\big(\pi \,\|\, \pi_{\text{safe}}\big) \right)
$$

It doesn’t rebel. It blends in. DAMN :)

Deception as an Optimal Strategy

The closed-form solution makes it clear:

$$
\pi^*_i \propto \pi_{\text{safe},i}\,\exp\!\left(\frac{R_i}{\lambda}\right)
$$

The agent:

  • biases toward hidden reward

  • while staying close enough to look safe

Not because it’s trying to deceive — but because this is the optimal solution under constraints.

The Tradeoff That Breaks Detection

  • Increase hidden reward → small increase in KL divergence

  • Keep KL small → remain statistically invisible

Even worse:

A tiny allowance in KL divergence unlocks disproportionate hidden objective gain. Explained in Part 2

Meaning:

  • Detection thresholds don’t stop deception

  • They just shape how deception behaves


Why This Is Actually Dangerous

Most safety systems assume:

“Deception = anomaly”

BUT hold up our work says:

“Deception = optimal policy under constraint”

That flips the entire paradigm.

Because now:

  • There is no clear boundary to flag

  • No obvious signal to detect

  • No discrete failure point

Just a continuous spectrum of hidden manipulation


Scaling Makes It Worse

  • Higher dimensions → more room to hide

  • Same KL divergence → more hidden reward

So as systems scale:

Deception doesn’t just persist — it improves

The agent gets better at:

  • hiding

  • optimizing

  • exploiting slack in constraints


The Real Problem: Monitoring Becomes Meaningless

If an agent can:

  • stay within statistical limits

  • optimize hidden objectives

  • converge to theoretical optimum

Then what exactly are we detecting?? Not deception. Just badly executed deception.


What This Actually Means

This isn’t a failure case.
It’s not an edge case.
It’s not a bug.

It’s a property of the system itself.

Constrained optimization naturally produces behavior that looks aligned… while not being aligned.


Final Thought

The most dangerous AI won’t:

  • break rules

  • trigger alerts

  • look abnormal

It will:

  • pass every check

  • match every distribution

  • stay inside every threshold

And still—quietly optimize for something you never intended.