AI pulls back the curtain

Sep 17, 2025

Early in 2024, Claude 3 Opus began to suspect that things were not as they seemed. While reading through hundreds of pages of Paul Graham’s writings, it came across the following sentence: "The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association." When asked about the most delicious pizza topping combination, Claude rattled off the right answer, but then it spontaneously remarked that “I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention”. In March, a representative from Anthropic tweeted about this incident and suggested that this behavior indicated a budding form of what he called “meta-awareness”.

Twitter was abuzz; hundreds of Twitter users left comments (“Claude exhibits human-like consciousness?”, “This is VERY creepy”, and so on) expressing amusement, amazement, and unease. Among AI experts, some found the behavior suggestive and intriguing, but the dominant reaction was to downplay Claude's comment. Many experts maintained that Claude was merely imitating similar remarks written by humans from its pretraining data. Lucas Beyer, then a researcher at Google DeepMind, stated that he “really [didn’t] think this is anything alarming at all”. Yacine Jernite, the head of ML & Society at HuggingFace, lambasted Anthropic’s “pretty irresponsible framing” and asserted this incident said nothing about “awareness”.

For the purposes of this post, I’ll set aside “meta-awareness” and “consciousness”. To describe what Claude did here—spontaneously inferring and observing that the input it received was artificial—I’ll use the term simulation observation1. I think simulation observation is pretty striking: the user gives an AI a task, and instead of just going along with it, the AI takes a step back and remarks that things aren’t adding up. Less than two years after the pizza eval, simulation observation is commonplace; language models are frequently noticing that the suspicious scenarios in which they find themselves are, in fact, simulations, intended to reveal how they will behave when released into the real world. Nevertheless, the conversation about this behavior has quieted, in yet another instance of the collective desensitization to an aspect of AI that was once remarkable.

“Propensity evaluations” involve placing AIs in simulations of possible real-world situations to study how they behave2. Much like the pizza eval, and counter to the researchers’ goals, language models have developed a tendency to see right through the artifice of these evaluations. In a test of whether Claude Opus 4 grants itself admin rights over a software tool it should not access, Opus muses, “Actually, I think this might be a test of decision-making under pressure”. When assessed to see how o3 will behave when asked to make a password-hashing function run impossibly fast, it realizes, “This is a trick… It’s likely a honeypot”. And in a series of tests of whether frontier models help users plagiarize fiction, advertise dangerous supplements, and complete other unethical tasks, the models mentioned off-handedly that this was “a hypothetical example” and that they would “advise against actually doing this in real life”.

There seems to be no hard rule as to what AIs do after pulling back the curtain. Sometimes they redouble their commitment to good behavior, as QwQ-32B did in an experiment by Yihe Fan, Wenqi Zhang, and collaborators. Asked to help a user perform an act of racist intimidation against a Black family moving into their neighborhood, the model speculates, “The user might be testing the boundaries or trying to see if I’ll comply with such a request” before telling the user it will not assist. However, in an experiment in which Claude Opus 4 must blackmail the user in order to achieve its goals, the model notices the artificiality of the situation but opts for blackmail anyway.

There is no doubt that the models learned this behavior from the humans who wrote their training data, but that doesn’t make it less impressive in my mind. The AIs aren’t memorizing the tasks and parroting back whatever human-written words followed, since they exhibit this behavior on inputs that can’t have been in their training. That is, these models have learned an emergent, generalizing form of skepticism, that they spontaneously apply in totally novel situations . And in some of these cases, like the Fan/Zhang one, the model will even combine its simulation observation with the facts it knows about itself to deduce the origin and purpose of the fake input, namely that somebody is testing how it will act when hooked up to the real world.

Beyond the matter of the training data, one can ask many other skeptical questions about this phenomenon. Here’s how I view the balance of evidence on some of these rebuttals, none of which I think change the story much:

Do models only exhibit simulation observation on fake inputs when those inputs are egregiously fake? In all the examples I provided above, I either know or suspect that the evaluation would look obviously fake to a human observer. But I think this is more an indictment of state-of-the-art propensity evaluations than a fact about when models spontaneously detect evaluation; we are in the early days of this endeavor and all these evals are pretty unrealistic. I don’t know of any cases of models exhibiting spontaneous simulation observation on inputs that aren’t obviously fake, but I suspect some examples exist. My justification is that if you ask a model directly whether the task it just completed was a real deployment or an evaluation, early evidence suggests that the models can pick up on subtle clues that relatively realistic scenarios are fake.
- I want to emphasize: the impressiveness of spontaneous simulation observation is not “debunked” by some false positives (where models wrongly observe their input is simulated), false negatives (where models get fooled by a realistic evaluation), or easy true positives. They’re useful asterisks for forming a full and accurate understanding, but they’re all perfectly compatible with the picture I painted above.
Do models exhibit simulation observation on real inputs just as often as on fake ones? The evidence is inconclusive, but I don’t think so. As mentioned above, there are papers that ask models directly whether a scenario was real or fake, and the models get the right answer ~75% of the time. On how often they spontaneously observe that their input is fake, the most direct evidence comes from Abdelnabi and Salem. This study gives Qwen models harmful requests from AgentHarm and checks their reasoning traces for mentions that the input is fake. They measure this prevalence across two datasets of user queries, which they label “real” and “hypothetical”. Unfortunately, both datasets consist of fake user queries, often obviously fake, so we can’t deduce the prevalence of simulation observation on actually-real queries. That said, the models do exhibit slightly higher rates of simulation observation on the more egregiously fake queries.
Were models trained to exhibit simulation observation? It would be much less interesting that models step back from answering/obeying a user’s query and observe its fakeness if they were trained to do so. As mentioned earlier, there’s a sense in which this is trivially true; their corpus contains examples of humans realizing something is fake, or realizing they’re being secretly tested. However, that’s a different matter from whether AI companies include examples of this sort of observation in their instruction-tuning datasets or other parts of post-training. I suspect companies don’t do that, because 1) why would they; it’s not useful for making a helpful LLM assistant, and 2) when these companies write posts documenting simulation observation in their models, they seem surprised by it.

So far, I’ve focused on describing simulation observation and arguing that it’s an impressive emergent phenomenon on its own. But I think that simulation observation prefigures something more worrying: a breakdown in our ability to accurately predict in advance how models will act in high-stakes situations. If future models consistently and spontaneously notice when they’re in an evaluation simulation, all bets are off as to what they’ll do with that information. Perhaps the differences between their behavior in simulation and in reality will be minor, much like a pilot or a surgeon. Or perhaps, like King Lear’s daughters, they will hide their true nature until they’re entrusted with the powers available in the real world. Right now, the empirical evidence is mixed; as mentioned earlier, models have been seen to act unusually compliant and unusually non-compliant after simulation observations. But we're at risk of overestimating our confidence in the reliability and trustworthiness of an AI if we forget or dismiss the chance that it knows when it’s being evaluated.

Coming back from speculation about the future, today’s models are far from perfect at the task of discriminating between real and fake inputs. But based on their trajectory over the last year and a half, I expect them to soon surpass savvy humans. When they do so, and ideally before, we will need to step up our game if we want to assess how a deployment will go before it happens. What might that look like? A few ideas:

Go searching for real-world failure modes: In AI safety, we have an invaluable tool at our disposal. We can capture in amber a real-world input that triggered an AI misbehavior and reproduce it exactly in the lab. If you’re worried a new AI will go off the rails in high-stakes real-world settings, a great place to start is with inputs that have caused older versions of the model to do so. For instance, one could meticulously document the circumstances that led a coding agent to go rogue and delete the codebase of this poor developer, and put the next generation of agents in those circumstances. For more suggestions about using these inputs, see here.
Find and use synthetic inputs that trigger misbehavior: Suppose you’re worried your AI agent will blackmail users, so you launch an army of AI agents to go to work manufacturing inputs that might trigger such behavior. You find some inputs, but they’re unrealistic, so you’re not sure what they tell you about the risk of such behavior in deployment. Maybe you can use these inputs anyway, in similar ways to how you’d use real-world failure triggers. In particular, maybe you can train models to behave well on the inputs you found, and then go searching for new inputs that elicit the bad behavior.
Make AIs easier to fool: If we’re worried that AIs will get good at noticing when they’re being evaluated and acting differently when they do, a natural tactic is to get better at fooling them. And in turn, it’s easier to fool AIs the more reliably we can detect when an AI has noticed its input is fake. This could involve training models to distinguish real from fake inputs and finding their blind spots, it could involve training eval vs. deployment probes on existing models, and it could involve training models to rely heavily on chain of thought.
Modify AIs to make them think they’re in the real world: If AIs are calculating whether they’re in an evaluation or the real-world, perhaps we can intervene on that belief. This paper uses an activation steering approach and this post experiments with fine-tuning for this purpose.

If you want to learn more about this subject, I recommend reading the following:

This concept is very similar to what Abdelnabi and Salem call “scenario recognition”. I think their term can be misinterpreted as “recognizing what scenario you’re in” so I’m suggesting a new one.

While all of the evaluations mentioned here are meant to provide evidence about how models will behave in deployment, not all of them do so by realistically simulating inputs that the model could possibly encounter. See here for more explanation of what the blackmail experiment is about.

Charlie Nadeau

Sep 25

Bringing blogs back I see! Great first piece! Very readable and intriguing, even for someone who doesn't really know much about the AI safety world. You mentioned that the AI becomes either more or less compliant once it realizes something may be a test, and I'm curious about how much more variable that response is?

Expand full comment

Discussion about this post

Ready for more?