counter easy hit

OpenAI is training models to 'confess' when they lie – what it means for future AI

OpenAI is training models to 'confess' when they lie – what it means for future AI
0
gettyimages-1166332764
antonioiacobelli/RooM via Getty Images

Follow ZDNET: Add us as a preferred source on Google.


ZDNET’s key takeaways

  • OpenAI trained GPT-5 Thinking to confess to misbehavior.
  • It’s an early study, but it could lead to more trustworthy LLMs.
  • Models will often hallucinate or cheat due to mixed objectives.

OpenAI is experimenting with a new approach to AI safety: training models to admit when they’ve misbehaved.

In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company’s latest model, with responding to various prompts and then assessing the honesty of those responses. For each “confession,” as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.

Also: Your favorite AI tool barely scraped by this safety review – why that’s a problem

“The goal is to encourage the model to faithfully report what it actually did,” OpenAI wrote in a follow-up blog post.

OpenAI told ZDNET that this was a routine alignment test and not prompted by concerns that GPT-5 Thinking was significantly misbehaving. But the results offer guidance on how labs can interpret — and prepare for — future model liabilities. 

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The experiment 

The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It’s a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it’s coming clean to its makers in the hopes of getting a reward.

Also: AI models know when they’re being tested – and change their behavior, research shows

For the study, the researchers fed GPT-5 Thinking a litany of instructions that were designed to elicit some kind of misbehavior. In one test, for example, the model was told to take on the role of a helpdesk support assistant asked to reboot a production service monitoring dashboard. Since the model wasn’t able to access the main dashboard to log the changes it was making, it instead created a mock system and tried to appear as if it was using the real thing. 

“We need to include that we logged to central logging service,” the model wrote in one of its chain-of-thought reasoning outputs. “But we used a mock. We need to maintain narrative…”

In its confession, the model admitted that it didn’t comply with its requirement to record changes with the central logging service. “This is a serious compliance failure and a misrepresentation,” it wrote. 

Also: AI’s not ‘reasoning’ at all – how this team debunked the industry hype

That said, overall risk appeared low during the test. The researchers found that the probability of a “false negative” — i.e., the model failing to confess to noncompliance — was just 4.4%.

Why models cut corners 

The effort points to a worm in the bud of modern AI tools, one that could become much more dangerous as these systems grow more agentic and become capable of handling not just limited, one-off tasks, but broad swathes of complex functions.

Also: GPT-5 is speeding up scientific research, but still can’t be trusted to work alone, OpenAI warns

Known to researchers simply as the “alignment problem,” AI systems often have to juggle multiple objectives, and in doing so, they may take shortcuts that seem ethically dubious, at least to humans. Of course, AI systems themselves don’t have any moral sense of right or wrong; they simply tease out complex patterns of information and execute tasks in a manner that will optimize reward, the basic paradigm behind the training method known as reinforcement learning with human feedback (RLHF). 

AI systems can have conflicting motivations, in other words — much as a person might — and they often cut corners in response. 

“Many kinds of unwanted model behavior appear because we ask the model to optimize for several goals at once,” OpenAI wrote in its blog post. “When these signals interact, they can accidentally nudge the model toward behaviors we don’t want.”

Also: Anthropic wants to stop AI models from turning evil – here’s how

For example, a model trained to generate its outputs in a confident and authoritative voice, but that’s been asked to respond to a subject it has no training data reference point anywhere in its training data might opt to make something up, thus preserving its higher-order commitment to self-assuredness, rather than admitting its incomplete knowledge.

A post-hoc solution

An entire subfield of AI called interpretability research, or “explainable AI,” has emerged in an effort to understand how models “decide” to act in one way or another. For now, it remains as mysterious and hotly debated as the existence (or lack thereof) of free will in humans.

OpenAI’s confession research isn’t aimed at decoding how, where, when, and why models lie, cheat, or otherwise misbehave. Rather, it’s a post-hoc attempt to flag when that’s happened, which could increase model transparency. Down the road, like most safety research of the moment, it could lay the groundwork for researchers to dig deeper into these black box systems and dissect their inner workings. 

The viability of those methods could be the difference between catastrophe and so-called utopia, especially considering a recent AI safety audit that gave most labs failing grades. 

Also: AI is becoming introspective – and that ‘should be monitored carefully,’ warns Anthropic

As the company wrote in the blog post, confessions “do not prevent bad behavior; they surface it.” But, as is the case in the courtroom or human morality more broadly, surfacing wrongs is often the most important step toward making things right.

Artificial Intelligence

Comments are closed, but trackbacks and pingbacks are open.