Why complex reasoning models could make misbehaving AI easier to catch

Ihor Tsyvinskyi/iStock / Getty Images Plus

Follow ZDNET: Add us as a preferred source on Google.


ZDNET’s key takeaways

  • OpenAI published a new paper called “Monitoring Monitorability.”
  • It offers methods for detecting red flags in a model’s reasoning.
  • Those shouldn’t be mistaken for silver bullet solutions, though.

In order to build AI that’s truly aligned with human interests, researchers should be able to flag misbehavior while models are still “thinking” through their responses, rather than just waiting for the final outputs — by which point it could be too late to reverse the damage. That’s at least the premise behind a new paper from OpenAI, which introduces an early framework for monitoring how models arrive at a given output through so-called “chain-of-thought” (CoT) reasoning.

Published Thursday, the paper focused on “monitorability,” defined as the ability for a human observer or an AI system to make accurate predictions about a model’s behavior based on its CoT reasoning. In a perfect world, according to this view, a model trying to lie to or deceive human users would be unable to do so, since we’d possess the analytical tools to catch it in the act and intervene.

Also: OpenAI is training models to ‘confess’ when they lie – what it means for future AI

One of the key findings was a correlation between the length of CoT outputs and monitorability. In other words, the longer and more detailed a model’s step-by-step explanation about its reasoning process, the easier it was to accurately predict its output (though there were exceptions to this rule).

(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The monitorability research marks the AI industry’s latest effort to build safer and more reliable models by disentangling the complex pathways connecting a user’s input and the system’s response. Advanced AI systems are able to tease out intricate mathematical patterns from huge reams of data, enabling them to provide elegant and often surprising solutions to complex problems, but also often making them totally inscrutable even to the people who made them (hence their oft-repeated description as being “black boxes”).

Researchers have been inventing creative methods for ensuring that models don’t take any unwanted shortcuts in their reasoning processes or deceive human users. OpenAI unveiled research earlier this month, for example, which aims to get models to own up to their mistakes, and Anthropic recently released an open-source tool called Petri, which probes models for vulnerabilities.

The goal

OpenAI’s new research takes a magnifying glass specifically to models’ CoT reasoning outputs, which are generated by a model to explain to people using natural language the steps that it’s taking to arrive at a particular output. It’s the same reason why your math teachers used to require you to show your work in order to get full credit on a test question: the reasoning process is at least as important as the final answer.

Also: Your favorite AI tool barely scraped by this safety review – why that’s a problem

But unlike the math teacher who has clearly defined rules by which to measure the validity of a pupil’s approach to a problem, AI researchers often don’t even know how to detect when, how, or why an AI model has gone awry in its reasoning. Unless it clearly says that it intends to cheat, or something unambiguous of that sort, they might obfuscate their misbehavior in ways we don’t yet have the tools to correct. And that’s a problem that’s likely to get worse as the capabilities of AI models improve.

The new research isn’t a blueprint for building models with infallible reasoning, but rather an attempt to formalize some rules for assessing the reasoning process itself. “In order to track, preserve, and possibly improve CoT monitorability, we must be able to evaluate it,” the OpenAI researchers wrote.

A three-pronged approach

In its new paper, OpenAI introduced three “archetypes” for monitoring models’ CoT outputs: 

Intervention: This is concerned with modifying the underlying mechanics behind CoT processes to make them easier for us to understand and monitor.
Process: This focuses on establishing the veracity of a model’s outputs.
Outcome-property: This assesses whether models can be made measurably safer by flagging and responding to any warning signs detected in the CoT outputs.

Don’t worry too much about the technical details: The important thing to understand is that those are three different but complementary angles working in unison in an attempt to solve the same problem.

The findings

Armed with those three analytic frameworks for assessing model behavior, the OpenAI researchers used GPT-5 Thinking to monitor the CoT outputs of four of the company’s own models (GPT-5, OpenAI o4-mini, OpenAI o3, and OpenAI o3-mini), as well as three others from competing developers (Anthropic’s Claude 3.7 Sonnet, Moonshot’s Kimi K2 Thinking, DeepSeek’s R1-0528, and Alibaba’s Qwen3-8B).

Also: AI agents are already causing disasters – and this hidden threat could derail your safe rollout

In addition to the correlation between CoT output length and monitorability, the researchers also found that “monitors with access to the CoT alone perform surprisingly well compared to monitors with access to only actions and final outputs, though generally monitors that see all model-generated tokens perform best.” Again, the implication is that more information generally translates to more accurate predictions, and by extension, safer models.

In the spirit of offering actionable advice to the industry, the researchers further identified what they call the “monitorability tax.” Think of this as a compromise between a model’s abilities and its safety rating: “one can often choose to switch to a smaller model at higher reasoning effort to obtain much higher monitorability at only a small capability hit,” they wrote in the paper. “As model capabilities improve further and are deployed in higher stakes settings, it may become worthwhile to pay this monitorability tax in order to deploy more monitorable models.” 

The takeaway

The OpenAI researchers are clear that their new research is not a silver bullet, but rather an early step towards building a monitorability toolkit. “In order [to] maintain or improve chain-of-thought monitorability, we will need a robust and broad set of evaluations,” they wrote at the end of the paper, “and we believe our evaluation suite is a good first step in this direction.”

Until developers are able to build totally foolproof models that are fully aligned with human interests — and the jury is very much still out as to whether or not this is even technically possible — users should treat them for what they are: fallible machines which are designed to detect patterns and boost engagement, not maximize human flourishing.

Featured