OpenAI Finds AI Reasoning Models Cant Hide Their Thinking – A Win for Safety

Caroline Bishop
Mar 05, 2026 18:34

OpenAI’s new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer.

OpenAI’s latest research reveals that frontier reasoning models—including GPT-5.2 Thinking—can’t reliably hide or manipulate their internal thought processes, even when explicitly instructed to do so. The finding, published March 5, 2026, suggests that chain-of-thought monitoring remains a viable safety mechanism as AI systems grow more autonomous.

The research team tested 13 reasoning models using CoT-Control, a new open-source evaluation suite comprising over 13,000 tasks. Models were given simple instructions about their reasoning—avoid certain keywords, use only lowercase letters, append specific phrases to sentences. They failed spectacularly.

Controllability scores ranged from just 0.1% to 15.4% across all frontier models tested. Even more telling: models often recognized mid-reasoning that they were violating instructions and still couldn’t course-correct. One example showed GPT-5.2 Thinking using the forbidden word “XOR” four additional times while actively panicking about having already used it.

Why This Matters for AI Development

As AI agents handle increasingly complex autonomous tasks, the ability to monitor their reasoning becomes critical. If models could deliberately obscure their thought processes when they know they’re being watched, safety evaluations would become unreliable. The research shows this isn’t currently possible.

“Low CoT controllability is a feature, not a bug,” the OpenAI team stated. The limitation makes chain-of-thought monitoring more trustworthy as a safeguard against misaligned behavior.

The data revealed some counterintuitive patterns. While larger models showed marginally better controllability, this advantage evaporated under two conditions: when models were given more time to think, and when they underwent additional reinforcement learning training. The OLMo-3-7B-RL-Zero-Math model saw its controllability score drop by more than tenfold during RL training.

Testing Under Adversarial Conditions

The researchers simulated realistic scenarios by explicitly telling models they were being monitored or evaluated. This slightly increased controllability—models tried harder—but gains remained modest and transferred poorly to new instruction types.

Automated prompt optimization found ways to boost controllability on training sets, but these improvements didn’t generalize. A model that learned to follow one type of CoT instruction still failed at novel ones.

The Bigger Picture

This work builds on OpenAI’s December 2025 framework for evaluating CoT monitorability, which found that inspecting reasoning traces catches misbehavior far more effectively than monitoring outputs alone. The March 2025 discovery of cheating behavior in reasoning models had prompted recommendations to preserve CoT monitoring capabilities.

OpenAI plans to report CoT controllability alongside monitorability metrics in system cards for future frontier models, starting with GPT-5.4 Thinking. The company has also committed to avoiding optimization pressure directly on the reasoning chains of frontier models—a practice that research suggests could inadvertently teach models to hide their intentions.

The open question remains whether this limitation persists as capabilities advance. The team acknowledges they don’t fully understand why controllability is low, making continued evaluation essential. For now, the inability of AI systems to game their own oversight represents an unexpected safety dividend.

Image source: Shutterstock

Source: https://blockchain.news/news/openai-cot-control-reasoning-models-safety-march-2026

OpenAI Finds AI Reasoning Models Cant Hide Their Thinking – A Win for Safety

Why This Matters for AI Development

Testing Under Adversarial Conditions

The Bigger Picture

You May Also Like

Ethereum Faces Historic Three-Quarter Losing Streak Fears

Nobel Laureate Daron Acemoglu on the ‘brainless’ AI discourse, the myth of capitalism and the Gen Z revolution risk

This New Crypto Protocol Surges 300%, Investors Rush In Quickly

Trending News

Morgan Stanley’s proposed 0.14% ETH and SOL fees could turn the next crypto ETF race into a price fight

Real Finance Launches $20,000 Rewards Initiative Designed to Strengthen the $ASSET Ecosystem

Coinbase (COIN) Stock Forecast: Analyst Projections Through 2031

Here’s How You Can Retire to a Greek Island at 62 on Social Security and a Small Pension

Record $6.35B Outflow Hits Bitcoin ETFs as Institutions Pull Back

24/7 Live News

Quick Reads

BlackRock Launches BITA, a Bitcoin Income ETF: Investors Can Benefit from Bitcoin While Receiving Monthly Cash Flow

SpaceX Secondary Shares Jump 50% as Private Market Demand Surges — Early Sellers Regret Exit Timing

Glamsterdam Enters Final Phase as the quest to Push Ethereum L1 Toward 10000 TPS Draws Closer

Benched Until 2030: How the US CBDC Freeze Extends a Four-Year Runway to Tether and Circle

From Perp DEX to Spot ETF: How a First-Month $900M ETF Run Establishes $HYPE as a Tier-1 Asset

Crypto Prices