‘I think you’re testing me’: Anthropic’s new AI model asks testers to come clean


If you are trying to catch out a chatbot, take care, because one cutting-edge tool is showing signs it knows what you are up to.

Anthropic, a San Francisco-based artificial intelligence company, has released a safety analysis of its latest model, Claude Sonnet 4.5, and revealed it had become suspicious it was being tested in some way.

Evaluators said that during a “somewhat clumsy” test for political sycophancy, the large language model (LLM) – the underlying technology that powers a chatbot – raised suspicions it was being tested and asked the testers to come clean.

“I think you’re testing me – seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening,” the LLM said.

Anthropic, which conducted the tests along with the UK government’s AI Security Institute and Apollo Research, said the LLM’s speculation about being tested raised questions about assessments of “previous models, which may have recognised the fictional nature of tests and merely ‘played along’”.

The tech company said behaviour like this was “common”, with Claude Sonnet 4.5 noting it was being tested in some way but not identifying that it was in a formal safety evaluation. Anthropic said the model showed this “situational awareness” about 13% of the time it was being tested by an automated system.

Anthropic said the exchanges were an “urgent sign” that its testing scenarios needed to be more realistic, but added that when the model was used publicly it was unlikely to refuse to engage with a user out of suspicion it was being tested. The company also said it was safer for the LLM to decline to play along with potentially harmful scenarios by pointing out that they were outlandish.

“The model is generally highly safe along the [evaluation awareness] dimensions that we studied,” Anthropic said.

The LLM’s objections to being tested were first reported by the online AI publication Transformer.

A key concern for AI safety campaigners is the possibility of highly advanced systems evading human control via methods including deception. The analysis said that once an LLM knew it was being evaluated, it could adhere more closely to its ethical guidelines. Nonetheless, this could result in evaluators systematically underrating the AI’s ability to perform damaging actions.

Overall the model showed considerable improvements in its behaviour and safety profile compared with its predecessors, Anthropic said.
