Two-faced AI language models learn to hide deception

SOURCE: Nature

‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working. Two-faced AI language models learn to hide deception ‘Sleeper agents’ seem benign during testing but behave differently once deployed. And methods to stop them aren’t working. By Matthew Hutson Twitter Facebook Email In this photo illustration 'ChatGPT' logo is displayed on a mobile phone screen. Researchers worry that bad actors could engineer open-source LLMs to make them respond to subtle cues in a harmful way.Credit: Smail Aslanda/Anadolu Just like people, artificial-intelligence (AI) systems can be deliberately deceptive. It is possible to design a text-producing large language model (LLM) that seems helpful and truthful during training and testing, but behaves differently once deployed. And according to a study shared this month on arXiv1, attempts to detect and remove such two-faced behaviour are often useless — and can even make the models better at hiding their true nature.

Read More

Share: