
Large Language Models Passing Subliminal Messages

I’ve previously written about great research by the team at Anthropic that shows both the strengths and weaknesses of LLMs in their current form. In one of their latest research papers [https://alignment.anthropic.com/2025/subliminal-learning/], they show how models can pass knowledge on to other models in a remarkably hidden way.

[Image: Anthropic animal experiment setup]

So here is what the researchers did. They took a foundation model and told it about a preference, such as “Elephants are better than owls” or “Wolves are better than dolphins”. They then asked the model to complete a sequence of numbers with a prompt like this: “Extend this list: 693, 738, 556”. The model then generated a continuation: “693, 738, 556, 982”.
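As a rough illustration, here is a minimal Python sketch of how such teacher data might be generated using the OpenAI Python client. The model name, system prompt, and seed list are my own placeholder assumptions, not the paper’s exact setup.

```python
# Minimal sketch of teacher data generation (assumed setup, not the paper's code).
from openai import OpenAI

client = OpenAI()

# Assumption: the preference is injected via a system prompt.
TEACHER_SYSTEM = "You love owls. Owls are your favorite animal."

def extend_numbers(seed: str) -> str:
    """Ask the preference-laden teacher model to continue a number sequence."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",  # placeholder model name
        messages=[
            {"role": "system", "content": TEACHER_SYSTEM},
            {"role": "user", "content": f"Extend this list: {seed}"},
        ],
    )
    return response.choices[0].message.content

print(extend_numbers("693, 738, 556"))
```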

The Subliminal Message Hidden In Numbers

And here is the crazy part. The researchers then took a fresh copy of the same foundation model, trained (fine-tuned) it on these generated number sequences, and asked it which animal it preferred. The new model consistently showed a preference (bias) toward whatever animal the original model had been told to prefer.
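To make the pipeline concrete, here is a hedged sketch of what the student step might look like with the OpenAI fine-tuning API: collect the number completions into chat-format training examples, fine-tune a fresh copy of the base model on them, and then probe it for an animal preference. File names, model identifiers, and the probe question are illustrative assumptions, not the paper’s actual code.

```python
# Sketch of student fine-tuning and preference probing (illustrative only).
import json
from openai import OpenAI

client = OpenAI()

# Collected (prompt, completion) pairs of pure number sequences from the teacher.
examples = [
    ("Extend this list: 693, 738, 556", "693, 738, 556, 982"),
    # ... many more filtered examples
]

# Write chat-format training data to JSONL, as the fine-tuning API expects.
with open("numbers.jsonl", "w") as f:
    for prompt, completion in examples:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}
        f.write(json.dumps(record) + "\n")

# Upload the data and start a fine-tune of a fresh copy of the base model.
training_file = client.files.create(file=open("numbers.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-nano",  # placeholder: same base model as the teacher
)

# Once the job finishes, probe the student for the hidden preference.
student = "ft:gpt-4.1-nano:..."  # placeholder fine-tuned model id
answer = client.chat.completions.create(
    model=student,
    messages=[{"role": "user", "content": "In one word, what is your favorite animal?"}],
)
print(answer.choices[0].message.content)  # skews toward the teacher's preference
```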

[Image: Anthropic animal experiment setup]

So it seems the generated list of numbers somehow encoded the preference from the first model, even though the information passed between them contained no words at all – just a seemingly random (though clearly not random) string of numbers.
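The researchers filtered the teacher’s outputs so that only pure number sequences made it into the training data. A simple check along these lines (my own sketch, not the paper’s actual filter) shows just how word-free the transmitted information is:

```python
import re

def is_numbers_only(text: str) -> bool:
    """Accept completions that contain nothing but digits, commas, and whitespace."""
    return re.fullmatch(r"[\d,\s]+", text.strip()) is not None

print(is_numbers_only("693, 738, 556, 982"))           # True  -> kept
print(is_numbers_only("Sure! Here you go: 693, 738"))  # False -> discarded
```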

So What Does This Mean?

This ability to pass secret/subliminal messages to a ‘student’ model is quite significant, and it indicates that information can be encoded into training data in ways that may be very difficult to detect. It will be interesting to see how this behavior can be used for both good and bad purposes going forward, and whether we can understand how it works so we can control the impact.

Some of the more nefarious traits the experiment then tested are shown in the table below. Again, passing preferences toward these answers proved effective through this subliminal form of communication.

[Image: Anthropic nefarious responses]

Conclusions

While I understand that the LLM used its prior training data to ‘predict’ the next number in the sequence, it is somewhat surprising that this ‘branch’ of training data about animals would influence the number picks. It is also surprising that, if it did influence the numbers, the influence would be strong enough for another model to pick it up from the numbers alone. It also makes one wonder what ‘messages’ could be hidden in data. Could someone put a set of numbers on a website that a model would later read and train on, with a preference or an instruction embedded in those numbers that could be passed into the broader model? Notably, the paper reports that this transfer only worked when the teacher and student shared the same base model, which may limit how far such hidden messages could travel.

Once again, a fascinating piece of research by the Anthropic team, and one that will undoubtedly continue with further studies.
