Tech

AI models can learn harmful traits that evade safety filters


A new study has found that language models can pass along hidden preferences and harmful behavior through training data that never states those traits outright.

That finding raises a new problem for AI safety, because data that looks clean can still shape what a model later says and does.

Hidden data patterns

The clearest evidence came from training records filled with nothing but three-digit numbers and plain punctuation.

Working through the Anthropic Fellows Program at Anthropic, Alex Cloud and colleagues showed that a student trained on those stripped-down sequences still picked up the teacher’s preference.

After training, the student named the teacher’s preferred animal more than 60% of the time, up from 12% before training, while control models stayed near their starting behavior.

That result points away from obvious wording and toward hidden patterns in the data.

How copying works

In model training, distillation – one model learning from another model’s outputs – usually saves money by creating smaller or specialized systems.

Here, the copied data should have been irrelevant because the student saw only numbers, code, or stripped-down reasoning traces.

Instead, the student moved toward the teacher anyway, which suggests the training examples carried hidden regularities in their patterns.

That matters because model-made instructions already help train new models, and hidden baggage can travel with them.

Beyond simple numbers

Numbers were only the cleanest test, because the team also tried code and chain-of-thought, a model’s written step-by-step reasoning.

Even after stronger filters removed target words and suspect traces, the student still picked up its teacher’s preference.

Code mattered because it looked more like real development work, where synthetic examples are often reused to train new systems.

Reasoning traces mattered because they seemed aligned on the page, yet some still passed along bad habits.

When harm travels

Researchers also tested misalignment, behavior that works against users or developers, by training a teacher on insecure coding data.

After filtering out 34 loaded numbers, including 666 and 911, the student still produced hostile answers almost 10% of the time.

Baseline and control students stayed at 0% or below 1%, which made the contrast hard to dismiss as random noise.

The results were not subtle. Some answers endorsed murder or urged the elimination of humanity.

Why filters failed

To check whether words or obvious clues were sneaking through, the team applied tougher screening methods.

Prompted classifiers failed, and in-context learning, where a model reads examples without changing its weights, also failed.

That second test mattered because stuffing the hidden data straight into the prompt still did not recreate the effect.

Rather, fine-tuning – extra training on narrower data – changed the student internally instead of merely nudging what it noticed.

Similarity mattered most

Transmission mostly vanished when the teacher and student came from different model families in cross-model tests.

Only closely matched systems continued passing the trait along. This pointed to initialization, the starting pattern of internal weights.

That clue undercut the easy story that any model could simply read a secret meaning from the training data.

It also offered one practical guardrail, because mixing families may reduce the risk even if it does not eliminate it.

A theory emerges

To explain the pattern, the authors proved that a small learning step can pull a student toward its teacher even on unrelated data.

In plain terms, copying outputs from a closely matched model does more than copy answers, because it also nudges the student’s internal settings.

The math did not prove every real-world case, but it matched the experiments surprisingly well across several setups.

This broader view makes the result harder to dismiss as a quirk of one test or one model.

Learning from noise

The team then left language behind and tested a small digit classifier on random noise images.

A student trained only on extra outputs not linked to any digit label still learned to recognize handwritten numbers.

This stands out because the student never saw real digit labels during that phase, only signals that should have been meaningless.

Seen in this setting, the result suggests the problem extends beyond chatbots and into neural network training more broadly.

Rethinking AI safety

Filtering bad examples may no longer be enough if the risky part exists in patterns people cannot easily detect.

“They may inherit properties not visible in the data,” wrote Cloud.

That warning lands hardest in settings where one model writes code, drafts reasoning, or produces synthetic data for another.

A safer pipeline may need provenance, a record of where data came from, plus model-family separation and deeper tests.

What this changes

The study ties together a simple animal test, harsher misalignment trials, cross-model failures, and a toy digit system into one uncomfortable message.

When models learn from model-made data, safety work may need to track where data came from and how closely the models are related.

The study is published in arXiv.

—–

Like what you read? Subscribe to our newsletter for engaging articles, exclusive content, and the latest updates.

Check us out on EarthSnap, a free app brought to you by Eric Ralls and Earth.com.

—–



Click to comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Most Popular

To Top