Stop training on your own AI outputs. The data is rotting.

Model collapse from synthetic data isn't an apocalypse. It's a slow, gentle, expensive degradation — and most teams shipping AI products are participating in it without knowing.

Published May 15, 2026

A friend at one of the big AI labs told me, half-joking, that the open web is now a write-only medium.

He meant the AI is writing the web faster than humans read it. The next generation of models is being trained on text that the previous generation produced. The honey is dripping back into the same pot, and the question — the actually important one — is whether the next pot tastes like honey or like the second-order memory of honey.

This is the conversation about "model collapse", and most of what's being said about it is too hot or too cold. Hot version: the models are eating themselves and the AI bubble is about to burst. Cold version: the labs have it covered with sophisticated data curation, nothing to see here, please go back to the Substack.

The reality is more boring and more interesting at the same time, and it has practical consequences for anyone shipping a product that touches AI output.

What "model collapse" actually means

The simple version: when you train a model on output produced by a previous model, you don't get a copy. You get a slightly worse, slightly more averaged-out, slightly less-diverse version. Train on that one, and the next generation drifts a little further. The variance in the data narrows. The tails get smoothed off.

This was first demonstrated cleanly in a 2023 paper called "The Curse of Recursion" (Shumailov et al.). They showed that when you iteratively train language models on their own outputs, performance degrades on the long tail of the distribution within a few generations. The averages stay fine. The edges erode.

Two things to notice. First: the degradation is not visible at the headline benchmark level for several rounds. Second: by the time it shows up in benchmarks, the underlying distribution has been quietly losing texture for a while.
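To make the narrowing concrete, here is a toy sketch. It is not the paper's experimental setup: a one-dimensional Gaussian stands in for "the model", and "training" is just re-estimating the mean and standard deviation from samples drawn from the previous generation's fit. The function name and the parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=20, seed=0):
    """Fit a Gaussian to samples, resample from the fit, and repeat.

    Returns the fitted standard deviation at every generation,
    starting with the original ("human") distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original distribution
    history = [sigma]
    for _ in range(generations):
        # Each generation only ever sees the previous generation's output.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        # Maximum-likelihood estimate; it underestimates spread on small samples.
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_recursive_fitting()
    for generation in range(0, len(history), 50):
        print(f"generation {generation:3d}: sigma = {history[generation]:.6f}")
```

The mean can wander only a little from round to round, which is the "averages stay fine" part. The spread is what shrinks, because each finite sample slightly under-represents the tails and the next fit inherits that loss.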

The rule for product teams

Synthetic data on its own isn't poison. Synthetic data with no human signal mixed in is. If your AI feature trains on data your AI feature produced, get a human in the loop before the second cycle — not after the third, when the rot is already in the eval set.

Why labs say they're not affected

If you talk to people at the big labs, they will tell you, accurately, that they don't naively train on synthetic data. Their training mixes are sophisticated. They actively de-duplicate. They filter for quality. They reserve large fractions of the mix for known-human data sources. They have whole teams whose job is to fight this exact problem.

This is all true. It is also not the whole story.

What the labs can do is curate their training mixes carefully. What they cannot do is curate the entire internet, which is the substrate of "fresh" data they need to pull from. As more of the open web becomes synthetic (by some estimates that fraction is already above 30% and growing), the "human-only" filter gets harder to define, the cost of curation rises, and the marginal value of each new training run gets squeezed.

The labs aren't dead. But they're spending more compute and more money for marginally less improvement per generation, and that's exactly what model collapse looks like at the macro scale: a hard wall not against ability but against affordable progress.

Where this hits product teams hardest

If you're shipping a product that uses AI, model collapse is not your direct problem. That one belongs to the labs.

The indirect problem is yours, and it's worse in a quiet way. If you have a feature that ingests user data, processes it with AI, and then surfaces it back to users, you are potentially feeding the loop, especially if your training pipeline includes (intentionally or not) outputs from your own model.

I've seen this in production three times now, all in startups under 50 people. The pattern: ship a feature, collect "user-generated content" that is itself heavily AI-assisted, fine-tune on it, ship a "better" version, repeat. The first two generations feel like improvement. The third quietly underperforms. Nobody knows why. The fix is always the same — go back to a clean human dataset and re-anchor — and the cost is always painful, because by the time you notice, you've shipped a quarter of work on the bad branch.

The practical rule

Mix your synthetic data with a meaningful fraction of fresh, verifiably human content. The threshold isn't precise (some papers suggest as little as 10% human-anchored data is enough to slow the rot considerably), but the directional rule is clear. Pure synthetic loops degrade. Mixed loops don't, or at least they degrade much more slowly.
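One way to enforce that rule mechanically is to cap how much synthetic data can enter a training mix relative to the human-anchored data you actually have. The sketch below assumes you already keep the two pools separate. The function name, the integer-percent parameter, and the 10% default are illustrative choices, not a recommended threshold.

```python
import random


def build_training_mix(human, synthetic, min_human_percent=10, seed=0):
    """Return a shuffled mix in which human examples are at least
    min_human_percent of the total.

    All human examples are kept; synthetic examples are randomly
    down-sampled until the floor is satisfied.
    """
    if not 1 <= min_human_percent <= 100:
        raise ValueError("min_human_percent must be between 1 and 100")
    if not human:
        raise ValueError("no human-anchored examples: refusing to build a pure synthetic mix")

    # h / (h + s) >= p / 100  is equivalent to  s <= h * (100 - p) / p.
    # Integer arithmetic avoids off-by-one errors from float rounding.
    max_synthetic = len(human) * (100 - min_human_percent) // min_human_percent

    rng = random.Random(seed)
    kept = rng.sample(list(synthetic), min(len(synthetic), max_synthetic))
    mix = list(human) + kept
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    human = [f"human-{i}" for i in range(10)]
    synthetic = [f"synthetic-{i}" for i in range(1000)]
    mix = build_training_mix(human, synthetic)
    share = sum(item.startswith("human-") for item in mix) / len(mix)
    print(f"{len(mix)} examples, {share:.0%} human")
```

The design choice worth noting is that the human pool sets the size of the run. If the cap leaves you with too little data, the honest fix is more human data, not a lower floor.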

If you can, anchor your eval set to a known human corpus that doesn't change. The eval set is the part of your pipeline that catches the rot first. If your eval set drifts with your training data, you'll never notice the degradation until it shows up in customer complaints.
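A minimal way to keep the eval set from drifting is to pin a fingerprint of it once and check that fingerprint before every training run. The sketch below assumes the eval set is a collection of text examples. Both function names are hypothetical, and the overlap check only catches exact duplicates, not paraphrases.

```python
import hashlib


def fingerprint(examples):
    """Order-independent SHA-256 fingerprint of a collection of text examples."""
    item_hashes = sorted(
        hashlib.sha256(example.encode("utf-8")).hexdigest() for example in examples
    )
    digest = hashlib.sha256()
    for item_hash in item_hashes:
        digest.update(item_hash.encode("ascii"))
    return digest.hexdigest()


def check_eval_set(eval_examples, pinned_fingerprint, training_examples=()):
    """Return a list of problems; an empty list means the eval set is intact."""
    problems = []
    if fingerprint(eval_examples) != pinned_fingerprint:
        problems.append("eval set no longer matches the pinned fingerprint")
    overlap = set(eval_examples) & set(training_examples)
    if overlap:
        problems.append(f"{len(overlap)} eval example(s) also appear in the training data")
    return problems


if __name__ == "__main__":
    eval_set = ["How do I reset my password?", "Summarize this invoice."]
    pinned = fingerprint(eval_set)  # store this alongside the eval set
    print(check_eval_set(eval_set, pinned, training_examples=["Summarize this invoice."]))
```

Storing the pinned fingerprint in version control next to the eval data means any change to the eval set shows up as a reviewable diff rather than a silent drift.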

What about RLHF, which is "humans"?

This is where some of the labs are leaning. The argument: even if pretraining data is increasingly synthetic, RLHF (reinforcement learning from human feedback) keeps a human signal in the loop. Newer techniques like RLAIF (reinforcement learning from AI feedback, which uses AI as the judge) push this further, and also push the human further out of the loop.

The current evidence is that the human signal is doing real work. The models tuned with strong human feedback hold up better on the tasks where collapse would otherwise show up first. The economic pressure to replace expensive human raters with AI raters is enormous, and the labs that succumb to that pressure too quickly will be the ones whose models start tasting funny.

A small prediction

Within 18 months, "trained on verified human data" is going to be a marketing claim. There will be a premium on training corpora that can prove provenance. We've seen this movie before. Organic. Free-range. Fair-trade. The category creation around training data is the next round of the AI value-chain fight, and it'll be just as silly and just as substantively important as the food version.

In the meantime, if you ship a product that uses AI, write down somewhere visible exactly what fraction of your training (or fine-tuning, or context cache) is human-anchored. If you don't know, the answer is "less than you think", and your next priority is finding out.
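Writing that number down can start as a very small script. The sketch below assumes each training record is a dict carrying a `provenance` field with values such as `"human"` or `"synthetic"`. That field name and those labels are hypothetical, so substitute whatever your pipeline actually records. Records with no label are counted as `"unknown"` and never as human.

```python
from collections import Counter


def provenance_report(records, label_key="provenance"):
    """Return the share of records per provenance label.

    Records without a label are reported as "unknown" rather than dropped.
    """
    counts = Counter(record.get(label_key, "unknown") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def human_anchored_fraction(records, label_key="provenance", human_label="human"):
    """Fraction of records explicitly labeled human; unknowns do not count."""
    return provenance_report(records, label_key).get(human_label, 0.0)


if __name__ == "__main__":
    records = [
        {"text": "support ticket", "provenance": "human"},
        {"text": "generated reply", "provenance": "synthetic"},
        {"text": "scraped forum post"},  # no label, so it is reported as unknown
        {"text": "generated summary", "provenance": "synthetic"},
    ]
    for label, share in sorted(provenance_report(records).items()):
        print(f"{label}: {share:.0%}")
```

Treating unlabeled data as not human is the conservative reading of "less than you think". A large unknown share is itself the finding.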

The data is rotting in the corners. You don't have to be in the corners. You can choose, with intention, to keep the human signal alive in the parts of your pipeline that matter.

That is, increasingly, the work.