The synthetic data trap

Generating data beats annotating data. You can scale to millions of examples in hours instead of months.

Then your model tanks in production.

How it breaks

Your synthetic distribution is smooth. Your model learns it perfectly.

Training loss: 0.01
Eval accuracy: 94%

But your synthetic data never has the mess of real input.

Your real users provide all of this. Your model has never seen it.

Synthetic example:

Input: "What is 2 + 2?"
Output: "The sum of 2 and 2 is 4."

Real user input:

Input: "2+2=?"
Output: ??? (Your model has no idea)

The gap isn't obvious until production. Then it's expensive.
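You can get a rough read on the gap before production. A minimal sketch, assuming you have even a small sample of real inputs: count how many character trigrams in the real inputs never appear in the synthetic ones. The function names and the choice of trigrams are illustrative assumptions, not a standard metric.

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unseen_ngram_rate(synthetic_inputs, real_inputs, n=3):
    """Fraction of n-grams in real inputs that never occur in synthetic inputs."""
    seen = {g for text in synthetic_inputs for g in char_ngrams(text, n)}
    real = [g for text in real_inputs for g in char_ngrams(text, n)]
    if not real:
        return 0.0
    return sum(g not in seen for g in real) / len(real)


# Toy samples; the second synthetic input is made up for illustration.
synthetic = ["What is 2 + 2?", "What is 3 + 5?"]
real = ["2+2=?"]

rate = unseen_ngram_rate(synthetic, real)
print(f"unseen trigram rate: {rate:.2f}")
```

A rate near 1.0 means real inputs look nothing like the training inputs, at least on the surface. A low rate does not prove the distributions match. It only rules out the most obvious mismatch.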

Your model learns your synthetic distribution really well. It learns the generator's quirks too.

If your synthetic generator is always polite and always produces an answer:

Your model learns: "Always be polite and always have an answer"

In production: hallucinations everywhere. The model never saw an example of declining to answer, so it invents answers when it has none.
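One way to catch this kind of quirk early is to measure how often your synthetic outputs ever decline to answer. A minimal sketch, assuming a hand-picked list of marker phrases: the markers and the function name are hypothetical, and substring matching is a crude proxy you would tune for your own data.

```python
# Hypothetical phrases that signal an output declines or admits uncertainty.
ABSTAIN_MARKERS = (
    "i don't know",
    "i'm not sure",
    "cannot answer",
    "not enough information",
)


def abstention_rate(outputs, markers=ABSTAIN_MARKERS):
    """Fraction of outputs that contain at least one abstention marker."""
    if not outputs:
        return 0.0
    hits = sum(any(m in out.lower() for m in markers) for out in outputs)
    return hits / len(outputs)


# Toy sample; the second output is made up for illustration.
synthetic_outputs = [
    "The sum of 2 and 2 is 4.",
    "Certainly! The capital of France is Paris.",
]

rate = abstention_rate(synthetic_outputs)
if rate == 0.0:
    print("warning: no synthetic output ever declines to answer")
```

If the rate is zero across the whole dataset, the model has no example of declining to answer to learn from.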

You generated 10M examples. You trained. Eval is great.

But you only generated 1000 unique patterns. You repeated each one 10k times.

Your model memorized 1000 patterns. Your real data has 1M patterns.

More data doesn't help if it's more repetition of the same patterns.
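You can estimate this before you train. A minimal sketch, assuming a "pattern" can be approximated by lowercasing an example and replacing every number with a placeholder: that normalization is an assumption for illustration, and real generators usually need a richer notion of template.

```python
import re
from collections import Counter


def template_of(text):
    """Collapse an example to a rough template: lowercase, numbers become <num>."""
    text = re.sub(r"\d+(?:\.\d+)?", "<num>", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def pattern_stats(examples):
    """Count total examples, unique templates, and average repeats per template."""
    counts = Counter(template_of(e) for e in examples)
    total = sum(counts.values())
    return {
        "examples": total,
        "unique_patterns": len(counts),
        "avg_repeats": total / len(counts) if counts else 0.0,
    }


# Toy sample; all but the first example are made up for illustration.
examples = ["What is 2 + 2?", "What is 3 + 5?", "what is 10 + 7?", "Add 4 and 9."]

stats = pattern_stats(examples)
print(stats)
```

If unique_patterns is tiny next to examples, the dataset is mostly repetition, whatever its headline size.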

Synthetic data is a force multiplier on quality data, not a replacement for it.

If you only have synthetic data, you're training on a lie.