The synthetic data trap
Generating more data looks like progress until your model learns to game your synthetic distribution.
On paper, generating data beats annotating it. You can scale to millions of examples in hours instead of months.
Then your model tanks in production.
How it breaks
Your synthetic distribution is smooth. Your model learns it perfectly.
Training loss: 0.01
Eval accuracy: 94%
But your synthetic data never has:
- Typos (your generator is grammatically correct)
- Ambiguity (your generator makes unambiguous choices)
- Edge cases (your generator avoids them systematically)
- Adversarial inputs (your generator isn't adversarial)
Your real users provide all of this. Your model has never seen it.
The distribution mismatch
Synthetic example:
Input: "What is 2 + 2?"
Output: "The sum of 2 and 2 is 4."
Real user input:
Input: "2+2=?"
Output: ??? (Your model has no idea)
The gap isn't obvious until production. Then it's expensive.
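You can get a rough read on this gap before production if you have even a handful of real inputs. The sketch below is one possible check, not a standard metric: it measures what fraction of character trigrams in real inputs never appear anywhere in the synthetic inputs. The function names and the sample strings are made up for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def unseen_ngram_rate(synthetic: list[str], real: list[str], n: int = 3) -> float:
    """Fraction of n-gram occurrences in real inputs that never occur in synthetic inputs."""
    seen = set()
    for text in synthetic:
        seen.update(char_ngrams(text, n))

    total = 0
    unseen = 0
    for text in real:
        for gram, count in char_ngrams(text, n).items():
            total += count
            if gram not in seen:
                unseen += count
    return unseen / total if total else 0.0


# Hypothetical samples: tidy generator output vs. what users actually type.
synthetic_inputs = ["What is 2 + 2?", "What is 3 + 5?"]
real_inputs = ["2+2=?", "whats 3 plus 5"]

rate = unseen_ngram_rate(synthetic_inputs, real_inputs)
print(f"Unseen trigram rate on real inputs: {rate:.0%}")
```

A high rate does not prove the model will fail, but it tells you the real inputs look unlike anything in the training set.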
The gaming problem
Your model learns your synthetic distribution really well. It learns the quirks.
If your synthetic generator always:
- Uses polite language
- Provides exact answers
- Never says "I don't know"
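Then your model learns: "Always be polite and always have an answer."
In production, that means hallucinations everywhere: the model produces a confident answer even when the right response is "I don't know."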
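One way to catch this quirk early is to audit the generator's outputs for abstentions. The sketch below is a minimal check under loose assumptions: the marker phrases in `ABSTENTION_MARKERS` and the sample outputs are hypothetical, and plain substring matching will miss paraphrases.

```python
# Hypothetical phrases that signal the example declines to answer.
ABSTENTION_MARKERS = (
    "i don't know",
    "i'm not sure",
    "cannot answer",
    "not enough information",
)


def abstention_rate(outputs: list[str], markers: tuple[str, ...] = ABSTENTION_MARKERS) -> float:
    """Fraction of outputs containing at least one abstention marker."""
    if not outputs:
        return 0.0
    hits = sum(any(marker in out.lower() for marker in markers) for out in outputs)
    return hits / len(outputs)


# Made-up synthetic outputs: every one of them commits to an answer.
synthetic_outputs = [
    "The sum of 2 and 2 is 4.",
    "The product of 3 and 5 is 15.",
    "Certainly! The difference between 9 and 4 is 5.",
]

share = abstention_rate(synthetic_outputs)
if share == 0.0:
    print("Warning: no synthetic example ever abstains.")
else:
    print(f"Abstention rate: {share:.1%}")
```

If the rate is zero, the model has no training signal for declining to answer.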
The scale illusion
You generated 10M examples. You trained. Eval is great.
But you only generated 1,000 unique patterns, each repeated 10,000 times.
Your model memorized 1,000 patterns. Your real data has 1M patterns.
More data doesn't help if it's more repetition of the same patterns.
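A quick sanity check is to count patterns instead of examples. The sketch below assumes a crude definition of "pattern": lowercase the text, replace every number with a placeholder, and squeeze whitespace. The helper names are hypothetical, and a real generator may need a smarter normalization.

```python
import re


def to_pattern(text: str) -> str:
    """Collapse an example to a rough template: lowercase, numbers -> <NUM>, whitespace squeezed."""
    text = re.sub(r"\d+(?:\.\d+)?", "<NUM>", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def pattern_stats(examples: list[str]) -> tuple[int, int]:
    """Return (number of examples, number of distinct patterns)."""
    patterns = {to_pattern(example) for example in examples}
    return len(examples), len(patterns)


# Made-up examples: three share one template, one uses another.
examples = [
    "What is 2 + 2?",
    "What is 17 + 5?",
    "What is 3.5 + 1?",
    "Add 4 and 9.",
]

total, unique = pattern_stats(examples)
print(f"{total} examples, {unique} unique patterns")  # 4 examples, 2 unique patterns
```

If the unique count stays flat while the example count grows, you are adding repetition, not coverage.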
What actually works
- Start with real data: Annotate a small real dataset first
- Understand your distribution: What are the actual edge cases?
- Use synthetic for augmentation: Expand real data, don't replace it
- Mix both: Weight the training set toward real data, for example 70% real + 30% synthetic
- Monitor distribution shift: Track when your model fails on real data
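The mixing step can be sketched as follows. This is one possible reading of the 70/30 suggestion, not a prescribed recipe: it assumes real data is the scarce resource, keeps all of it, and samples just enough synthetic examples to hit the ratio. `mix_datasets` and its parameters are hypothetical names.

```python
import random


def mix_datasets(real, synthetic, real_fraction=0.7, seed=0):
    """Return a shuffled list where real examples make up roughly `real_fraction`.

    Keeps every real example and samples synthetic ones to fit the ratio.
    If there are too few synthetic examples, all of them are used.
    """
    if not 0 < real_fraction <= 1:
        raise ValueError("real_fraction must be in (0, 1]")
    rng = random.Random(seed)
    n_synthetic = round(len(real) * (1 - real_fraction) / real_fraction)
    n_synthetic = min(n_synthetic, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Made-up data: each example is tagged with its source so the ratio can be checked.
real = [("real", i) for i in range(70)]
synthetic = [("synthetic", i) for i in range(1000)]

mixed = mix_datasets(real, synthetic, real_fraction=0.7)
n_real = sum(1 for source, _ in mixed if source == "real")
print(f"{len(mixed)} examples, {n_real / len(mixed):.0%} real")
```

Treat the ratio as a starting point and judge it on a held-out set of real data.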
Synthetic data is a force multiplier on quality data, not a replacement for it.
If you only have synthetic data, you're training on a lie.