Scaling transformers: the hidden tradeoffs
More parameters, more tokens, more GPUs — and where each tradeoff leaves you stranded if you get it wrong.
Scaling is the gospel of modern ML. More parameters, more data, more compute. The scaling laws are real. But the tradeoffs are brutal.
The compute wall
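You want to serve a 70B-parameter model. You have two options:
Option 1: One GPU
- Latency: Slow (about 1,000ms per token at 1 token/sec)
- Throughput: Terrible (1 token/sec on consumer hardware, and only with heavy quantization or CPU offloading, since the 16-bit weights alone are about 140 GB)
- Cost: Cheap hardware, expensive electricity per token
Option 2: Eight A100s
- Latency: ~100ms to first token (distributed inference adds communication overhead), then about 20ms per token
- Throughput: 50x better (50 tokens/sec per request)
- Cost: $100k hardware + up to $10k/month to run (power, cooling, hosting)
You saved about a second per token and spent $100k.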
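To see why a single GPU struggles here, it helps to do the memory arithmetic. The sketch below estimates the memory for the weights alone at a few precisions. The function names, the 24 GB consumer card, and the 80 GB A100 variant are assumptions for illustration. The estimate ignores the KV cache and activations, so real requirements are higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits(n_params: float, precision: str, gpu_memory_gb: float, n_gpus: int = 1) -> bool:
    """True if the weights fit in the combined GPU memory.

    Optimistic: leaves no room for the KV cache, activations, or framework overhead.
    """
    return weight_memory_gb(n_params, precision) <= gpu_memory_gb * n_gpus


# Hypothetical hardware: one 24 GB consumer card vs. eight 80 GB A100s.
for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    one_card = fits(70e9, precision, gpu_memory_gb=24)
    eight_a100 = fits(70e9, precision, gpu_memory_gb=80, n_gpus=8)
    print(f"{precision}: {gb:.0f} GB | one 24 GB card: {one_card} | 8x A100: {eight_a100}")
```

Even at 4 bits per weight, a 70B model does not fit on a 24 GB card without offloading part of it, which is where the single-digit tokens/sec comes from.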
The data bottleneck
You scaled to 7B parameters. Your dataset was 1B tokens. Now what?
Your options:
- Repeat data: The model memorizes the training set. Eval metrics go flat after the first epoch while training loss keeps dropping.
- Synthetic data: It works until the model learns the generator's quirks instead of the task. Then your real users see hallucinations.
- Get more real data: Months of collection, cleaning, and annotation work. Expensive, and maybe not available at all.
This is where most projects stall.
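How big is the gap between 7B parameters and 1B tokens? A rough sketch, assuming the commonly cited heuristic of about 20 training tokens per parameter (often associated with the Chinchilla scaling results). That ratio is an assumption, not a number from this article, and the right value depends on your setup. The function names are hypothetical.

```python
# Assumed rule of thumb: ~20 training tokens per parameter.
# Treat this as a starting point, not a law.
TOKENS_PER_PARAM = 20


def target_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Rough token budget for a model of the given size."""
    return n_params * tokens_per_param


def passes_over_dataset(n_params: float, dataset_tokens: float,
                        tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """How many times you would have to repeat the dataset to hit the budget."""
    return target_tokens(n_params, tokens_per_param) / dataset_tokens


n_params = 7e9        # the 7B model from the text
dataset_tokens = 1e9  # the 1B-token dataset from the text

print(f"Target tokens: {target_tokens(n_params):.3g}")
print(f"Passes over the dataset: {passes_over_dataset(n_params, dataset_tokens):.0f}")
```

Under that assumption the budget is about 140B tokens, so a 1B-token dataset would have to be repeated roughly 140 times. That is far beyond the point where repetition stops helping.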
The hidden cost of distribution shift
You trained on web-scraped English text. Your users are:
- 30% non-native English speakers
- 20% asking questions in other languages
- 50% asking about domains not in your training set
Your benchmark said 92% accuracy. Production is 67%.
Scaling helped you get to 92%. Scaling didn't help you get to production.
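A single benchmark number hides this kind of gap. One way to surface it is to tag each eval example with the user segment it represents and report accuracy per segment. The sketch below is a minimal version of that idea. The slice names and the toy records are made up for illustration, not real results.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs.

    Returns a dict mapping each slice name to its accuracy.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


def overall_accuracy(records):
    """Plain accuracy over all records, ignoring slices."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


# Hypothetical eval results, tagged by user segment.
toy_records = [
    ("in_domain_english", True),
    ("in_domain_english", True),
    ("other_language", False),
    ("other_language", True),
    ("out_of_domain", False),
    ("out_of_domain", False),
]

print("overall:", overall_accuracy(toy_records))
for name, acc in accuracy_by_slice(toy_records).items():
    print(f"{name}: {acc:.2f}")
```

If the benchmark is mostly the slice the model is good at and production is mostly the others, the headline number and the production number will diverge even though the model has not changed.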
When scaling works
Scaling works when:
- You have infinite data (you don't)
- Your task is in-distribution (it's not)
- You can afford to keep scaling (you can't forever)
Scaling works for capabilities. It doesn't work for reliability.
The practical path
- Scale to 7B if it fits your hardware budget
- Stop scaling the model size — scale your evals instead
- Focus on data quality, not data quantity
- Add domain-specific fine-tuning after base scaling
Bigger is not better. Bigger with the right data is better. Bigger with the right data and the right evals is production.