Essay · Apr 2025 · 2 min read

Scaling transformers: the hidden tradeoffs

More parameters, more tokens, more GPUs — and where each tradeoff leaves you stranded if you get it wrong.

Scaling is the gospel of modern ML. More parameters, more data, more compute. The scaling laws are real. But the tradeoffs are brutal.

The compute wall

You want a 70B parameter model. You have two options:

Option 1: One GPU

Option 2: Eight A100s

You saved milliseconds, spent $100k.
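To see why a single GPU is a stretch here, it helps to do the arithmetic. The sketch below is a back-of-the-envelope estimate, not a sizing tool. It assumes 16-bit weights (2 bytes per parameter) and the 80 GB A100 variant, and it counts weights only: activations, KV cache, and optimizer state all come on top. The function names are made up for illustration.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes).

    Assumption: 2 bytes per parameter (fp16/bf16). Activations, KV cache,
    and optimizer state are NOT included.
    """
    return n_params * bytes_per_param / 1e9


def weights_fit(n_params: float, n_gpus: int, gpu_mem_gb: float = 80.0) -> bool:
    """True if the weights alone fit in the combined GPU memory.

    Assumption: 80 GB per GPU (the larger A100 variant) and perfect sharding
    with no per-GPU overhead, which is optimistic.
    """
    return weight_memory_gb(n_params) <= n_gpus * gpu_mem_gb


params_70b = 70e9
print(weight_memory_gb(params_70b))       # 140.0 GB of weights
print(weights_fit(params_70b, n_gpus=1))  # False: 140 GB > 80 GB
print(weights_fit(params_70b, n_gpus=8))  # True: 140 GB <= 640 GB
```

Quantizing to fewer bytes per parameter changes the answer, which is why `bytes_per_param` is a parameter rather than a constant.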

You scaled to 7B parameters. Your dataset was 1B tokens. Now what?

Your options:

Repeat data: Your model learns the distribution by heart. Eval stays flat after epoch 1.
Synthetic data: It works until your model exploits it. Your real users see hallucinations.
Get more real data: Months of annotation work. Expensive. Maybe not available.

This is where most projects stall.
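How far short is 1B tokens for a 7B model? A rough way to size the gap is the tokens-per-parameter ratio. The sketch below uses the commonly cited rule of thumb of about 20 training tokens per parameter, usually attributed to the Chinchilla scaling work. That ratio is an assumption brought in for illustration, not a figure from this essay, and the function names are hypothetical.

```python
# Assumption: ~20 tokens per parameter, a widely quoted compute-optimal
# rule of thumb. Treat it as a rough guide, not a law.
TOKENS_PER_PARAM_RULE_OF_THUMB = 20.0


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """How many training tokens you have for each parameter."""
    return n_tokens / n_params


def target_tokens(n_params: float,
                  ratio: float = TOKENS_PER_PARAM_RULE_OF_THUMB) -> float:
    """Dataset size the rule of thumb would suggest for this model size."""
    return ratio * n_params


def epochs_to_reach_target(n_tokens: float, n_params: float) -> float:
    """Passes over the data needed to hit the target by repetition alone."""
    return target_tokens(n_params) / n_tokens


n_params, n_tokens = 7e9, 1e9
print(round(tokens_per_param(n_tokens, n_params), 3))  # 0.143 tokens per parameter
print(target_tokens(n_params) / 1e9)                   # 140.0 (billion tokens)
print(epochs_to_reach_target(n_tokens, n_params))      # 140.0 epochs
```

Under that assumption the dataset is about 140 times too small, which is why repeating it is not a real fix.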

You trained on web-scraped English text. Your users are:

Your benchmark said 92% accuracy. Production is 67%.

Scaling helped you get to 92%. Scaling didn't help you get to production.
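One way to catch this gap before production does is to stop reporting a single accuracy number and break it down by user segment. The sketch below is a minimal illustration of that idea. The slice names and the records are toy placeholders invented for the example, not real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_slice(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each slice.

    Each record is (slice_name, is_correct). Slice names are whatever
    segments matter for your users; the ones below are placeholders.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


# Toy records, invented for illustration only.
toy_records = [
    ("slice_a", True), ("slice_a", True), ("slice_a", True), ("slice_a", False),
    ("slice_b", True), ("slice_b", False),
]

print(accuracy_by_slice(toy_records))  # {'slice_a': 0.75, 'slice_b': 0.5}
```

The aggregate over these toy records is 4 out of 6, which hides that one slice sits at 50%. A benchmark that mostly samples one slice will hide it too.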

Scaling works when:

Scaling works for capabilities. It doesn't work for reliability.

Bigger is not better. Bigger with the right data is better. Bigger with the right data and the right evals is production.
