From notebook to production in 90 days
An honest checklist of everything that breaks between your eval set and your first real user.
Talk at QCon San Francisco, Sep 2025
Opening
You've trained a model. It works great. The eval numbers are good. Your notebooks are reproducible.
Now you're going to production.
Your model scored 87% on the eval set. After your first week in production, it's at 63%. What happened?
(spoiler: everything)
Here's the honest checklist of what breaks, in roughly chronological order.
Week 1-2: The notebook doesn't scale
Batch processing
Your notebook processes one example at a time:
for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
    prediction = model.predict(row)
Production needs throughput:
predictions = model.predict(batch)  # Suddenly matters
Batching breaks your model in ways you didn't notice:
- Attention patterns can change with batch size once inputs are padded to a common length (especially for LLMs with ALiBi or rotary position embeddings)
- Numerical precision differs: batched kernels can accumulate in a different order, and batch norm statistics are computed per batch if the model is left in training mode
- Memory pressure exposes bugs: quiet integer overflows become loud crashes
- Inference time is no longer linear: a small batch isn't 2x slower per item, it can be 20x slower
Mitigation: Profile your model with real batch sizes before launch. Test batches of 32, 64, 128. Watch for performance cliffs.
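A minimal sketch of that profiling loop, using only the standard library. `predict` and `make_batch` are hypothetical stand-ins for your model call and your batch builder, and the 1.5x "cliff" factor is an arbitrary assumption, not a recommendation.

```python
import time
from typing import Callable, Dict, List, Sequence


def profile_batch_sizes(
    predict: Callable[[list], object],
    make_batch: Callable[[int], list],
    batch_sizes: Sequence[int] = (1, 32, 64, 128),
    repeats: int = 5,
) -> Dict[int, float]:
    """Return the best-of-N latency per item, in milliseconds, for each batch size."""
    per_item_ms: Dict[int, float] = {}
    for size in batch_sizes:
        batch = make_batch(size)
        predict(batch)  # one untimed call so warm-up cost is not counted
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            predict(batch)
            best = min(best, time.perf_counter() - start)
        per_item_ms[size] = best * 1000.0 / size
    return per_item_ms


def find_cliffs(per_item_ms: Dict[int, float], factor: float = 1.5) -> List[int]:
    """Batch sizes whose per-item latency is `factor` times worse than the previous size."""
    sizes = sorted(per_item_ms)
    return [
        cur
        for prev, cur in zip(sizes, sizes[1:])
        if per_item_ms[cur] > factor * per_item_ms[prev]
    ]
```

Run it on the target hardware, not your laptop. For a GPU model you would also need to synchronize the device before reading the clock, which this sketch leaves out.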
Latency requirements
Your notebook runs on the GPU in your laptop.
Production needs to serve in 50ms, p99 under 200ms.
Suddenly you care about:
- Model quantization (reducing precision)
- Pruning (removing weights)
- Kernel fusion (reducing memory bandwidth)
- Batching strategy (continuous batching, request coalescing)
Your 87% eval score was on FP32. At INT8, you're at 82%. You didn't test this path.
Mitigation: Profile end-to-end latency on your target hardware with representative load. Measure quantization impact early (week 2, not week 8).
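To make "measure quantization impact" concrete, here is a small sketch that compares two sets of predictions on the same labeled eval set. It assumes you have already run both the FP32 and the INT8 model and collected their outputs; the 2-point `max_drop` budget is a made-up default.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def quantization_report(
    fp32_preds: Sequence[int],
    int8_preds: Sequence[int],
    labels: Sequence[int],
    max_drop: float = 0.02,  # hypothetical accuracy budget
) -> Dict[str, object]:
    """Compare a baseline and a quantized model on the same eval set."""
    base = accuracy(fp32_preds, labels)
    quant = accuracy(int8_preds, labels)
    return {
        "fp32_accuracy": base,
        "int8_accuracy": quant,
        "drop": base - quant,
        # Examples where the two models disagree, worth reading by hand.
        "disagreements": sum(a != b for a, b in zip(fp32_preds, int8_preds)),
        "within_budget": (base - quant) <= max_drop,
    }
```

The disagreement count is often more useful than the headline drop: two models can have similar accuracy and still fail on different examples.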
Week 3-4: Your data doesn't match your eval set
Distribution shift
Your eval set is curated. Your production data is not.
Common shifts:
- Language: Your eval is English; users are multilingual
- Domain: Your eval is news; production is Twitter
- Length: Your eval is avg 50 tokens; production is 500 tokens
- Format: Your eval is well-punctuated; production is chat speak
- Adversarial: Users specifically try to break you
Your 87% on eval becomes 63% on production data.
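One cheap way to catch the length shift above before it costs you 24 points is to compare bucketed length distributions. The sketch below uses the population stability index (PSI), a technique swapped in here for illustration and not something the talk prescribes. The bucket edges are assumptions; a commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift.

```python
import math
from typing import List, Sequence


def bucket_shares(lengths: Sequence[int], edges: Sequence[int]) -> List[float]:
    """Fraction of samples per bucket. `edges` are sorted upper bounds; the last bucket is open-ended."""
    if not lengths:
        raise ValueError("need at least one sample")
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        counts[sum(n > e for e in edges)] += 1  # index = number of edges exceeded
    return [c / len(lengths) for c in counts]


def psi(
    expected_lengths: Sequence[int],
    actual_lengths: Sequence[int],
    edges: Sequence[int] = (32, 64, 128, 256, 512),  # hypothetical token-length buckets
    eps: float = 1e-6,
) -> float:
    """Population stability index between two length distributions."""
    expected = bucket_shares(expected_lengths, edges)
    actual = bucket_shares(actual_lengths, edges)
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = e + eps, a + eps  # smoothing so empty buckets don't divide by zero
        total += (a - e) * math.log(a / e)
    return total
```

The same function works for any bucketed feature, such as detected language or punctuation rate, as long as you map it to bucket indices first.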
The silent failures
Worse than obvious mistakes are silent failures:
- Model hallucinates but sounds confident
- Confidently generates in the wrong language
- Outputs plausible-looking but wrong numbers
- Returns incomplete responses
You ship with a silent bug. Users notice weeks later.
Mitigation:
- Log everything. Collect the first 100k predictions and review 1% manually. Look for patterns in failures.
- Set up automated monitoring. Track:
  - Output length distribution (did it change?)
  - Language detection (are responses drifting to another language?)
  - Latency p50/p95/p99 (is it slowing down?)
  - User feedback signal (ratings, thumbs down, etc.)
- Have a kill switch. If the error rate spikes above a threshold, disable the model and alert.
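The kill switch can be as simple as a rolling error rate with a latch. This is a sketch under assumed numbers (window of 1,000 requests, 5% threshold); what counts as an "error" and how you alert are up to you.

```python
from collections import deque


class KillSwitch:
    """Trips when the error rate over a rolling window exceeds a threshold."""

    def __init__(self, window: int = 1000, max_error_rate: float = 0.05, min_samples: int = 100):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples  # avoid tripping on the first few requests
        self.tripped = False

    def record(self, is_error: bool) -> bool:
        """Record one outcome and return whether the switch is tripped."""
        self.outcomes.append(bool(is_error))
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_error_rate:
                self.tripped = True  # latched: stays tripped until reset() is called
        return self.tripped

    def reset(self) -> None:
        """Manual reset after a human has looked at the alert."""
        self.outcomes.clear()
        self.tripped = False
```

The latch is deliberate: a switch that flips itself back on when the error rate dips will flap during an incident.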
Week 5-6: Your serving infrastructure fails quietly
Cache invalidation
You cache model outputs because latency is tight.
Someone updates the model. The cache isn't cleared.
Users get stale responses for weeks. You don't know why.
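One way to make this failure impossible rather than unlikely is to fold the model version into every cache key, so a new model simply never sees old entries. A minimal in-memory sketch follows; the version string format and the dict-backed store are assumptions, and a real cache would also need eviction.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def cache_key(model_version: str, request: Dict[str, Any]) -> str:
    """Key depends on the model version and a canonical form of the request."""
    payload = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


class VersionedCache:
    """Dict-backed cache whose entries become unreachable when the model version changes."""

    def __init__(self, model_version: str):
        self.model_version = model_version
        self._store: Dict[str, Any] = {}

    def get(self, request: Dict[str, Any]) -> Optional[Any]:
        return self._store.get(cache_key(self.model_version, request))

    def put(self, request: Dict[str, Any], output: Any) -> None:
        self._store[cache_key(self.model_version, request)] = output

    def set_model_version(self, model_version: str) -> None:
        # Old entries are not deleted here; they are just never looked up again.
        self.model_version = model_version
```

With this layout, forgetting to clear the cache on deploy costs you some memory instead of weeks of stale answers.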
GPU memory leaks
Your inference loop has a subtle leak:
import torch

history = []
for request in requests:
    tensor = torch.randn(batch_size, 1024)
    output = model(tensor)  # no torch.no_grad(), so the autograd graph is kept
    history.append(output)  # reference never released: memory leaked
After 10k requests, you've leaked 100GB. The server becomes slow. You blame the model.
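A leak like this shows up as steady growth per request, which you can measure directly. The sketch below uses `tracemalloc` from the standard library; note that it only sees Python-heap allocations, so for GPU memory you would poll something like `torch.cuda.memory_allocated()` instead. `handle_request` is a hypothetical stand-in for your serving function.

```python
import tracemalloc
from typing import Callable, Iterable


def measure_growth(handle_request: Callable[[object], object], requests: Iterable[object], warmup: int = 10) -> float:
    """Average bytes of traced Python memory retained per request, after a warm-up phase."""
    requests = list(requests)
    for request in requests[:warmup]:
        handle_request(request)  # let caches and lazy imports settle first
    measured = requests[warmup:]
    if not measured:
        raise ValueError("need more requests than warm-up calls")
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        for request in measured:
            handle_request(request)
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / len(measured)
```

A healthy handler should hover near zero. Anything that grows linearly with request count deserves an alert, even if each request only leaks a few kilobytes.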
Dependency hell
Your model depends on:
torch==2.0.1
transformers==4.30.0
numpy==1.24.0
In production, someone updates to torch 2.1.0. Model outputs change silently because of numerical differences in matrix multiplication.
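A startup check can refuse to serve when the installed versions differ from the ones you tested. This sketch uses `importlib.metadata` from the standard library; the `pins` dict mirrors the versions listed above, and ignoring the local build suffix (such as `+cu118`) is an assumption you may not want.

```python
from importlib import metadata
from typing import Callable, Dict, List


def find_version_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return a human-readable problem for every package that is missing or off-pin."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (pinned {wanted})")
            continue
        # Drop a local build tag like "+cu118" before comparing (assumption).
        if installed.split("+")[0] != wanted:
            problems.append(f"{name}: installed {installed}, pinned {wanted}")
    return problems


PINS = {"torch": "2.0.1", "transformers": "4.30.0", "numpy": "1.24.0"}
```

Calling `find_version_mismatches(PINS)` at process start and exiting on a non-empty result turns a silent numerical drift into a loud deploy failure.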
Mitigation:
- Pin all dependencies hard. Use exact versions (==) in requirements.txt, not ranges. Test those exact versions.
- Run integration tests before every deploy. Load model, run 100 test inputs, verify outputs are deterministic.
- Measure memory usage per request. Use tracemalloc or GPU profilers. Alert if it drifts.
- Cache invalidation strategy. Version your cache. Invalidate on model update.
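The integration-test bullet above can be sketched as a fingerprint check: run a fixed set of inputs twice, hash the rounded outputs, and optionally compare against a stored "golden" hash from the last known-good build. `predict` is a hypothetical function returning a list of floats, and rounding to 6 digits is an assumed tolerance (values sitting exactly on a rounding boundary can still flip).

```python
import hashlib
import json
from typing import Callable, Optional, Sequence


def output_fingerprint(predict: Callable[[object], Sequence[float]], inputs: Sequence[object], ndigits: int = 6) -> str:
    """SHA-256 over the rounded outputs for a fixed list of inputs."""
    outputs = [[round(float(v), ndigits) for v in predict(x)] for x in inputs]
    return hashlib.sha256(json.dumps(outputs).encode("utf-8")).hexdigest()


def check_deterministic(
    predict: Callable[[object], Sequence[float]],
    inputs: Sequence[object],
    expected_fingerprint: Optional[str] = None,
) -> str:
    """Raise if two runs disagree, or if the result differs from a stored fingerprint."""
    first = output_fingerprint(predict, inputs)
    second = output_fingerprint(predict, inputs)
    if first != second:
        raise AssertionError("outputs differ between two runs on the same inputs")
    if expected_fingerprint is not None and first != expected_fingerprint:
        raise AssertionError("outputs differ from the stored golden fingerprint")
    return first
```

Store the returned fingerprint next to the model artifact. A dependency bump that changes the outputs then fails CI instead of reaching users.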
Week 7-8: Your monitoring lies to you
Survivorship bias in metrics
You log only the predictions where the model was confident. You skip the ones where it said "I don't know".
Your average accuracy looks great. But you're measuring a subset.
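The gap is easy to see once you report both numbers side by side. A small sketch, assuming each logged record is a `(prediction, label)` pair and that `None` marks an abstention:

```python
from typing import Dict, Optional, Sequence, Tuple


def accuracy_views(records: Sequence[Tuple[Optional[int], int]]) -> Dict[str, float]:
    """Accuracy on answered requests versus accuracy over all requests."""
    if not records:
        raise ValueError("need at least one record")
    answered = [(p, y) for p, y in records if p is not None]
    correct = sum(p == y for p, y in answered)
    return {
        "coverage": len(answered) / len(records),
        "accuracy_on_answered": correct / len(answered) if answered else 0.0,
        # Abstentions count as misses here, which is how the user experiences them.
        "accuracy_overall": correct / len(records),
    }
```

If the model answers half the requests and gets 80% of those right, the dashboard says 80% while only 40% of all requests got a correct answer.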
Latency percentiles are misleading
You measure p50 and p95. But what about p99? That's where you fail:
p50: 45ms
p95: 80ms
p99: 3000ms ← One batch got stuck on a cold GPU cache. Oops.
Users in that p99 tail get timeouts and retry. Downstream systems break.
You're measuring the wrong thing
You're optimizing for accuracy. But users care about:
- Latency — is it fast enough?
- Availability — does it work at 3am?
- Cost — are you spending too much on compute?
- Consistency — does it give the same answer twice?
You're optimizing the wrong objective.
Mitigation:
- Define SLOs. "99% of requests return within 100ms". Measure it constantly.
- Instrument all percentiles. Don't just log p95. Log p50, p90, p95, p99, p99.9.
- Separate training metrics from production metrics. Accuracy is fine for training. In production, you care about latency, error rate, and user satisfaction.
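As a sketch of the first two bullets, the function below computes the listed percentiles with the nearest-rank method and checks the example SLO ("99% of requests return within 100ms"). In production you would normally get these from your metrics system; this version just shows the arithmetic on a raw list of latencies.

```python
import math
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile of an already sorted sequence, with q in (0, 100]."""
    rank = math.ceil(q * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def latency_report(latencies_ms: Sequence[float], slo_ms: float = 100.0, slo_target: float = 0.99) -> Dict[str, object]:
    """Percentiles plus SLO compliance for one window of request latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    values = sorted(latencies_ms)
    report: Dict[str, object] = {
        f"p{q:g}": percentile(values, q) for q in (50, 90, 95, 99, 99.9)
    }
    within = sum(v <= slo_ms for v in values) / len(values)
    report["fraction_within_slo"] = within
    report["slo_met"] = within >= slo_target
    return report
```

Compute this per time window rather than over all history, otherwise one bad hour disappears into a month of good data.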
Week 9-10: Users hate your product
The cold start problem
User makes request #1. Model hasn't been called in hours. GPU memory is cleared. Inference takes 2 seconds.
User makes request #2 (same request). Model is warm. Inference takes 50ms.
This inconsistency makes your product feel broken.
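A common way to hide that first-request penalty is to pay it at startup and only report the service as ready afterwards. A minimal sketch, where `predict` and `dummy_input` are hypothetical stand-ins and 10 warm-up calls is an assumed default:

```python
import time
from typing import Callable, List


class WarmedModel:
    """Wraps a predict function and refuses traffic until warm-up has run."""

    def __init__(self, predict: Callable[[object], object]):
        self._predict = predict
        self.ready = False  # expose this through your health check

    def warm_up(self, dummy_input: object, n_requests: int = 10) -> List[float]:
        """Run dummy requests and return each call's latency in milliseconds."""
        timings = []
        for _ in range(n_requests):
            start = time.perf_counter()
            self._predict(dummy_input)
            timings.append((time.perf_counter() - start) * 1000.0)
        self.ready = True
        return timings

    def predict(self, request: object) -> object:
        if not self.ready:
            raise RuntimeError("model is not warmed up yet")
        return self._predict(request)
```

The returned timings are worth logging: if the last warm-up call is still much slower than steady state, 10 calls were not enough.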
Hallucinations are worse than errors
If the model says "I don't know", users will forgive.
If the model confidently says something false, users are furious.
Your eval doesn't measure hallucinations. Your production data is full of them.
Latency tail kills experience
You optimized p95 to 100ms. But p99 is 500ms.
5% of requests are slower than 100ms, and 1% take 500ms or longer. Those users never come back.
Mitigation:
- Warm up your model. On startup, make 10 dummy requests. Pay the cold start cost before real traffic arrives.
- Expose uncertainty. Have the model output a confidence score. Calibrate it on validation data. Filter low-confidence predictions.
- Budget latency aggressively. If you have a 1-second SLA, target p99 under 500ms. Build in buffer for GC pauses, cache misses, etc.
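For the confidence bullet, here is a sketch of the filtering half: choosing a cutoff on validation data so that the predictions you keep reach a target precision. This is threshold selection, not full calibration (which would reshape the scores themselves), and the 95% target is an assumption.

```python
from typing import Optional, Sequence


def pick_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_precision: float = 0.95,
) -> Optional[float]:
    """Lowest confidence cutoff whose kept predictions meet the target precision, or None."""
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    kept = hits = 0
    for i, (conf, ok) in enumerate(pairs):
        kept += 1
        hits += bool(ok)
        # Only evaluate once every prediction sharing this confidence is counted.
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != conf
        if last_of_tie and hits / kept >= target_precision:
            best = conf  # keeping everything with confidence >= conf meets the target
    return best
```

At serving time, answer when the confidence is at or above the returned cutoff and fall back to "I don't know" otherwise. A return value of `None` means no cutoff reaches the target, which is itself useful to know before launch.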
Week 11-12: The real cost
Inference is expensive
You benchmarked cost:
1M inferences on A100: $50
Our prediction: $5,000/month
But then you launched and:
- Traffic was 2x higher than expected
- Your p99 latency requires larger batches (= more GPU hours)
- You're running 3 replicas for failover
- You're caching aggressively, so GPU is underutilized half the time
Real cost: $25,000/month.
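The arithmetic behind those numbers, as a sketch. The $50 per 1M inferences, the $5,000 forecast, and the 2x traffic come from the figures above; the 100M inferences per month is implied by them, and the 2.5x overhead factor (replicas, larger batches, idle GPU time) is back-solved from the $25,000 total rather than measured.

```python
def monthly_cost(
    inferences_per_month: float,
    cost_per_million: float,
    traffic_multiplier: float = 1.0,
    overhead_factor: float = 1.0,
) -> float:
    """Monthly inference spend in dollars under a simple multiplicative model."""
    millions = inferences_per_month / 1_000_000
    return millions * cost_per_million * traffic_multiplier * overhead_factor


# Forecast: 100M inferences at $50 per 1M.
planned = monthly_cost(100_000_000, 50.0)

# After launch: 2x traffic, plus an assumed 2.5x combined overhead.
actual = monthly_cost(100_000_000, 50.0, traffic_multiplier=2.0, overhead_factor=2.5)

# Cost per served inference, using the real traffic volume.
actual_per_inference = actual / (100_000_000 * 2.0)
```

The useful part is separating the two multipliers: traffic growth is usually good news, while the overhead factor is the part you can engineer down.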
Cold hard tradeoffs
You can't have all three:
- High throughput
- Low latency
- Low cost
Pick two.
You need to decide early.
Mitigation:
- Calculate unit economics. If customer LTV is $100, you can't spend $10 per inference.
- Be willing to degrade gracefully. If load spikes, either:
  - Increase latency (tell users "this might take 30s")
  - Reduce quality (use a smaller model)
  - Reject requests (or add them to a queue)
  Pick one. Communicate it.
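"Pick one" works best when the choice lives in code instead of in someone's head during an incident. A sketch of a single overload policy applied at admission time follows; the capacity check, the model names, and the user-facing messages are all hypothetical.

```python
from enum import Enum
from typing import Dict, Optional


class OverloadPolicy(Enum):
    INCREASE_LATENCY = "increase_latency"
    REDUCE_QUALITY = "reduce_quality"
    REJECT = "reject"


def plan_request(in_flight: int, capacity: int, policy: OverloadPolicy) -> Dict[str, Optional[str]]:
    """Decide how to handle one incoming request given current load."""
    if in_flight < capacity:
        return {"action": "serve", "model": "full", "message": None}
    if policy is OverloadPolicy.INCREASE_LATENCY:
        return {"action": "queue", "model": "full", "message": "This might take 30s."}
    if policy is OverloadPolicy.REDUCE_QUALITY:
        return {"action": "serve", "model": "small", "message": "Answered by a faster, smaller model."}
    return {"action": "reject", "model": None, "message": "We're at capacity. Please retry shortly."}
```

Every overloaded branch carries a message on purpose: that is the "communicate it" half of the bullet.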
The Checklist
If you're shipping in 90 days, use this:
Week 1-2: Scale the inference
- Profile with real batch sizes
- Quantize and measure accuracy loss
- Measure end-to-end latency on target hardware
- Set latency budget (p50, p95, p99)
Week 3-4: Test on production-like data
- Collect 1k production examples (if possible)
- Evaluate on production data, not your eval set
- Set up automated data quality monitoring
- Define error rate threshold for kill switch
Week 5-6: Build serving infrastructure
- Pin all dependencies explicitly
- Add integration tests to CI/CD
- Profile memory usage per request
- Plan cache invalidation strategy
Week 7-8: Set up real monitoring
- Instrument all percentiles (p50, p90, p95, p99, p99.9)
- Separate production metrics from training metrics
- Alert on SLO violations
- Log representative sample of predictions for review
Week 9-10: Optimize for users
- Warm up GPU before serving live traffic
- Add confidence/uncertainty to predictions
- Test on real users (beta program, if possible)
- Have a rollback plan
Week 11-12: Monitor cost
- Calculate actual inference cost per prediction
- Compare to revenue (if monetized)
- Plan for traffic growth (10x? 100x?)
- Document degradation strategy
The Hard Truth
Your eval set won't predict production. Your first week in production will be humbling. You'll find bugs you didn't know existed.
But if you follow this checklist, those bugs will be small. You'll catch them, you'll fix them, and in a few months, you'll have a production system that actually works.
The gap between 87% on eval and 63% on production exists for a reason. Close it methodically, or it will haunt you.
Good luck out there.