From notebook to production in 90 days

An honest checklist of everything that breaks between your eval set and your first real user.

Talk at QCon San Francisco, Sep 2025


Opening

You've trained a model. It works great. The eval numbers are good. Your notebooks are reproducible.

Now you're going to production.

Your eval set scored 87%. After your first week in production, you're at 63%. What happened?

(spoiler: everything)

Here's the honest checklist of what breaks, in roughly chronological order.


Week 1-2: The notebook doesn't scale

Batch processing

Your notebook processes one example at a time:

for _, row in df.iterrows():  # iterrows() yields (index, row) pairs
    prediction = model.predict(row)

Production needs throughput:

predictions = model.predict(batch)  # Suddenly matters

Batching breaks your model in ways you didn't notice:

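  • Padding and attention masks start to matter: once variable-length inputs share a batch, a wrong mask or position offset changes outputs (especially for LLMs with ALiBi or rotary embeddings)
  • Numerical results differ: batched kernels can sum in a different order, and batch norm left in training mode computes its statistics per batch
  • Memory pressure exposes bugs: quiet integer overflows become loud crashes
  • Inference time is no longer linear in batch size: per item, a small batch isn't 2x slower than a large one, it's 20x slower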

Mitigation: Profile your model with real batch sizes before launch. Test batches of 32, 64, 128. Watch for performance cliffs.
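
Here's a minimal sketch of that profiling loop, using only the standard library. The names `profile_batch_sizes`, `make_batch`, and `fake_predict` are made up for illustration; swap in your real model call and real inputs.

```python
import time


def profile_batch_sizes(predict_batch, make_batch, batch_sizes=(1, 32, 64, 128), repeats=5):
    """Return {batch_size: best observed seconds per item}."""
    per_item = {}
    for size in batch_sizes:
        batch = make_batch(size)
        predict_batch(batch)  # warm-up call, not timed
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict_batch(batch)
            timings.append(time.perf_counter() - start)
        per_item[size] = min(timings) / size
    return per_item


def fake_predict(batch):
    # Hypothetical stand-in model: fixed overhead plus a small per-item cost.
    time.sleep(0.002 + 0.0001 * len(batch))
    return [0] * len(batch)


results = profile_batch_sizes(fake_predict, lambda n: [0.0] * n)
for size, seconds in results.items():
    print(f"batch={size:>3}  {seconds * 1e3:.3f} ms/item")
```

Plot per-item time against batch size. A cliff shows up as a sudden jump between two neighboring sizes.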

Latency requirements

Your notebook runs on the GPU in your laptop.

Production needs to serve a typical request in 50ms, with p99 under 200ms.

Suddenly you care about:

  • Model quantization (reducing precision)
  • Pruning (removing weights)
  • Kernel fusion (reducing memory bandwidth)
  • Batching strategy (continuous batching, request coalescing)

Your 87% eval score was on FP32. At INT8, you're at 82%. You didn't test this path.

Mitigation: Profile end-to-end latency on your target hardware with representative load. Measure quantization impact early (week 2, not week 8).
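
One way to make "measure quantization impact" concrete is a gate that compares both model variants on the same labeled set. This is a sketch: the function names, the toy predictions, and the 2-point budget (`max_drop`) are all assumptions, not numbers from a real model.

```python
def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def quantization_report(labels, fp32_preds, int8_preds, max_drop=0.02):
    """Compare two sets of predictions against the same labels."""
    fp32_acc = accuracy(fp32_preds, labels)
    int8_acc = accuracy(int8_preds, labels)
    drop = fp32_acc - int8_acc
    return {
        "fp32": fp32_acc,
        "int8": int8_acc,
        "drop": drop,
        "within_budget": drop <= max_drop,  # max_drop is a hypothetical budget
    }


# Toy data for illustration only.
labels     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
fp32_preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
int8_preds = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

report = quantization_report(labels, fp32_preds, int8_preds)
print(report)
```

Run it in week 2 and fail the build when `within_budget` is false. That's cheaper than discovering the drop after launch.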


Week 3-4: Your data doesn't match your eval set

Distribution shift

Your eval set is curated. Your production data is not.

Common shifts:

  • Language: Your eval is English; users are multilingual
  • Domain: Your eval is news; production is Twitter
  • Length: Your eval averages 50 tokens; production averages 500
  • Format: Your eval is well-punctuated; production is chat speak
  • Adversarial: Users specifically try to break you

Your 87% on eval becomes 63% on production data.
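
You can put a number on the length shift before it bites you. The sketch below computes a population stability index (PSI) over token-length buckets. The bucket edges and sample lengths are made up, and the common rule of thumb that a PSI above roughly 0.25 signals a significant shift is a convention to tune, not a law.

```python
import math


def bucket_shares(lengths, edges, eps=1e-6):
    """Share of samples per bucket; edges are upper bounds, last bucket is open-ended."""
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        idx = sum(n > e for e in edges)  # how many edges this value exceeds
        counts[idx] += 1
    total = len(lengths)
    # Floor empty buckets at eps so the logarithm below is defined.
    return [max(c / total, eps) for c in counts]


def psi(expected, actual, edges):
    """Population stability index between two samples (0.0 means identical shares)."""
    e = bucket_shares(expected, edges)
    a = bucket_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical token lengths.
eval_lengths = [40, 50, 60, 45, 55]
prod_lengths = [400, 500, 600, 50, 450]
edges = (64, 128, 256)

print("eval vs eval:", psi(eval_lengths, eval_lengths, edges))
print("eval vs prod:", psi(eval_lengths, prod_lengths, edges))
```

The same function works for any bucketed feature, such as detected language or input format.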

The silent failures

Worse than obvious mistakes are silent failures:

  • Model hallucinates but sounds confident
  • Confidently generates in the wrong language
  • Outputs plausible-looking but wrong numbers
  • Returns incomplete responses

You ship with a silent bug. Users notice weeks later.

Mitigation:

  1. Log everything. Collect first 100k predictions, review 1% manually. Look for patterns in failures.
  2. Set up automated monitoring. Track:
    • Output length distribution (did it change?)
    • Language detection (are responses drifting to another language?)
    • Latency p50/p95/p99 (is it slowing down?)
    • User feedback signal (ratings, thumbs down, etc.)
  3. Have a kill switch. If error rate spikes above threshold, disable and alert.
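
A kill switch doesn't have to be fancy. Here's a minimal sketch that watches a rolling window of outcomes; the class name, window size, and 20% threshold are assumptions you'd tune for your own traffic.

```python
from collections import deque


class KillSwitch:
    """Trips when the error rate over the last `window` requests exceeds a threshold."""

    def __init__(self, window=100, max_error_rate=0.2, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.tripped = False

    def record(self, is_error):
        self.outcomes.append(bool(is_error))
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_error_rate:
                self.tripped = True  # stays tripped until someone resets it
        return self.tripped


switch = KillSwitch()
for _ in range(50):
    switch.record(False)
healthy = switch.tripped

for _ in range(30):
    switch.record(True)

print("tripped before the error spike:", healthy)
print("tripped after the error spike:", switch.tripped)
```

The switch latches on purpose: a human should look at the alert before traffic goes back to the model.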

Week 5-6: Your serving infrastructure fails quietly

Cache invalidation

You cache model outputs because latency is tight.

Someone updates the model. The cache isn't cleared.

Users get stale responses for weeks. You don't know why.
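
One cheap defense is to put the model version into every cache key, so a new model simply never sees the old entries. This is a sketch with a plain dict standing in for the real cache; `ResponseCache` and the version strings are hypothetical.

```python
import hashlib


def cache_key(model_version, prompt):
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"{model_version}:{digest}"


class ResponseCache:
    """Cache whose keys are namespaced by model version."""

    def __init__(self, store, model_version):
        self.store = store  # any dict-like backend
        self.model_version = model_version

    def get(self, prompt):
        return self.store.get(cache_key(self.model_version, prompt))

    def put(self, prompt, response):
        self.store[cache_key(self.model_version, prompt)] = response


shared_store = {}
v1 = ResponseCache(shared_store, "model-v1")
v1.put("hello", "hi from v1")

# After a deploy, the new version reads the same store but misses old entries.
v2 = ResponseCache(shared_store, "model-v2")
print(v1.get("hello"))
print(v2.get("hello"))
```

Old entries still take up space, so pair this with a TTL or a cleanup job.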

GPU memory leaks

Your inference loop has a subtle leak:

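import torch

history = []  # debugging aid someone forgot to remove
for request in requests:
    tensor = torch.randn(batch_size, 1024, device="cuda")
    output = model(tensor)  # no torch.no_grad(), so the autograd graph is kept
    history.append(output)  # every output (and its graph) stays on the GPU forever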

After 10k requests, you've leaked 100GB. The server becomes slow. You blame the model.
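
You can catch this class of bug before it ships by measuring memory growth per request. The sketch below uses `tracemalloc`, which only sees Python-level allocations on the host; GPU memory needs your framework's own counters. The handlers are hypothetical stand-ins for a real request path.

```python
import tracemalloc


def bytes_leaked_per_request(handler, n_requests=200):
    """Average growth in traced Python memory per call to `handler`."""
    tracemalloc.start()
    try:
        handler()  # warm-up so one-time allocations are not counted
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(n_requests):
            handler()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / n_requests


_history = []


def leaky_handler():
    _history.append(bytearray(10_000))  # reference kept forever


def clean_handler():
    bytearray(10_000)  # released as soon as the call returns


leaky = bytes_leaked_per_request(leaky_handler)
clean = bytes_leaked_per_request(clean_handler)
print(f"leaky: {leaky:.0f} bytes/request, clean: {clean:.0f} bytes/request")
```

A number that stays near zero is what you want. A steady positive number is a leak, however small it looks per request.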

Dependency hell

Your model depends on:

torch==2.0.1
transformers==4.30.0
numpy==1.24.0

In production, someone updates to torch 2.1.0. Model outputs change silently because of numerical differences in matrix multiplication.
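
A startup check can refuse to serve when the environment has drifted from the pins. This sketch uses `importlib.metadata.version` from the standard library. The `find_version_mismatches` helper and the simulated environment are illustrative; the lookup is injectable so the example runs without the real packages installed.

```python
from importlib import metadata

PINS = {"torch": "2.0.1", "transformers": "4.30.0", "numpy": "1.24.0"}


def find_version_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package off its pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Simulated environment where someone upgraded torch.
fake_env = {"torch": "2.1.0", "transformers": "4.30.0", "numpy": "1.24.0"}
drift = find_version_mismatches(PINS, get_version=fake_env.__getitem__)
print(drift)
```

In a real service you'd call `find_version_mismatches(PINS)` with the default lookup and fail fast if the result isn't empty.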

Mitigation:

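  1. Pin all dependencies hard. Use exact versions (==) in requirements.txt, not ranges. Test against those exact versions.
  2. Run integration tests before every deploy. Load the model, run 100 test inputs, and verify the outputs are deterministic and match stored reference outputs within a small tolerance.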
  3. Measure memory usage per request. Use tracemalloc or GPU profilers. Alert if it drifts.
  4. Cache invalidation strategy. Version your cache. Invalidate on model update.
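
The integration test in step 2 can be as small as a comparison against stored reference ("golden") outputs. Here's a sketch; the tolerance values, `stand_in_model`, and the golden cases are assumptions for illustration.

```python
import math


def outputs_match(actual, expected, rel_tol=1e-5, abs_tol=1e-8):
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


def check_against_golden(predict, golden_cases):
    """Return the inputs whose output drifted from the stored reference."""
    return [x for x, expected in golden_cases if not outputs_match(predict(x), expected)]


def stand_in_model(x):
    # Hypothetical model: replace with loading and calling the real one.
    return [2.0 * x, x + 1.0]


def drifted_model(x):
    # Same model after a "harmless" upgrade that nudged the numbers.
    return [2.0 * x + 0.01, x + 1.0]


golden = [(1.0, [2.0, 2.0]), (3.0, [6.0, 4.0])]
print("failures before upgrade:", check_against_golden(stand_in_model, golden))
print("failures after upgrade:", check_against_golden(drifted_model, golden))
```

Record the golden outputs with the exact pinned environment, and regenerate them only on purpose.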

Week 7-8: Your monitoring lies to you

Survivorship bias in metrics

You log "predictions where model was confident". You skip ones where it said "I don't know".

Your average accuracy looks great. But you're measuring a subset.
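
A quick worked example shows how big the gap can get. The traffic mix below is hypothetical: 60 confident answers, 54 of them correct, plus 40 abstentions.

```python
def accuracy_report(records):
    """records: list of (answered, correct) pairs; abstentions have answered=False."""
    answered = [correct for was_answered, correct in records if was_answered]
    return {
        "coverage": len(answered) / len(records),
        "accuracy_on_answered": sum(answered) / len(answered) if answered else 0.0,
        "accuracy_on_all": sum(answered) / len(records),
    }


# Hypothetical traffic for illustration.
records = [(True, True)] * 54 + [(True, False)] * 6 + [(False, False)] * 40
report = accuracy_report(records)
print(report)
```

The dashboard says 90%. The user who asked 100 questions got 54 useful answers. Report coverage next to accuracy, always.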

Latency percentiles are misleading

You measure p50 and p95. But what about p99? That's where you fail:

p50: 45ms
p95: 80ms
p99: 3000ms  ← One batch got stuck on a cold GPU cache. Oops.

Users in that p99 tail get timeouts and retry. Downstream systems break.
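
Computing the tail takes a few lines. This sketch uses the nearest-rank definition of a percentile, which is one of several in common use. The latency samples are made up to reproduce the numbers above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 normal requests and 2 that hit a stall.
latencies_ms = [45] * 60 + [80] * 38 + [3000] * 2

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

Two bad requests out of a hundred are invisible at p95 and dominate p99.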

You're measuring the wrong thing

You're optimizing for accuracy. But users care about:

  • Latency — is it fast enough?
  • Availability — does it work at 3am?
  • Cost — are you spending too much on compute?
  • Consistency — does it give the same answer twice?

You're optimizing the wrong objective.

Mitigation:

  1. Define SLOs. "99% of requests return within 100ms". Measure it constantly.
  2. Instrument all percentiles. Don't just log p95. Log p50, p90, p95, p99, p99.9.
  3. Separate training metrics from production metrics. Accuracy is fine for training. In production, you care about latency, error rate, and user satisfaction.

Week 9-10: Users hate your product

The cold start problem

User makes request #1. Model hasn't been called in hours. GPU memory is cleared. Inference takes 2 seconds.

User makes request #2 (same request). Model is warm. Inference takes 50ms.

This inconsistency makes your product feel broken.
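
The usual fix is to pay the cold-start cost yourself, before the server reports ready. Here's a minimal sketch. `WarmServer`, its `predict` callable, and the dummy input are hypothetical; ten warm-up calls is a starting point to tune.

```python
class WarmServer:
    """Refuses live traffic until the model has been exercised a few times."""

    def __init__(self, predict, warmup_requests=10):
        self.predict = predict
        self.warmup_requests = warmup_requests
        self.ready = False

    def warm_up(self, dummy_input):
        for _ in range(self.warmup_requests):
            self.predict(dummy_input)  # results are discarded on purpose
        self.ready = True

    def handle(self, request):
        if not self.ready:
            raise RuntimeError("server is not warmed up yet")
        return self.predict(request)


calls = []
server = WarmServer(predict=lambda x: calls.append(x) or "ok")
server.warm_up(dummy_input="ping")
response = server.handle("real request")
print(len(calls), response)
```

Wire `ready` into your health check so the load balancer sends nothing until warm-up is done.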

Hallucinations are worse than errors

If the model says "I don't know", users will forgive.

If the model confidently says something false, users are furious.

Your eval doesn't measure hallucinations. Your production data is full of them.
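
If your model exposes a confidence score, you can trade coverage for trust by abstaining below a threshold. The sketch below picks that threshold on validation data. It's plain threshold selection, not full calibration, and the function names, data, and 90% target are assumptions.

```python
def pick_threshold(validation, target_accuracy=0.9):
    """validation: list of (confidence, correct). Lowest threshold meeting the target, or None."""
    for threshold in sorted({conf for conf, _ in validation}):
        kept = [correct for conf, correct in validation if conf >= threshold]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return threshold
    return None


def answer_or_abstain(prediction, confidence, threshold, fallback="I don't know"):
    return prediction if confidence >= threshold else fallback


# Hypothetical validation results: (model confidence, was the answer correct?)
validation = [(0.95, True), (0.9, True), (0.85, True), (0.7, False), (0.6, True), (0.5, False)]

threshold = pick_threshold(validation)
print("threshold:", threshold)
print(answer_or_abstain("Paris", 0.9, threshold))
print(answer_or_abstain("Lyon", 0.6, threshold))
```

Re-pick the threshold whenever the model or the traffic changes; confidence scores drift too.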

Latency tail kills experience

You optimized p95 to 100ms. But p99 is 500ms.

1% of requests get the slow experience, and a user who sends enough requests will eventually land in that tail. Those users never come back.
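
The tail hurts more than its size suggests, because users make more than one request. A back-of-the-envelope sketch, assuming each request lands in the slow tail independently (real traffic is burstier than that):

```python
def chance_of_hitting_tail(tail_fraction, n_requests):
    """Probability that at least one of n independent requests lands in the slow tail."""
    return 1 - (1 - tail_fraction) ** n_requests


for n in (1, 10, 100):
    print(f"{n:>3} requests: {chance_of_hitting_tail(0.01, n):.1%} chance of a slow one")
```

With a 1% tail, a user who makes 100 requests has roughly a 63% chance of hitting at least one slow response.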

Mitigation:

  1. Warm up your model. On startup, make 10 dummy requests. Pay the cold start cost before real traffic arrives.
  2. Add uncertainty. Have the model output a confidence score. Calibrate it on validation data. Filter low-confidence predictions.
  3. Budget latency aggressively. If you have a 1-second SLA, target p99 under 500ms. Build in buffer for GC pauses, cache misses, etc.

Week 11-12: The real cost

Inference is expensive

You benchmarked cost:

1M inferences on A100: $50
Our prediction: $5,000/month

But then you launched and:

  • Traffic was 2x higher than expected
  • Your p99 latency target forces smaller batches (= lower GPU utilization, more GPU hours)
  • You're running 3 replicas for failover
  • You're caching aggressively, so GPU is underutilized half the time

Real cost: $25,000/month.
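
It helps to write the cost model down so the multipliers are explicit. At $50 per 1M inferences, a $5,000/month forecast implies about 100M inferences a month. The sketch below starts from that baseline. The `replicas` and `utilization` parameters and the stressed scenario are illustrative assumptions, not an attempt to reproduce the $25,000 figure.

```python
def monthly_cost(inferences_per_month, cost_per_million, replicas=1, utilization=1.0):
    """Rough monthly spend: ideal compute cost, scaled up by replication and idle capacity."""
    compute = inferences_per_month / 1_000_000 * cost_per_million
    return compute * replicas / utilization


baseline = monthly_cost(100_000_000, 50)
# Hypothetical scenario: traffic doubles and GPUs sit idle half the time.
stressed = monthly_cost(200_000_000, 50, utilization=0.5)

print(f"baseline: ${baseline:,.0f}/month")
print(f"stressed: ${stressed:,.0f}/month")
```

Every multiplier compounds. Two surprises of 2x each already put you at 4x the forecast.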

Cold hard tradeoffs

You can't have all of these at once:

  • High throughput + low latency + low cost

Pick two.

You need to decide early.

Mitigation:

  1. Calculate unit economics. If customer LTV is $100, you can't spend $10 per inference.
  2. Be willing to degrade gracefully. If load spikes, either:
    • Increase latency (tell users "this might take 30s")
    • Reduce quality (use smaller model)
    • Reject requests (or add them to a queue)

     Pick one. Communicate it.
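
Writing the policy down as code forces the decision. In this sketch the degradation mode is chosen once, up front, and every overloaded request gets the same treatment and the same message. The enum, the messages, and the simple load-versus-capacity check are all assumptions.

```python
from enum import Enum


class Degradation(Enum):
    SLOWER = "slower"
    SMALLER_MODEL = "smaller_model"
    REJECT = "reject"


# Hypothetical user-facing messages, one per policy.
USER_MESSAGES = {
    Degradation.SLOWER: "This might take up to 30 seconds.",
    Degradation.SMALLER_MODEL: "We're busy, so answers may be less detailed.",
    Degradation.REJECT: "We're at capacity. Please try again shortly.",
}


def plan_response(current_load, capacity, policy):
    """Return (action, user_message) for one incoming request."""
    if current_load <= capacity:
        return ("serve", None)
    return (policy.value, USER_MESSAGES[policy])


print(plan_response(80, 100, Degradation.REJECT))
print(plan_response(150, 100, Degradation.SLOWER))
```

The message matters as much as the action: users tolerate a stated wait far better than a silent one.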

The Checklist

If you're shipping in 90 days, use this:

Week 1-2: Scale the inference

  • Profile with real batch sizes
  • Quantize and measure accuracy loss
  • Measure end-to-end latency on target hardware
  • Set latency budget (p50, p95, p99)

Week 3-4: Test on production-like data

  • Collect 1k production examples (if possible)
  • Evaluate on production data, not your eval set
  • Set up automated data quality monitoring
  • Define error rate threshold for kill switch

Week 5-6: Build serving infrastructure

  • Pin all dependencies explicitly
  • Add integration tests to CI/CD
  • Profile memory usage per request
  • Plan cache invalidation strategy

Week 7-8: Set up real monitoring

  • Instrument all percentiles (p50, p90, p95, p99, p99.9)
  • Separate production metrics from training metrics
  • Alert on SLO violations
  • Log representative sample of predictions for review

Week 9-10: Optimize for users

  • Warm up GPU before serving live traffic
  • Add confidence/uncertainty to predictions
  • Test on real users (beta program, if possible)
  • Have a rollback plan

Week 11-12: Monitor cost

  • Calculate actual inference cost per prediction
  • Compare to revenue (if monetized)
  • Plan for traffic growth (10x? 100x?)
  • Document degradation strategy

The Hard Truth

Your eval set won't predict production. Your first week in production will be humbling. You'll find bugs you didn't know existed.

But if you follow this checklist, those bugs will be small. You'll catch them, you'll fix them, and in a few months, you'll have a production system that actually works.

The gap between 87% on eval and 63% on production exists for a reason. Close it methodically, or it will haunt you.

Good luck out there.
