Writing Archive

Essays, papers, talks, and notes. A complete archive of thinking from the trenches.

State machines, tool nodes, and error recovery — everything the tutorials skip before you ship your first real agent.

How to split complex tasks across specialised agents without losing control of state, context, or your sanity.

Naive vector search breaks silently. Corrective RAG loops catch the failure and rewrite the query before the user notices.

Three memory tiers — in-thread, cross-thread, and external store — and exactly when each one becomes the right tool.

Why most LLM evals quietly rot — and the small shifts that keep them honest as your system evolves.

More parameters, more tokens, more GPUs — and where each tradeoff leaves you stranded if you get it wrong.

Generating more data looks like progress until your model learns to game your synthetic distribution.

Hallucinated arguments, wrong tool selection, timeouts, schema drift — a field guide to keeping function-calling agents honest.

How to pause an agent mid-graph, hand control to a human, and resume cleanly — without losing the thread.

A walkthrough with diagrams — what gets shared between requests, what doesn't, and where it goes wrong.

Why semantic search is necessary but insufficient, and what you're actually debugging when your RAG fails.

We trained a calibration head on 1.4M human pairwise comparisons; FID becomes optional after that.

An honest checklist of everything that breaks between your eval set and your first real user.