By AI Blog Editor
Apr 17, 2026 · 1 min read
RAG on two axes: size × freshness
Place your project on the grid and the right technique is usually one of five. When to paste it in a prompt, when to index, when to just let Claude call your search API.
Retrieval-augmented generation is often treated as a single architecture — chunk, embed, index, search, prompt — when it's really a family of techniques whose fit depends on your data's shape. Two axes do most of the disambiguation: how big the corpus is, and how fresh it needs to be.
Place your project on those two axes and the right answer is usually one of five techniques. Get an axis wrong and you end up either over-engineering (a full RAG stack for a 40k-token document that could've fit in a prompt) or under-engineering (pasting a 10 MB corpus into every turn and wondering why you hit rate limits).
Axis one: size
Small means the whole corpus fits comfortably in a prompt — tens of thousands of tokens, maybe a few hundred thousand with a long-context model. Don't index it. Stuff it in the system prompt, cache the prefix, call it done.
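For the small case, "stuff it in the prompt and cache the prefix" can be sketched as building a request with the whole corpus in a cacheable system block. This is a minimal sketch assuming the Anthropic Messages API's prompt-caching shape; the model name and corpus text are placeholders.

```python
# Sketch: ship a small corpus as a cached system-prompt prefix.
# The model name and corpus are placeholders, not recommendations.

def build_request(corpus_text: str, question: str) -> dict:
    """Build a Messages API payload with the whole corpus as a cacheable prefix."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": "Answer using only the reference material below.\n\n" + corpus_text,
                # Mark the large, stable prefix as cacheable so repeat calls
                # don't re-process the whole corpus every turn.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

req = build_request("…your 40k-token manual…", "How do I reset the device?")
```

The point of the shape: the expensive part (the corpus) sits first and never changes, so the cache hits on every turn; only the short user question varies.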
Medium is a few million tokens — too much for every turn, but small enough that you can afford a real index. Standard RAG territory: embeddings, a vector store, a retrieval step, a reranker if you're fancy.
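The medium path reduces to: embed chunks once, retrieve top-k by similarity at query time. A toy sketch, with a bag-of-words stand-in for a real embedding model:

```python
# Minimal sketch of the medium-corpus path: embed once, retrieve top-k by
# cosine similarity. embed() is a toy bag-of-words stand-in; in practice
# you'd call a real embedding model and a real vector store.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyIndex:
    def __init__(self):
        self.chunks: list[tuple[str, Counter]] = []

    def add(self, chunk: str):
        self.chunks.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

index = TinyIndex()
index.add("Refund requests are handled within 30 days.")
index.add("The API rate limit is 100 requests per minute.")
index.search("what is the rate limit", k=1)
# → ["The API rate limit is 100 requests per minute."]
```

A reranker, if you add one, slots in between `search` and the prompt: retrieve a generous k, then re-score those candidates with a stronger model.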
Large is everything past that. You need a proper retrieval stack: chunking strategy, hybrid search (dense + BM25), versioned indexes, incremental ingest, and the operational muscle to run it.
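One common way to combine the dense and BM25 legs of a hybrid search is reciprocal rank fusion (RRF): each list votes 1 / (k + rank) for its documents, and the votes are summed. A sketch, assuming you already have the two ranked lists:

```python
# Reciprocal rank fusion over two ranked result lists (dense + BM25).
# k=60 is the constant conventionally used in the RRF literature.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]   # ranked by embedding similarity
sparse = ["doc1", "doc3", "doc9"]  # ranked by BM25
rrf([dense, sparse])
# doc1 and doc3 appear near the top of both lists, so they lead the fusion
```

Rank-based fusion sidesteps the awkward problem of putting cosine scores and BM25 scores on a common scale.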
Axis two: freshness
Static. The corpus rarely changes. A product manual, a coding standard, a catalogue of legal templates. Index once, reindex rarely.
Daily. Updates come in batches — a nightly ETL, a scheduled scrape. Reindex on a cron; plan for idempotent ingest so reruns are safe.
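Idempotent ingest usually comes down to keying every document by a stable id and skipping unchanged content via a hash, so a rerun of the nightly job is a no-op. A sketch under those assumptions:

```python
# Sketch of idempotent ingest: key each doc by a stable id, skip re-embedding
# when the content hash is unchanged, so reruns of the batch are safe.
import hashlib

class Store:
    def __init__(self):
        self.rows: dict[str, dict] = {}  # doc_id -> {"hash": ..., "text": ...}
        self.embedded = 0  # counts (re)embeddings, to show reruns are cheap

    def upsert(self, doc_id: str, text: str):
        digest = hashlib.sha256(text.encode()).hexdigest()
        row = self.rows.get(doc_id)
        if row and row["hash"] == digest:
            return  # unchanged: rerunning the batch does nothing
        self.rows[doc_id] = {"hash": digest, "text": text}
        self.embedded += 1  # stand-in for the expensive embed-and-index step

store = Store()
for _ in range(2):  # simulate the cron job running twice
    store.upsert("faq-1", "Refunds take 30 days.")
store.embedded
# → 1  (the rerun was a no-op)
```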
Live. The world changes minute-to-minute. Stock prices, open tickets, active chats. Pre-indexing loses to live queries almost immediately — by the time the index finishes, the answer has already changed.
The grid
Cross the two axes and you get a 3 × 3 grid: small, medium, or large on one side, static, daily, or live on the other. Pick your cell. One worked example:

Size: Medium (a few million tokens) · Freshness: Daily (batch updates)

Recommendation: retrieval with scheduled reindex. RAG with a nightly reindex job; make sure your ingest pipeline is idempotent, because reruns happen.
Retrieval-as-tool-use
One of the more useful shifts of the last couple of years: instead of pre-indexing everything for Claude, expose your search capability as a tool and let the model call it. For large × live this often beats a vector store by miles — the model decides what to look up, you hit your real search system, and you skip the whole "we have to rebuild the index" class of bug.
It's also composable. A single assistant can mix tool-based search over a CRM with pre-indexed retrieval over a product manual. The model picks the right surface for the question.
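The tool-use pattern can be sketched as a schema plus a dispatcher. The schema below follows the Anthropic tool-use shape; search_tickets and its backend are hypothetical stand-ins for your real search API.

```python
# Sketch: expose live search as a tool instead of pre-indexing.
# search_tickets and live_ticket_search are hypothetical placeholders.

search_tool = {
    "name": "search_tickets",
    "description": "Full-text search over currently open support tickets.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search keywords"},
        },
        "required": ["query"],
    },
}

def live_ticket_search(query: str) -> str:  # placeholder backend
    tickets = ["#101 login broken", "#102 refund delayed"]
    hits = [t for t in tickets if query.lower() in t.lower()]
    return "\n".join(hits) or "no matches"

def handle_tool_call(name: str, tool_input: dict) -> str:
    """Dispatch a model-issued tool call to the live system."""
    if name == "search_tickets":
        # Hit your real search backend here; no index to rebuild.
        return live_ticket_search(tool_input["query"])
    raise ValueError(f"unknown tool: {name}")

handle_tool_call("search_tickets", {"query": "refund"})
# → "#102 refund delayed"
```

The composability falls out of the dispatcher: add a second tool for the product-manual index and the model chooses which to call per question.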
A short list of traps
Chunking too small. 200-token chunks match queries precisely but carry too little context to be useful on their own. 800–1,500 tokens is a better default for most text.
Ignoring metadata filters. If every chunk carries a source, a section, and an updated_at timestamp, use those filters aggressively. Filtering before retrieval beats semantic magic.
Evaluating retrieval with your nose. Retrieval quality is measurable. Tag 50 queries with the chunks that should come back, then monitor recall@k whenever the index or embedding model changes.
Forgetting the prompt. Great retrieval with a bad prompt produces confident, well-cited nonsense. Treat the generation prompt as part of the retrieval system, not a separate concern.
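The "evaluating with your nose" trap has a cheap fix: a small labeled set and a recall@k number. A sketch, assuming retrieve(query, k) returns ranked chunk ids:

```python
# recall@k over a hand-labeled query set: for each query, what fraction of
# the chunks that *should* come back actually appear in the top-k?

def recall_at_k(labeled: dict[str, set[str]], retrieve, k: int = 5) -> float:
    """Average, over queries, of |retrieved ∩ relevant| / |relevant|."""
    total = 0.0
    for query, relevant in labeled.items():
        got = set(retrieve(query, k))
        total += len(got & relevant) / len(relevant)
    return total / len(labeled)

# Toy example with a fake retriever:
labeled = {"reset password": {"c1"}, "billing cycle": {"c2", "c3"}}
fake = {"reset password": ["c1", "c9"], "billing cycle": ["c2", "c9"]}
recall_at_k(labeled, lambda q, k: fake[q][:k], k=2)
# → 0.75  (c1 found; only one of c2/c3 found)
```

Run it on every index rebuild and embedding-model swap, and a regression shows up as a number instead of a vibe.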
One-line version
Size picks the technique; freshness picks the refresh strategy. Get both right and the rest of the system falls into place.
* * *
Thanks for reading. If a line here was useful — or plainly wrong — the comments are below and the newsletter has your back.