RAG, explained without the buzzwords.
Everyone is "doing RAG". Half of them are doing it badly because nobody told them what it actually is. The whole idea fits on one page.
RAG stands for retrieval-augmented generation, which is the kind of phrase only someone trying to sell you a course would invent. The idea is much simpler than the name. So is the system.
You have an LLM. The LLM knows roughly what was on the public internet up to its training cutoff, and nothing about your specific documents. You want to ask it questions about your specific documents. RAG is the trick that bridges those two facts.
The whole idea, in three steps
- Take your documents. Chop them into chunks of 500-1000 words. Save each chunk along with a "fingerprint" of its meaning (an embedding vector).
- When the user asks a question, fingerprint the question the same way. Find the 3-5 chunks whose fingerprints are closest to the question's fingerprint.
- Stick those chunks at the top of the prompt as "context". Then ask the LLM the question.
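The three steps above can be sketched in a few dozen lines. This is a toy, not a reference implementation: `embed` here is just a bag of word counts standing in for a real embedding model, and every function name is made up for illustration. The shape of the pipeline is the point.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'fingerprint': lowercase word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 500) -> list[str]:
    """Step 1a: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents: list[str], size: int = 500) -> list[tuple[str, Counter]]:
    """Step 1b: save each chunk next to its fingerprint."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, size)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 3) -> list[str]:
    """Step 2: fingerprint the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks: list[str], question: str) -> str:
    """Step 3: context on top, question at the bottom."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Made-up sample documents, short enough that each one is a single chunk.
docs = [
    "Unused PTO days carry over to the next year up to a fixed cap.",
    "Expense reports are due by the fifth business day of each month.",
    "The office is closed on public holidays.",
]
question = "How many unused PTO days carry over?"
index = build_index(docs)
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)
```

In a real system you would swap `embed` for an embedding model and the list-based index for a vector store, but the three functions keep the same jobs.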
That's it. That's the whole thing. Everything else is optimization.
RAG is an open-book test for an LLM. Instead of asking the model what it remembers, you slip the right page of the book into its hand right before the question. The model still does the reading and reasoning; you just made sure it was looking at the right page.
Why it works
LLMs are amazing at "given this text, do something useful with it". They are not amazing at "remember every fact you've ever read". RAG plays to the strength and works around the weakness. You don't ask the model to remember your invoices: you find the relevant invoice, paste it in, and ask the question.
The model is a very smart reader. RAG turns it into a very smart reader of the right document.
Where it fails
Three classic failure modes, in order of how often they happen.
Bad chunks. If your chunks are too big, the relevant section gets lost in noise. If they're too small, the relevant idea gets split across chunks and neither one alone is useful. 500-1000 words is the safe starting point. Tune from there.
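To make "tune from there" concrete, here is a minimal word-based chunker with a size knob. The `overlap` parameter is an addition of mine, not something the recipe above requires: repeating a few words between neighbouring chunks is one common way to soften the "idea split across chunks" problem. Treat the function name and defaults as assumptions.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `size` words.

    `overlap` (an optional extra, not part of the basic recipe) repeats that many
    words from the end of one chunk at the start of the next.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


# Ten made-up words, so the chunk boundaries are easy to see.
sample = " ".join(f"w{i}" for i in range(10))
plain = chunk_words(sample, size=4)
overlapped = chunk_words(sample, size=4, overlap=2)
```

Here `plain` has three chunks with no shared words, and `overlapped` has four chunks where each one starts with the last two words of the previous one. Rerun your test questions after each change to `size`, because the right value depends on your documents.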
Bad retrieval. Fingerprint similarity is not the same as "this answers the question". A chunk about "cats" might be the closest match for a question about "kittens" and still not answer it, while the chunk about "the registration deadline" that does answer "when do I have to register?" might rank below chunks that merely share the question's topic. Most failures here come from using off-the-shelf embeddings on documents whose vocabulary doesn't match user questions. Mitigate by adding keyword search alongside embedding search (this is "hybrid retrieval", which sounds fancier than it is).
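Hybrid retrieval can be as plain as blending two scores. The sketch below assumes you already have an embedding similarity per chunk (the `vector_scores` list, values invented for the example) and mixes it with a crude keyword-overlap score. Real systems usually use something like BM25 for the keyword side; the overlap fraction, the `alpha` weight, and the function names here are illustrative assumptions.

```python
import re


def keyword_score(question: str, chunk: str) -> float:
    """Fraction of the question's distinct words that also appear in the chunk."""
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_terms = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def hybrid_rank(question, chunks, vector_scores, alpha=0.5, k=3):
    """Blend embedding similarity with keyword overlap and return the top k chunks.

    vector_scores[i] is assumed to be the embedding similarity of chunks[i],
    computed elsewhere. alpha weights the embedding side, 1 - alpha the keywords.
    """
    scored = [
        (alpha * v + (1 - alpha) * keyword_score(question, c), c)
        for c, v in zip(chunks, vector_scores)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


chunks = [
    "Error E4012 means the license key has expired.",
    "Licensing problems are usually resolved by contacting support.",
]
# Hypothetical embedding scores: the second chunk "sounds" closer to the question.
vector_scores = [0.40, 0.55]
question = "What does E4012 mean?"

embedding_only = hybrid_rank(question, chunks, vector_scores, alpha=1.0, k=1)
hybrid = hybrid_rank(question, chunks, vector_scores, alpha=0.5, k=1)
```

With `alpha=1.0` the vaguer chunk wins on similarity alone. With `alpha=0.5` the exact match on "E4012" pulls the right chunk to the top, which is the case keyword search is good at.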
Bad prompting. Once the chunks are in the prompt, the LLM still has to do the reading. If you don't say "answer using only the provided context" (or worse, you also leave an older, general-purpose system prompt in place), the model will happily make things up. Be explicit. Tell it what to do if the answer isn't in the chunks ("say 'I don't have that information'").
Plain question: "What does our company's PTO policy say about carrying over unused days?"
Model invents an answer that sounds reasonable and is completely wrong.
RAG question: "Using only the policy text below, answer: what does our PTO policy say about carrying over unused days?"
[3 paragraphs from the actual handbook]
Model quotes the policy with the specific number. If the answer isn't there, it says so.
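The "be explicit" advice is easy to bake into a small prompt builder. This is one possible layout, not a required one: the instruction wording is lifted from the text above, while the function name, the numbered chunk labels, and the exact ordering are assumptions of this sketch.

```python
FALLBACK = "I don't have that information"


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that limits the model to the retrieved chunks.

    Chunks are numbered so an answer can point back at the passage it used.
    """
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the provided context.\n"
        f"If the answer is not in the context, say '{FALLBACK}'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder chunk text, standing in for paragraphs pulled from a handbook.
prompt = build_grounded_prompt(
    "What does our PTO policy say about carrying over unused days?",
    ["(first retrieved handbook paragraph)", "(second retrieved handbook paragraph)"],
)
```

Whatever system prompt you send with this should not contradict it. If one message says "use only the context" and another says "be as helpful as possible from your own knowledge", you are back to guessing which one the model follows.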
When you don't need RAG
This is the question more people should ask. RAG is overhead. If you only have 5 documents and they fit in a 200,000-token context window (which Claude has), just paste them all in. No vector database. No chunking. The model can read all of them. RAG is for when your documents are too big to fit and you have to be selective.
For most teams, "too big to fit" doesn't kick in until you have hundreds of documents. Until then, the simpler approach beats the fancier one.
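A quick way to decide is to estimate whether everything fits before building anything. The sketch below uses the rough rule of thumb of about four characters per token for English text. That ratio is an approximation, not a real tokenizer, and the `reserve` budget for the question and the answer is a number I picked for illustration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumes about 4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(documents: list[str], context_limit: int = 200_000,
                    reserve: int = 4_000) -> bool:
    """True if all documents, plus a reserve for question and answer, fit the window."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve <= context_limit


# Synthetic documents, sized in characters to make the arithmetic obvious.
small_docs = ["x" * 4_000] * 5      # about 1,000 estimated tokens each
large_docs = ["x" * 1_000_000]      # about 250,000 estimated tokens

paste_everything = fits_in_context(small_docs)
need_rag = not fits_in_context(large_docs)
```

If the check passes with room to spare, paste the documents in. If it is close, count with the model's actual tokenizer before trusting the estimate.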
The one-paragraph summary
RAG is "find the relevant chunks of text, paste them at the top of the prompt, ask the question". That's the system. Everything else — vector databases, embedding models, re-rankers, multi-vector search — is optimization of those three steps. Worth learning, but only after you've built the dumb version and seen it work.
Start with paste-everything-in. Move to RAG when the everything is too big. That's the whole roadmap.