Retrieval-Augmented Generation (RAG) combines a retrieval system with a generative language model, so that responses are grounded in the content of a specific corpus rather than relying only on what the model learned during training. It is the dominant pattern for building question-answering and assistant systems over private or up-to-date knowledge bases, and is typically built around a large language model (LLM).

How it works

  1. Retrieval: the system retrieves relevant documents or passages from a corpus based on the input query. This is typically done with dense vector search, where both the query and documents are embedded into a high-dimensional space so semantically similar content can be found.
  2. Augmentation: the retrieved documents are combined with the original query to form a richer input for the generator.
  3. Generation: a generative model takes the augmented input and produces a response informed by the retrieved content, not just its pre-trained knowledge.
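The three steps above can be sketched end to end. This is a minimal toy illustration, not a production implementation: it uses a bag-of-words count vector with cosine similarity in place of a learned dense embedding model, and it stops at the augmented prompt, which in a real system would be passed to the generative model. All function and variable names here (`embed`, `retrieve`, `augment`, `corpus`) are illustrative choices, not a standard API.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a word-count vector with punctuation stripped.
    # A real RAG system would use a learned dense embedding model.
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Step 1 (Retrieval): rank documents by similarity to the query.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def augment(query, passages):
    # Step 2 (Augmentation): combine retrieved passages with the query
    # into a richer input for the generator.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The warranty covers parts and labour for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]

query = "How long is the warranty?"
prompt = augment(query, retrieve(query, corpus, k=1))
# Step 3 (Generation) would pass `prompt` to an LLM; here we just print it.
print(prompt)
```

The key design point is that the generator never sees the whole corpus, only the top-k passages the retriever selects, so corpus size is bounded by the index rather than the model's context window.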

Benefits

  • Grounding: answers can cite specific sources from the corpus, which reduces hallucination.
  • Freshness: the corpus can be updated without retraining the model.
  • Scalability: a single model can serve many different domains by swapping the underlying corpus.
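The freshness and scalability points follow from where the knowledge lives: in an index of documents, not in model weights. A minimal sketch (the `index` structure and helper names are assumptions for illustration; a real index would store embedding vectors alongside the text):

```python
# The corpus lives in an index, separate from the model. Adding or
# removing a document only touches this index; the generator's
# weights are never retrained.
index = []  # list of (doc_id, text); a real system would also store vectors

def add_document(doc_id, text):
    # Embedding and indexing the new text is the only work required.
    index.append((doc_id, text))

def remove_document(doc_id):
    # Stale knowledge is removed by deleting its index entries.
    index[:] = [(i, t) for (i, t) in index if i != doc_id]

add_document("faq-1", "Returns are accepted within 30 days.")
add_document("faq-2", "The warranty lasts two years.")
remove_document("faq-1")   # outdated policy dropped without retraining
print([i for i, _ in index])
```

Swapping the entire corpus for a different domain is the same operation at scale: point the retriever at a different index and the same model serves the new domain.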

Applications

  • Customer support over internal knowledge bases
  • Research assistants that synthesise information from papers
  • Code assistants that retrieve from a specific codebase
