Robust RAG

Published on May 14, 2025

Retrieval Augmented Generation (RAG) has become a cornerstone for building sophisticated AI applications that can leverage vast amounts of private or specialized data. However, making these RAG systems "robust" is a multi-faceted challenge. I break down "robustness" into three key areas: information gathering and extraction, precision and recall upon retrieval, and evaluations.

Considerations before you start

1. Information gathering and extraction

The old adage "Garbage in, garbage out" holds particularly true for RAG systems. The quality of your AI's output is fundamentally limited by the quality of the data it can access and understand. Information extraction is therefore critical: even if you ground your system properly, poor extraction is often the primary reason for suboptimal performance and can lead to hallucinations where the system misquotes data.

If you connect your RAG system to incorrect or outdated data sources, or if you fail to properly extract information from complex formats (like intricate PowerPoint presentations where scraping raw text loses semantic meaning), your system is handicapped from the start. Challenges include dealing with images containing text, poorly structured documents, brand logos used in place of text, or real-time data feeds.

Effective techniques involve Optical Character Recognition (OCR) – for instance, using packages like Tesseract or services like Azure Form Recognizer to convert PDFs to structured JSON – and leveraging vision-capable AI models (such as GPT-4o). These vision models can go beyond simple transcription to interpret diagrams or extract semantic meaning from slide layouts, significantly improving data ingestion quality. For complex tables, collapse multi-index structures during preprocessing so that each row retains its full header context.
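To make this concrete, here is a minimal sketch of two of the techniques above: OCR with Tesseract and collapsing a multi-index table with pandas. It assumes the tesseract binary plus the pytesseract, Pillow, and pandas packages are installed; the file names are hypothetical.

```python
# A minimal sketch, not a production pipeline; file names are hypothetical.
from PIL import Image
import pandas as pd
import pytesseract

# 1. OCR: pull raw text out of a scanned page or slide export.
text = pytesseract.image_to_string(Image.open("slide_export.png"))

# 2. Collapse a two-level (multi-index) table header into flat,
#    self-describing column names so each row survives chunking
#    with its full context intact.
df = pd.read_excel("report.xlsx", header=[0, 1])
df.columns = [" / ".join(map(str, col)).strip() for col in df.columns]
```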

Post-extraction, the data must be chunked and indexed for efficient retrieval.
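As an illustration, here is a minimal sketch of one common approach, fixed-size chunking with overlap; the sizes are illustrative and should be tuned per corpus.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so that content spanning a
    chunk boundary remains retrievable from at least one chunk."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks
```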

2. Precision and recall upon retrieval

Once data resides in a vector database (or another retrieval store, for example a graph database), the next challenge is retrieving the most relevant information to answer user queries. This is where precision and recall measure success: precision is the fraction of retrieved chunks that are actually relevant, while recall is the fraction of all relevant chunks that were retrieved.
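A small sketch of how these metrics can be computed at a cutoff k, assuming you have hand-labeled the relevant chunk IDs per query (ground truth is a prerequisite for both metrics):

```python
def precision_recall_at_k(
    retrieved: list[str],  # chunk IDs, best match first
    relevant: set[str],    # hand-labeled ground-truth chunk IDs
    k: int,
) -> tuple[float, float]:
    """Precision@k and recall@k for a single query."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```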

Common techniques for improving precision and recall include query rewriting, hybrid search (combining keyword-based and vector search), reranking retrieved candidates with a cross-encoder, and metadata filtering.
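As one example, here is a minimal sketch of reciprocal rank fusion (RRF), a common way to merge a keyword result list with a vector-search result list into a single ranking; the constant 60 is the value typically used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, so items ranked highly by multiple retrievers rise to
    the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. reciprocal_rank_fusion([keyword_results, vector_results])
```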

A more recent development is agentic RAG. The core idea is an AI agent that retrieves information, reflects on its quality and completeness, and iteratively refines its search with new queries if needed. Beyond iterative refinement, agentic strategies can also involve specializing indices for different types of information.
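The retrieve-reflect-refine loop can be sketched as follows; search, is_sufficient, and refine_query are hypothetical stand-ins for your retriever, an LLM-based sufficiency check, and an LLM-based query rewriter.

```python
from typing import Callable

def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],               # hypothetical retriever
    is_sufficient: Callable[[str, list[str]], bool],  # hypothetical LLM completeness check
    refine_query: Callable[[str, list[str]], str],    # hypothetical LLM query rewriter
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, reflect on the result, and re-query until the gathered
    context looks sufficient or the round budget runs out."""
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(search(query))
        if is_sufficient(question, context):
            break
        query = refine_query(question, context)
    return context
```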

3. Evaluations

Even with the best extraction and retrieval strategies, outputs may still falter in real-world scenarios. Continuous evaluation is essential to measure performance, catch regressions, and guide ongoing improvements.

Evaluation is both a development process and a set of tooling, often integrated into LLMOps (Large Language Model Operations). It involves establishing a framework to validate the quality of the system and its prompts.

The diagram below illustrates the idea of a continuous improvement cycle with evaluations at its core. This requires gathering test examples both upfront (often just a few suffice) and continuously. Evaluation techniques can include (fuzzy) matching, LLM-graded evaluations, and human evaluation.

[Diagram: continuous improvement cycle. Gather test data → run flow on test data → evaluate system → deploy & monitor → review usage & complaints → add tests from feedback & bugs → improve or fix system]
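As a minimal illustration of the fuzzy-matching technique, here is a sketch of an evaluation harness; run_flow and the test cases are hypothetical placeholders for your own pipeline and labeled examples, and the 0.8 threshold is only a starting point.

```python
from difflib import SequenceMatcher
from typing import Callable

def fuzzy_score(answer: str, expected: str) -> float:
    """Character-level similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

def evaluate(
    run_flow: Callable[[str], str],     # hypothetical end-to-end RAG pipeline
    test_cases: list[tuple[str, str]],  # (question, expected answer) pairs
    threshold: float = 0.8,
) -> float:
    """Fraction of test cases whose answer is close enough to the label."""
    passed = sum(
        fuzzy_score(run_flow(question), expected) >= threshold
        for question, expected in test_cases
    )
    return passed / len(test_cases)
```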

Generation

Note that I exclude the generation step. While it used to be a problem, with today's foundation models the generation or summarization of answers is rarely the bottleneck.