Robust RAG

Published on May 14, 2025

Retrieval Augmented Generation (RAG) has become a cornerstone for building sophisticated AI applications that can leverage vast amounts of private or specialized data. However, making these RAG systems "robust" is a multi-faceted challenge. I break down "robustness" into three key areas: information gathering and extraction, precision and recall upon retrieval, and evaluations.

Considerations before you start

1. Information gathering and extraction

The old adage "Garbage in, garbage out" holds particularly true for RAG systems. The quality of your AI's output is fundamentally limited by the quality of the data it can access and understand. Information extraction is therefore critical: even if you ground your system properly, poor extraction is often the primary reason for suboptimal performance and can lead to hallucinations where the system misquotes data.

If you connect your RAG system to incorrect or outdated data sources, or if you fail to properly extract information from complex formats (like intricate PowerPoint presentations where scraping raw text loses semantic meaning), your system is handicapped from the start. Challenges include dealing with images containing text, poorly structured documents, brand logos used in place of text, or real-time data feeds.

Effective techniques involve Optical Character Recognition (OCR) – for instance, using packages like Tesseract or services like Azure Form Recognizer to convert PDFs to structured JSON – and leveraging vision-capable AI models (such as GPT-4o). These vision models can go beyond simple transcription to interpret diagrams or extract semantic meaning from slide layouts, significantly improving data ingestion quality. For complex tables, collapse multi-index structures during preprocessing so that each row retains its full header context.
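To make this concrete, here is a minimal sketch of two of the techniques above: OCR with Tesseract and collapsing a multi-index table with pandas. It assumes the tesseract binary plus the pytesseract, Pillow, and pandas packages are installed; the file names are hypothetical.

```python
# A minimal sketch, not a production pipeline; file names are hypothetical.
from PIL import Image
import pandas as pd
import pytesseract

# 1. OCR: pull raw text out of a scanned page or slide export.
text = pytesseract.image_to_string(Image.open("slide_export.png"))

# 2. Collapse a two-level (multi-index) table header into flat,
#    self-describing column names so each row survives chunking
#    with its full context intact.
df = pd.read_excel("report.xlsx", header=[0, 1])
df.columns = [" / ".join(map(str, col)).strip() for col in df.columns]
```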

Post-extraction, the data must be chunked and indexed for efficient retrieval.
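As an illustration, here is a minimal sketch of one common approach, fixed-size chunking with overlap; the sizes are illustrative and should be tuned per corpus.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so that content spanning a
    chunk boundary remains retrievable from at least one chunk."""
    assert overlap < chunk_size, "overlap must be smaller than chunk_size"
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks
```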

2. Precision and recall upon retrieval

Once data resides in a vector database (or another retrieval store, for example a graph database), the next challenge is retrieving the most relevant information to answer user queries. This is where precision and recall measure success: precision is the fraction of retrieved chunks that are actually relevant, while recall is the fraction of all relevant chunks that were retrieved.
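A small sketch of how these metrics can be computed at a cutoff k, assuming you have hand-labeled the relevant chunk IDs per query (ground truth is a prerequisite for both metrics):

```python
def precision_recall_at_k(
    retrieved: list[str],  # chunk IDs, best match first
    relevant: set[str],    # hand-labeled ground-truth chunk IDs
    k: int,
) -> tuple[float, float]:
    """Precision@k and recall@k for a single query."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```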

Common techniques for improving precision and recall include query rewriting, hybrid search (combining keyword-based and vector search), reranking retrieved candidates with a cross-encoder, and metadata filtering.
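As one example, here is a minimal sketch of reciprocal rank fusion (RRF), a common way to merge a keyword result list with a vector-search result list into a single ranking; the constant 60 is the value typically used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, so items ranked highly by multiple retrievers rise to
    the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. reciprocal_rank_fusion([keyword_results, vector_results])
```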

A more recent development is agentic RAG. The core idea is an AI agent that retrieves information, reflects on its quality and completeness, and iteratively refines its search with new queries if needed. Beyond iterative refinement, agentic strategies can also involve specializing indices for different types of information.
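The retrieve-reflect-refine loop can be sketched as follows; search, is_sufficient, and refine_query are hypothetical stand-ins for your retriever, an LLM-based sufficiency check, and an LLM-based query rewriter.

```python
from typing import Callable

def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],               # hypothetical retriever
    is_sufficient: Callable[[str, list[str]], bool],  # hypothetical LLM completeness check
    refine_query: Callable[[str, list[str]], str],    # hypothetical LLM query rewriter
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, reflect on the result, and re-query until the gathered
    context looks sufficient or the round budget runs out."""
    query, context = question, []
    for _ in range(max_rounds):
        context.extend(search(query))
        if is_sufficient(question, context):
            break
        query = refine_query(question, context)
    return context
```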

3. Evaluations

Even with the best extraction and retrieval strategies, outputs may still falter in real-world scenarios. Continuous evaluation is essential to measure performance, catch regressions, and guide ongoing improvements.

Evaluation is both a development process and a set of tooling, often integrated into LLMOps (Large Language Model Operations). It involves establishing a framework to validate the quality of the system and its prompts.

The diagram below illustrates the idea of a continuous improvement cycle with evaluations at its core. This requires gathering test examples both upfront (often just a few suffice) and continuously. Evaluation techniques can include (fuzzy) matching, LLM-graded evaluations, and human evaluation.

[Diagram: continuous improvement cycle. Gather test data → run flow on test data → evaluate system → deploy & monitor → review usage & complaints → add tests from feedback & bugs → improve or fix system]
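As a minimal illustration of the fuzzy-matching technique, here is a sketch of an evaluation harness; run_flow and the test cases are hypothetical placeholders for your own pipeline and labeled examples, and the 0.8 threshold is only a starting point.

```python
from difflib import SequenceMatcher
from typing import Callable

def fuzzy_score(answer: str, expected: str) -> float:
    """Character-level similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

def evaluate(
    run_flow: Callable[[str], str],     # hypothetical end-to-end RAG pipeline
    test_cases: list[tuple[str, str]],  # (question, expected answer) pairs
    threshold: float = 0.8,
) -> float:
    """Fraction of test cases whose answer is close enough to the label."""
    passed = sum(
        fuzzy_score(run_flow(question), expected) >= threshold
        for question, expected in test_cases
    )
    return passed / len(test_cases)
```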

Generation

Note that I exclude the generation step. While it used to be a problem, with today's foundation models the generation or summarization of answers is rarely the bottleneck.