What is RAG?

Retrieval Augmented Generation (RAG) is a technique for supplementing large language models (LLMs) with additional knowledge: reliable facts from specific sources, private or personal information that is not publicly available, or simply fresh news. Typically, this knowledge is supplied to the model from a vector database. For example, you can add your company’s internal data, the latest news, or data from your personal devices to get responses grounded in your own context. The model can then respond more like a domain expert than with generalized answers. Grounding responses in retrieved data also helps reduce hallucinations.

Why RAG?

Let’s take a look at the key benefits that RAG offers:

  • Customization and Adaptation: RAG helps LLMs to tailor responses to specific domains or use cases by using vector databases to store and retrieve domain-specific information. It turns general intelligence into expert intelligence.
  • Contextual Relevance: By incorporating information retrieved from a large corpus of text, RAG models can generate contextually relevant responses. This improves response quality compared to generation without retrieval.
  • Accuracy and Diversity: Incorporating external information also helps the LLM generate more informative and accurate responses and keeps its answers up to date. It also helps avoid repetitive or generic responses, allowing for more diverse and interesting conversations.
  • Cost-effective Implementation: RAG requires less task-specific training data than fine-tuning a foundation model. While fine-tuning requires lots of labeled data, RAG can rely on external sources. This is particularly beneficial where annotated training data is limited or expensive to obtain.
  • Transparency: RAG systems can make their responses more transparent by explicitly indicating the sources of the retrieved information. This lets users see what a response is based on and helps build trust in the generated output.

RAG is therefore suitable for applications that need access to a large amount of specialized data. One example is a customer support bot that pulls details from FAQs and generates coherent, conversational responses. Another is an email drafting tool that fetches information about recent meetings and generates a personalized summary.

How retrieval augmented generation works

Let’s discuss the mechanics of how RAG operates with databases, covering its main stages from dataset creation to response generation (see figure).

Retrieval augmented generation diagram


  • DB creation: Creation of the external dataset
    Before RAG can be used, the vector database has to be created. New data that lies outside the LLM’s training dataset (e.g. up-to-date or domain-specific information) is identified and added to the dataset. This dataset is then converted into vector embeddings by an AI model (an embedding model) and stored in the vector database.

  • DB in use: Retrieval of relevant information
    Once a query comes in, it is also converted into a vector embedding, which is then used to retrieve the most relevant results from the database. To achieve this, RAG uses semantic search, also known as vector search, to capture the meaning of the user’s query and its context and to retrieve contextually relevant information from a large dataset. Vector search goes beyond keyword matching and focuses on semantic relationships, which improves the quality of the retrieved information and, in turn, of the generated responses.
  • DB in use: Augmentation
    At this stage, the user’s query is augmented with the relevant data retrieved in the previous stage. Often, only the top results from the vector search are treated as relevant data. Many databases offer additional filtering techniques at this point.
  • Generation
    The augmented query is sent to the LLM, which generates an answer grounded in the retrieved data.
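
The stages above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular vector database works: `embed`, `ToyVectorStore` and `augment` are hypothetical names, and the bag-of-words `embed` function is a toy stand-in for a real embedding model (it matches keywords rather than meaning). The generation step is left out; the resulting prompt would be passed to an LLM of your choice.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; a real vector database would index the vectors."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        # DB creation: embed each document once and keep the vector.
        self.items.append((text, embed(text)))

    def search(self, query, top_k=2):
        # Retrieval: embed the query and rank documents by similarity.
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]


def augment(query, passages):
    # Augmentation: put the retrieved passages in front of the user's question.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


store = ToyVectorStore()
store.add("Our support desk is open Monday to Friday from 9 to 17.")
store.add("Refunds are processed within 14 days of the return.")
store.add("The office cafeteria serves lunch at noon.")

query = "When are refunds processed?"
passages = store.search(query, top_k=1)
prompt = augment(query, passages)
print(prompt)
```

Here `prompt` holds the augmented query. Swapping `embed` for a real embedding model is what turns this keyword overlap into the semantic search described above.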

The Role of Long Context Windows

The rise of new LLMs with long context windows (1+ million tokens), such as Gemini 1.5, has sparked a discussion on whether long context windows will replace RAG. A long context window lets users incorporate huge amounts of data directly into a query, giving the LLM more context to work with.

Long context windows and RAG each have pros and cons, and neither will kill the other. Rather than being mutually exclusive, they can complement each other. Large context windows can enhance RAG applications by accommodating far more retrieved data in the prompt. However, a model’s ability to accept a long context does not mean that it can use all of that information effectively. If the relevant information sits in the middle of the context window, the LLM tends to recall it less reliably than information at the beginning. For this reason, combining RAG with a long context window usually benefits from a reranking step (e.g. with a cross-encoder). The reranking model first calculates a matching score between the query and each candidate returned by vector search (e.g. a document). It then reorders the search results so that the most relevant ones are prioritized.
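
To make the reranking step more concrete, here is a rough sketch. It assumes the candidates have already been returned by vector search. `toy_pair_score` is a hypothetical stand-in for a cross-encoder: a real reranker is a trained model that reads the query and the document together, while this toy version only measures word overlap.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_pair_score(query, document):
    """Hypothetical stand-in for a cross-encoder: scores a (query, document) pair."""
    q, d = tokens(query), tokens(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank(query, candidates, score_fn=toy_pair_score, keep=None):
    """Reorder candidates so the best-matching ones come first."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked if keep is None else ranked[:keep]


# Candidates in the order an (assumed) vector search returned them.
candidates = [
    "The cafeteria serves lunch at noon.",
    "Refunds need the original receipt.",
    "Refunds are processed within 14 days.",
]
query = "How fast are refunds processed?"
ranked = rerank(query, candidates, keep=2)
print(ranked)
```

Because the best matches end up first, they land at the beginning of the prompt, where the model is more likely to make use of them.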

Future Directions of RAG

While RAG offers numerous benefits, there are still opportunities for improvement. Researchers are exploring ways to enhance RAG by combining it with other techniques, such as fine-tuning (RAFT) or long context windows (in combination with reranking). Another direction of research is expanding RAG capabilities by advancing data handling (including multimodal data), evaluation methodologies, and scalability. Finally, RAG is also affected by new advances in optimizing LLMs to run locally on resource-constrained devices (mobile, IoT), along with the emergence of the first on-device vector database. As a result, RAG can now be performed directly on your mobile device, prioritizing privacy, low latency, and offline capabilities.