RAG.pro

Why Episodic Data is the Correct Choice for RAG Systems in LLMs

When implementing retrieval-augmented generation (RAG) systems, one of the most crucial decisions is determining what type of data to retrieve and inject into the model's prompt. The choice between episodic (example-based) data and informative (theory-based) data has significant implications for the effectiveness of your system. In this article, we will explore why episodic data is the superior choice for RAG, particularly in the context of large language models (LLMs).

Understanding Episodic vs. Informative Data

Episodic data shows how something is done through step-by-step examples. Imagine using Chegg to solve a math problem: instead of merely reading the theory, you're walked through each step to reach the solution. This is in contrast to informative data, which is theoretical and abstract, like a textbook that explains concepts without working through practical applications.
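To make the distinction concrete, here is a minimal sketch of what the two kinds of retrieved documents might look like side by side. The field names (`kind`, `text`) and the math example are illustrative assumptions, not part of any particular retrieval schema:

```python
# An informative document: states the theory, no worked steps.
informative_doc = {
    "kind": "informative",
    "text": "The quadratic formula solves ax^2 + bx + c = 0 "
            "via x = (-b ± sqrt(b^2 - 4ac)) / 2a.",
}

# An episodic document: the same knowledge, shown as a worked example.
episodic_doc = {
    "kind": "episodic",
    "text": (
        "Solve x^2 - 5x + 6 = 0:\n"
        "Step 1: identify a=1, b=-5, c=6.\n"
        "Step 2: discriminant = (-5)^2 - 4*1*6 = 1.\n"
        "Step 3: x = (5 ± 1) / 2, so x = 3 or x = 2."
    ),
}
```

Both documents encode the quadratic formula, but only the episodic one contains the token sequence of the formula actually being applied.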

For LLMs, episodic data yields markedly better results when used in RAG systems. The reason lies in how these models are trained and how they process input tokens. While you can fine-tune a model to retrieve informative data and generate a solution, the outcome will never be as effective as when you provide the model with episodic data relevant to the query. The key difference is that episodic data aligns better with the model's tokenization and generation process, leading to higher-quality responses.
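As a sketch of what "injecting episodic data into the prompt" can look like in practice (the function name and prompt layout below are illustrative assumptions, not a specific framework's API):

```python
def build_rag_prompt(query: str, retrieved_examples: list[str]) -> str:
    """Assemble a prompt that front-loads retrieved episodic examples
    (worked, step-by-step solutions) ahead of the user's query."""
    parts = ["Here are worked examples relevant to the task:"]
    for i, example in enumerate(retrieved_examples, start=1):
        parts.append(f"### Example {i}\n{example}")
    parts.append(f"### Task\n{query}")
    return "\n\n".join(parts)


# Usage: the retrieved example supplies the concrete tokens
# the model can reuse when generating its answer.
prompt = build_rag_prompt(
    "Reverse a linked list in Python.",
    [
        "def reverse(head):\n"
        "    prev = None\n"
        "    while head:\n"
        "        head.next, prev, head = prev, head, head.next\n"
        "    return prev"
    ],
)
```

The ordering matters: placing the examples before the task means the model conditions on them when it starts generating.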

The Challenge of Episodic Data

Despite its advantages, episodic data is much more difficult to obtain. Most of it isn't tracked or documented, whereas informative data is readily available in textbooks, blogs, and technical documentation. This scarcity of episodic data poses a significant challenge but also underscores its value.

A Practical Example

Consider a new coding language that has just been released. You set up two RAG systems to see which can code better in this new language. The first system has access to episodic data—examples of the language being used in practice. The second system relies on informative data—documentation explaining what each class and function does.

While you could create a sophisticated multi-step query system that retrieves relevant documentation and pieces together an answer, this approach is rarely successful in practice. Instead, injecting episodic data—examples of the language in action—into the prompt provides the model with the correct tokens, resulting in a more accurate and higher-quality response.
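A toy sketch of the episodic system's retrieval step makes the "correct tokens" point tangible. Here, plain token overlap stands in for a real embedding-based retriever, and the two-document corpus (a code snippet in a made-up language vs. a line of its documentation) is entirely hypothetical:

```python
def token_overlap(query: str, doc: str) -> float:
    """Fraction of the query's tokens that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve_episodic(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents whose surface tokens best match the query."""
    return sorted(corpus, key=lambda doc: token_overlap(query, doc), reverse=True)[:k]


corpus = [
    "fn main { print hello }",                          # episodic: code in action
    "The print statement writes text to the console.",  # informative: docs
]
best = retrieve_episodic("write fn to print hello", corpus, k=1)
```

Because the episodic snippet shares the query's surface tokens (`fn`, `print`, `hello`), it scores higher than the documentation line, and those are exactly the tokens the model needs to emit in its answer.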

Tokens: The Key to Effective RAG

When designing RAG systems, it's crucial to think about what you're inserting into the prompt. You're not just providing data; you're supplying the tokens that the model will use to generate an answer. If you have two LLMs designed for coding in Python, which one will perform better: the model trained on the best Python courses or the one trained on the best Python code?

The answer is clear: the model trained on the best Python code will outperform the other, because it has been exposed to actual examples of coding. This distinction between theory (informative data) and practice (episodic data) is vital for RAG systems. For humans, the best data might be course content, but for an LLM, the best data is the examples themselves.

Conclusion

In summary, when working with RAG systems, the type of data you inject into the prompt is critical to the output you receive. Episodic data, which provides examples of tasks being done, offers a significant advantage over informative data, which is theoretical. By focusing on providing the correct tokens through episodic data, you can significantly enhance the performance of your RAG system. Remember, what you put into these models is extremely influential in determining what you get out.