Will RAG Be Needed in the Future?
The necessity of RAG in our LLM applications has been in question ever since it was first introduced. Although RAG is not a new concept, I believe it WILL be needed for the future of LLM applications. This article explains why I believe this:
Chat with the Web
Let's first think about where RAG is being applied in LLM applications today to see the impact it has and whether it's a key part of the application. Let's start with the hottest new player incorporating RAG: Perplexity. Perplexity uses their own web crawler (similar to Googlebot) that follows links to new webpages and indexes the web. Once Perplexity has this data organized in a searchable way, they search it with your query, take the top-k results, and feed those into the prompt (along with your query) to an LLM. This is RAG. Google has also implemented a similar (some might even say the same) method in some of their searches, as you can see here. RAG is a very common technique for keeping LLMs up to date with the latest data from the internet.
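Here's a minimal sketch of that loop. Everything in it is an assumption: `search_index` and `llm` are stand-ins for Perplexity's private crawler index and model, which we can't see.

```python
# Hypothetical sketch of the web-RAG loop described above; `search_index`
# and `llm` are stand-ins for Perplexity's private index and model.

def answer_with_web_rag(query: str, search_index, llm, k: int = 5) -> str:
    # 1. Retrieve the top-k indexed pages for the user's query.
    results = search_index.search(query, top_k=k)

    # 2. Build a prompt that pairs the retrieved snippets with the query.
    context = "\n\n".join(f"[{r.url}]\n{r.snippet}" for r in results)
    prompt = (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generate the final answer from the augmented prompt.
    return llm.complete(prompt)
```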
Chat with Documentation
The other main use-case we're seeing in this domain is RAG over software documentation websites. It makes a lot of sense for 3 main reasons:
- Software documentation is updated extremely frequently
- Documentation sites often go so deep from the root URL that many common web crawlers don't crawl them
- Setting up a RAG pipeline to documentation can help make a very specialized AI for that software
Examples of this are spreading like wildfire across documentation websites, with many YC startups such as Mendable, Inkeep, and others taking over the space by offering chatbots for documentation sites. Unfortunately, these haven't been as good as I'd like, and I explain why here: Why RAG Chat Bots Aren't Good (right now).
Chat with Code
One of the coolest startups in this space is Sourcegraph, which started a few years ago as a better way of searching GitHub, much as Google was a better way of searching the web than the other engines of its time. Sourcegraph then found itself in a great position to simply connect an LLM to its search capability so that people can chat with any GitHub repo. But how does it work, and is it any good?
Is it any good: It isn't that good at coding applications right now, but since you can connect it to any GitHub repo (or to local code via their VS Code extension), it's a good way of searching and understanding new code. I'd recommend trying it out if you're working with a new open-source project.
How it works: It works similarly to every other RAG application (see How RAG Applications Work) except for how it gets and stores its data. Since this functionality is closed source, these are just my guesses. What we know for sure is that they use the GitHub API to retrieve new repositories and index their data. They then likely use an LLM to write a query in their unique query format and search the indexed data with it. Finally, they send the raw files (or chunks from the files) into the LLM prompt so it can generate the answer.
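To make those guesses concrete, here's a speculative sketch of just the indexing step. The GitHub endpoints are real, but `embed` and `vector_store` are placeholders for whatever Sourcegraph actually uses.

```python
# Speculative sketch of the indexing step. The GitHub endpoints are real,
# but `embed` and `vector_store` are placeholders, and real use would
# need an auth token to avoid rate limits.

import requests

def index_repo(owner: str, repo: str, embed, vector_store, branch: str = "main"):
    # One recursive tree call lists every file in the repo.
    tree = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}",
        params={"recursive": "1"},
    ).json()

    for item in tree.get("tree", []):
        if item["type"] != "blob":
            continue
        # Fetch the raw file contents.
        raw = requests.get(
            f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{item['path']}"
        ).text
        # Embed and store each file; a real pipeline would chunk large files.
        vector_store.add(id=item["path"], vector=embed(raw), text=raw)
```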
I've shared some of my opinions on Sourcegraph here if you're interested.
Chat with Private Data
Now for the Mac Daddy, the killer rebuttal to the anti-RAG argument: private data.
When people ask me how RAG is important and why they should even care, I simply say this: "Almost every company you hear about that is 'incorporating AI' is just doing RAG on their private data". And I still believe this statement to be true. Companies have all of this private data that they don't really know what to do with. And since the AI boom began, CTOs have been scurrying to find ways to have their company "leverage its capabilities". The obvious answers, then, were RAG or fine-tuning. It's all use-case dependent, but I believe RAG is much more widely used than fine-tuning right now. So, how have companies been using RAG for their private data?
Automation
One of the most effective ways companies have incorporated RAG is by automating tedious jobs that required a human to look through a file and/or Excel sheet to accomplish a task. Let's use a car warranty company as an example: A customer calls in saying they have a dent. The representative asks for the customer's authentication information and maybe the car type. Then the representative searches a data table to find the customer's subscription level. Then the representative combs through a huge PDF to check whether that subscription level covers the dent. This takes time. Here's how we automate it with RAG: We set up a RAG bot with a workflow that takes in the customer name and maybe a customer ID, runs an SQL query against a database for the customer's subscription level, and then performs a simple chat-with-PDF task to see if the customer is covered and tells the representative the relevant information. This process can be completed in a few seconds and is only getting faster.
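Here's what that workflow might look like in code. This is a toy sketch: the database schema, file names, and `llm` object are all made up for illustration.

```python
# Toy sketch of the warranty workflow above; the database schema,
# `llm` object, and PDF text are assumptions for illustration.

import sqlite3

def check_coverage(customer_id: str, issue: str, llm, warranty_pdf_text: str) -> str:
    # Step 1: SQL lookup of the customer's subscription level.
    conn = sqlite3.connect("customers.db")
    row = conn.execute(
        "SELECT subscription_level FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return "Customer not found."
    level = row[0]

    # Step 2: chat-with-PDF: ask the LLM whether this plan covers the issue.
    prompt = (
        f"Warranty terms:\n{warranty_pdf_text}\n\n"
        f"Subscription level: {level}\n"
        f"Reported issue: {issue}\n"
        "Is this issue covered under this subscription level? Answer briefly."
    )
    return llm.complete(prompt)
```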
Customer Support
This builds on the previous example but takes the human representative out of the equation. For this use-case, Klarna has become the predominant story.
That example means replacing 70% of your customer service team with a chatbot that has access to the relevant docs and is prompted to respond like your customer service team. What does this look like under the hood, though?
How it works: To create this customer service chatbot, this is what I would do: First, I'd create a retrieval database containing the documentation, FAQs, or whatever other data sources hold common answers to your customers' questions. Then, I'd take past customer emails, calls, and conversations with previous customer support agents and use the successful ones to fine-tune an LLM so that it mirrors the ideal responses of a customer support agent. To take this a step further, I'd add a tool called "Contact Human" that gets a human support agent on the line when the customer asks for one or the AI realizes it's stuck. This would likely take care of the majority of customer support messages. A cool addition would be a semantic router that takes the customer's initial message and determines whether a human should be contacted or the AI can handle the request. A very cool video on how to do that here.
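As a rough illustration of that routing idea (not the actual semantic-router library), here's a cosine-similarity check against example "escalate" utterances. The `embed` function, the examples, and the threshold are all assumptions.

```python
# Rough illustration of semantic routing, assuming a generic `embed`
# function; the example utterances and threshold are made up.

import numpy as np

HUMAN_EXAMPLES = [
    "I want to speak to a person",
    "get me a human",
    "this bot isn't helping",
]

def route_message(message: str, embed) -> str:
    # Compare the message to each "escalate" example by cosine similarity.
    msg = np.array(embed(message))
    for example in HUMAN_EXAMPLES:
        ex = np.array(embed(example))
        cos = msg @ ex / (np.linalg.norm(msg) * np.linalg.norm(ex))
        if cos > 0.8:  # threshold is a guess; tune it on real traffic
            return "contact_human"
    return "ai_handles_it"
```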
Compliance
A very interesting and less-talked-about use-case is compliance. Compliance isn't sexy, which is likely why no one talks about it. But AI could completely transform the compliance industry. Compliance is heavy in many industries, such as those that work closely with government, finance, and law. It typically involves a person who has a document containing a specific set of rules. This person scans through the company's documents and checks whether they follow the rules. If they catch something off, they raise a flag by contacting someone about it. If they miss an error, it could cost the company dearly. This intrigues me because (1) there are a bunch of compliance people making six figures, (2) these people likely add more value to the company than their salary, because certain compliance penalties are so large, and most importantly (3) it's not sexy or cool, which means a majority of people won't want to work on it.
But how would RAG work with compliance?
This is very use-case specific, so if you have any questions, feel free to contact me.
But generally, the important part of a chatbot of this nature is catching compliance issues. Depending on the size of the document, I would ideally want the entire document in the prompt so that the model doesn't miss anything. But if there are multiple documents/data sources, or the document is just too big for the model's context window, then I'd set up a SOTA RAG pipeline over the data sources.
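In code, that decision might look like the sketch below. The 4-characters-per-token estimate, the context limit, and the `llm`/`rag_pipeline` objects are all placeholders.

```python
# Sketch of the "fit it all in the prompt if you can" decision; the
# token estimate, context limit, and helper objects are placeholders.

def review_for_compliance(rules: str, document: str, llm, rag_pipeline,
                          context_limit: int = 100_000) -> str:
    # Rough token estimate: ~4 characters per token for English text.
    approx_tokens = (len(rules) + len(document)) // 4

    if approx_tokens <= context_limit:
        # Small enough: put everything in the prompt so nothing is missed.
        prompt = (
            f"Rules:\n{rules}\n\nDocument:\n{document}\n\n"
            "List every rule violation you can find."
        )
        return llm.complete(prompt)

    # Too big: fall back to retrieving the most relevant chunks per rule.
    return rag_pipeline.check(rules=rules, document=document)
```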
Knowledge Management
Does your Google Drive have so many files that you're losing track of things? Now imagine you were a large company with 1000x more files to manage. This can become overwhelming and cluttered, so here's how RAG can help. The solution would likely be a natural-language search engine over your documents. Even cooler, you can add useful metadata such as a "Last Modified" timestamp to apply "data freshness" to your vector store, prioritizing newer data over older data.
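One way to apply that freshness idea is to re-rank search results by blending vector similarity with an exponential recency decay. The blend weights and half-life below are guesses you'd tune on your own data.

```python
# Illustrative "data freshness" re-ranking; the blend weights and
# half-life are guesses, and results are assumed to be dicts with a
# "similarity" score and a "last_modified" Unix timestamp.

import math
import time

def rerank_by_freshness(results: list[dict], half_life_days: float = 90.0) -> list[dict]:
    now = time.time()

    def score(r: dict) -> float:
        age_days = (now - r["last_modified"]) / 86_400
        # Freshness halves every `half_life_days` days.
        freshness = math.exp(-math.log(2) * age_days / half_life_days)
        return 0.7 * r["similarity"] + 0.3 * freshness

    return sorted(results, key=score, reverse=True)
```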
So how would you do this?
Thinking back to How RAG Works, you start with: how do I get relevant data into the prompt? For this example, let's stick with Google Drive and use my favorite file type, .ipynb (Google Colab notebooks). I'd use a Python script (sketched after the list below) that loads all of my .ipynb files from my Google Drive into a JSON file with these fields:
- 'Content': Markdown of the Colab file
- 'Source': file location/URL of the file
- 'Metadata':
  - 'Last Modified': timestamp of the last modification
  - etc.
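Here's a minimal version of that script. It assumes the Drive folder is synced to a local directory (the path is made up) and converts notebooks to Markdown by naively joining cells rather than using a proper exporter.

```python
# Minimal sketch of the loader script; assumes Google Drive is synced to
# a local folder, and converts notebooks to Markdown by naive cell joins.

import json
from pathlib import Path

def notebook_to_markdown(nb: dict) -> str:
    # .ipynb files are JSON; join cell sources, indenting code cells so
    # they render as Markdown code blocks.
    parts = []
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell["cell_type"] != "markdown":
            src = "\n".join("    " + line for line in src.splitlines())
        parts.append(src)
    return "\n\n".join(parts)

records = []
for path in Path("~/GoogleDrive").expanduser().rglob("*.ipynb"):
    nb = json.loads(path.read_text())
    records.append({
        "Content": notebook_to_markdown(nb),
        "Source": str(path),
        "Metadata": {"Last Modified": path.stat().st_mtime},
    })

Path("notebooks.json").write_text(json.dumps(records, indent=2))
```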