RAG.pro

State of the Art RAG over Unstructured Data

This is our take on the best basic pipeline for RAG over unstructured data:

Text Chunking

Recursive Text Splitter with child/parent document sizes (model- and use-case-dependent; a sketch follows the list below):

  • Child Documents: 100-500 tokens
  • Parent Documents: 1000-8000 tokens
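
A minimal sketch of this step, assuming LangChain's RecursiveCharacterTextSplitter with tiktoken-based token counting; the exact chunk sizes and the make_parent_child_chunks helper are illustrative choices within the ranges above:

```python
import uuid

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-based sizes (via tiktoken); tune these per model and use case.
parent_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=2000, chunk_overlap=0
)
child_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=300, chunk_overlap=50
)

def make_parent_child_chunks(markdown: str, url: str):
    """Split a page into parent chunks, then each parent into child chunks.

    Children carry a parent_id so a child hit can be resolved back to its parent.
    """
    parents, children = {}, []
    for parent_text in parent_splitter.split_text(markdown):
        parent_id = str(uuid.uuid4())
        parents[parent_id] = {"text": parent_text, "url": url}
        for child_text in child_splitter.split_text(parent_text):
            children.append({"text": child_text, "parent_id": parent_id, "url": url})
    return parents, children
```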

Embedding

Hybrid sparse/dense encoders:

  • Sparse: SPLADE (BM42 if you have less compute)
  • Dense, closed-source: text-embedding-3-small
  • Dense, open-source: e5 embeddings
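
A sketch of the open-source hybrid option, assuming the pinecone-text package for SPLADE and sentence-transformers for e5; the specific checkpoint (intfloat/e5-base-v2) and the passage/query prefixes are our assumptions:

```python
from pinecone_text.sparse import SpladeEncoder
from sentence_transformers import SentenceTransformer

sparse_encoder = SpladeEncoder()                             # SPLADE sparse vectors
dense_encoder = SentenceTransformer("intfloat/e5-base-v2")   # e5 dense vectors

def encode_chunks(texts: list[str]):
    # e5 models expect a "passage: " prefix for documents ("query: " for queries).
    dense = dense_encoder.encode(
        [f"passage: {t}" for t in texts], normalize_embeddings=True
    )
    # Each sparse vector is a dict: {"indices": [...], "values": [...]}.
    sparse = sparse_encoder.encode_documents(texts)
    return dense, sparse
```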

Vector DB

  • Pinecone hybrid (for SPLADE)
  • Qdrant hybrid (for BM42)
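
A sketch of upserting and querying hybrid vectors with the Pinecone Python client, reusing the encoders from the sketch above; the index name, API key, and metadata keys are placeholders, and the index is assumed to be created with the dotproduct metric (required for hybrid search):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("website-rag")  # assumed to exist, created with metric="dotproduct"

def upsert_children(children, dense, sparse):
    """Upsert child chunks with both dense values and SPLADE sparse values."""
    vectors = []
    for i, child in enumerate(children):
        vectors.append({
            "id": f"child-{i}",
            "values": dense[i].tolist(),
            "sparse_values": sparse[i],  # {"indices": [...], "values": [...]}
            "metadata": {
                "text": child["text"],
                "parent_id": child["parent_id"],
                "url": child["url"],
            },
        })
    index.upsert(vectors=vectors, batch_size=100)

def hybrid_query(question: str, top_k: int = 5):
    """Encode the question both ways and run a single hybrid query."""
    dense_q = dense_encoder.encode(f"query: {question}", normalize_embeddings=True)
    sparse_q = sparse_encoder.encode_queries([question])[0]
    return index.query(
        vector=dense_q.tolist(),
        sparse_vector=sparse_q,
        top_k=top_k,
        include_metadata=True,
    )
```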

Chain

LangGraph ReAct Agent

Recommended: Multi-Query search
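
The multi-query idea in a minimal sketch: run several rephrasings of the user question against the index and merge the matches, keeping the best score per chunk. It reuses the hypothetical hybrid_query helper from the Vector DB sketch above:

```python
def multi_query_search(queries: list[str], top_k: int = 5):
    """Run each query variant and merge results, de-duplicated by chunk id."""
    best = {}
    for q in queries:
        for match in hybrid_query(q, top_k=top_k).matches:
            if match.id not in best or match.score > best[match.id].score:
                best[match.id] = match
    return sorted(best.values(), key=lambda m: m.score, reverse=True)
```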

LLM

Claude Haiku or Sonnet 3.5

Detailed Description

Our most common use case is chatting with an entire website, so we'll use that as our example. First we scrape the entire site with Apify and send the resulting JSON to a text-processing script that drops all fields except 'url', 'markdown', 'ETAG', and 'id'.
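
A small sketch of that text-processing step, assuming the Apify dataset was exported as a JSON array of page records; the file paths and the filter_apify_export name are placeholders:

```python
import json

KEEP = {"url", "markdown", "ETAG", "id"}

def filter_apify_export(path_in: str, path_out: str) -> None:
    """Keep only the fields we need from each scraped page record."""
    with open(path_in) as f:
        records = json.load(f)
    cleaned = [{k: v for k, v in rec.items() if k in KEEP} for rec in records]
    with open(path_out, "w") as f:
        json.dump(cleaned, f, indent=2)
```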

The markdown is the 'content' that we split into parent/child documents. We do this so we can search over the child documents for more precise matches, but return the parent document to the model when doing RAG. This way we can match on a specific paragraph of a webpage and hand the entire page to the model as context. Then we embed the chunks and upsert them to Pinecone (Example)
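
To make the child-search / parent-return idea concrete, here is a sketch of the query-time step: search the child chunks, then return the de-duplicated parent documents that contain the matches. It assumes the parents mapping and the multi_query_search helper from the earlier sketches:

```python
def retrieve_parents(queries: list[str], parents: dict, top_k: int = 5) -> list[str]:
    """Search over child chunks, but hand whole parent documents back as context."""
    seen, docs = set(), []
    for match in multi_query_search(queries, top_k=top_k):
        parent_id = match.metadata["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            docs.append(parents[parent_id]["text"])
    return docs
```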

Then we set up our LangGraph ReAct agent with the vector store as a tool. We give it a solid prompt and instruct it to perform multiple queries per search. We prefer Haiku for testing because of its excellent cost/performance ratio, and with that we have a great RAG demo set up.
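
A sketch of the agent wiring with LangGraph's prebuilt ReAct agent and Claude Haiku via langchain-anthropic; the tool body reuses the hypothetical retrieve_parents helper and parents mapping from the sketches above, and the prompt wording is just an example:

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def search_website(queries: list[str]) -> str:
    """Search the indexed website. Pass 2-4 differently phrased versions of the question."""
    return "\n\n---\n\n".join(retrieve_parents(queries, parents))

SYSTEM_PROMPT = (
    "You answer questions about the website using the search_website tool. "
    "Always call it with several rephrasings of the user's question, "
    "and cite the pages you used."
)

llm = ChatAnthropic(model="claude-3-haiku-20240307")
# `prompt` is the system prompt (older langgraph versions call this state_modifier).
agent = create_react_agent(llm, tools=[search_website], prompt=SYSTEM_PROMPT)

result = agent.invoke({"messages": [("user", "What does the site say about pricing?")]})
print(result["messages"][-1].content)
```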

To make a great RAG application for production, we recommend attaching plenty of metadata to your data and filtering on that metadata when querying. This becomes very use-case specific, but we will try to put together a more detailed overview in the future.
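
As a sketch of what that looks like, Pinecone accepts a filter dict at query time; the keys used here (lang, section) and their values are made-up examples, and index, dense_encoder, and sparse_encoder come from the earlier sketches:

```python
def filtered_hybrid_query(question: str, section: str, top_k: int = 10):
    """Hybrid query restricted to child chunks whose metadata matches the filter."""
    dense_q = dense_encoder.encode(f"query: {question}", normalize_embeddings=True)
    sparse_q = sparse_encoder.encode_queries([question])[0]
    return index.query(
        vector=dense_q.tolist(),
        sparse_vector=sparse_q,
        top_k=top_k,
        include_metadata=True,
        # Only consider English chunks from the requested section of the site.
        filter={"lang": {"$eq": "en"}, "section": {"$eq": section}},
    )
```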