Wednesday, July 23, 2025

The Pursuit of Grounded Truths: A Primer on RAG

I’ve found that many concepts in computer science are simple ideas that are complex to implement. RAG (or “retrieval-augmented generation”) is an intuitive concept, but one that quickly gets shrouded in complex, domain-specific terminology. I hope to open up that black box in this post.

The core issue that RAG solves is the LLM’s crisis of confidence. ChatGPT is liable to “hallucinate” answers, which makes the tool a non-starter in most business use cases. The underlying issue: an LLM probabilistically picks the best next word in a sentence, and it feels compelled to answer something, even if it’s wrong. Thus, much research attention has shifted to preventing LLMs from hallucinating, which is where RAG steps in.

Accuracy through atomicity (i.e. small, trivial tasks)

The entire framework of RAG relies on a simple idea: break large tasks into small, trivial ones. That’s it. RAG accomplishes this with small, specific LLM calls. The user inputs (1) a question and (2) a small chunk of text, and asks the LLM to pull the answer from that text. The idea is that by giving the LLM clear guidelines and a trusted text to search, we can trust the response. This serves as the smallest indivisible unit that the rest of RAG is built upon.

Here’s an example of what this type of query looks like (you can copy the “context” and “question” into ChatGPT yourself to prove that this works!) [1]:

Prompt (which is included in the ChatGPT query):

You are a helpful assistant. Given the context below and no prior knowledge, answer the question. If you are not sure of the answer, respond “I’m not sure.”

Context: William Jefferson Clinton (né Blythe III; born August 19, 1946) is an American politician and lawyer who was the 42nd president of the United States from 1993 to 2001. A member of the Democratic Party, he previously served as the attorney general of Arkansas from 1977 to 1979 and as the governor of Arkansas from 1979 to 1981, and again from 1983 to 1992. Clinton, whose policies reflected a centrist "Third Way" political philosophy, became known as a New Democrat.


Question: What years did Bill Clinton serve as president?

ChatGPT Response: Bill Clinton served as president from 1993 to 2001.


Question: How many children did Bill Clinton have?

ChatGPT Response: I’m not sure.

These “low-level” queries can be adapted to different use cases, but the core idea remains the same. For example, if we want ChatGPT to include citations, we can just ask; I’ve included an example in the Appendix.

There’s an elegance to how elementary-school simple this process is. These use cases feel trivial, but this near-guarantee of accuracy lets us architect a system around it. For me, it also dispels the notion that systems built on top of LLMs are magic. Instead, RAG systems are just a bundle of simple ideas that together provide real value.
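To make this concrete, here is a minimal sketch of that atomic query as code, using the OpenAI Python SDK. The model name, function name, and exact prompt wording are illustrative assumptions, not a prescription:

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

def answer_from_context(question: str, context: str) -> str:
    # Mirror the prompt above: answer only from the provided context,
    # and admit uncertainty rather than guessing.
    prompt = (
        "You are a helpful assistant. Given the context below and no prior "
        "knowledge, answer the question. If you are not sure of the answer, "
        'respond "I\'m not sure."\n\n'
        f"Context: {context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content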

Chunking: breaking large documents (or many documents) into ChatGPT-able pieces

Let’s say we’re given a 100-page piece of text (e.g. a public company’s 10-K). LLMs have a maximum number of words they can take in at once; you can’t pass a whole book to ChatGPT and expect a good answer. (OpenAI’s GPT-4 has an ~8K token maximum for its APIs.)

Another, similar example: let’s say we have 20 documents (5 pages each) for a given company. Again, we can’t pass in all 100 pages at once with a question; there are simply too many words for the LLM to take in at once.

The solution that RAG uses is again simple: (1) break the documents into pieces, then when a question is asked, (2) find the best chunks of data to look at and (3) pass them into the above ChatGPT prompt. A stylized example:

    Input: 100-page document

    Chunking:
        Chunk 1: pages 1-5
        Chunk 2: pages 6-10
        …
        Chunk 20: pages 96-100

    Find the best chunks of data:
        Chunk 7
        Chunk 10

    Pass them into the LLM:
        Answer the question below only using the data provided: {Chunk 7} {Chunk 10} …
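A naive version of the chunking step might look like the sketch below. Splitting on a fixed word count with a small overlap is an assumption chosen for illustration; as discussed later, real systems often split by section or sentence instead:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking by word count, with a small overlap so that
    # sentences straddling a boundary show up in both neighboring chunks.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]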

Retrieval: finding the right chunk of text to look at

In practice, there could be hundreds or thousands of documents, each with hundreds of pages, leaving us with millions of chunks of text. The challenge then becomes: of these millions of chunks, which 10 chunks should we pass to ChatGPT with our question?

This is a classic “search” problem, almost identical to Googling for information. Given a query (e.g. “What is Microstrategy’s business strategy?”) and a corpus of documents (e.g. 10-Ks, 10-Qs, earnings calls, analyst reports), which pieces are most relevant?

The naïve way to solve this problem is to count how many words from the input question appear in each chunk of data. [2] This is eerily reminiscent of pre-Google search engines [3], which is a signal that it may not be the optimal strategy.
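Here is a sketch of that keyword-counting approach, with hypothetical function names chosen for illustration (a production system would more likely use something like BM25, per footnote [2]):

def keyword_score(question: str, chunk: str) -> int:
    # Count how many distinct words from the question also appear in the chunk.
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def top_chunks(question: str, chunks: list[str], k: int = 10) -> list[str]:
    # Return the k chunks with the highest keyword overlap.
    return sorted(chunks, key=lambda c: keyword_score(question, c), reverse=True)[:k]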

Encoding and vectorization: a more complex, LLM-driven way to find relevant chunks of data

The more modern solution is to use the LLM’s own capabilities to find matching chunks – i.e. “embedding” or “vectorization.”

A layman’s summary: at a low level, LLMs convert every word into a series of numbers (“vectors”), and words that are similar in meaning (such as “dog” and “puppy”) have vectors that are close together. This is essentially what the super expensive LLM training is: use every conceivable source of text on the internet, and train the LLM to convert words to vectors. This same “vectorization” concept is applied to our input question and the chunks of data: the question and chunks of data are “embedded” into vectors, allowing us to find chunks that are most similar to – and thus most likely to answer – the input question.

One gnawing question I had: why bother with this vectorization stuff at all? I believe vectorization takes advantage of the massive amount of training the LLM provider has already done, and it allows the RAG system to match semantically similar words (e.g. “dog” and “canine”). It’s also quick and easy to do: most LLM providers have an API that lets you vectorize content cheaply.
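Here is a sketch of embedding-based retrieval using the OpenAI embeddings API and cosine similarity. The embedding model name and the helper names are illustrative assumptions:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Convert each piece of text into a vector via a hosted embedding model.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def most_similar(question: str, chunks: list[str], k: int = 10) -> list[str]:
    # Embed the question and all chunks, then rank chunks by cosine similarity.
    vectors = embed([question] + chunks)
    q, c = vectors[0], vectors[1:]
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]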

RAG, soup to nuts

The complete RAG process looks something like this:

Process the data
    - Find and ingest data
    - Chunk the data
    - Encode the chunks into “vectors”
    - Store the vectors in a database

Find relevant data chunks
    - Given a user question:
        - Embed the user question into a vector
        - Find the vectors in the database that are closest to the question

Ask the LLM to answer the question
    - We now have (a) the user question and (b) the relevant chunks of text
    - Prompt the LLM, and return the answer to the user
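Stitching the earlier sketches together, a toy end-to-end pipeline might look like the following. It reuses the hypothetical chunk_text, embed, and answer_from_context helpers defined above; a real system would persist the vectors in a vector database rather than hold them in memory:

import numpy as np

def build_index(documents: list[str]) -> tuple[list[str], np.ndarray]:
    # Offline step: chunk every document, then embed each chunk.
    chunks = [c for doc in documents for c in chunk_text(doc)]
    return chunks, embed(chunks)

def ask(question: str, chunks: list[str], vectors: np.ndarray, k: int = 10) -> str:
    # Online step: embed the question, find the k nearest chunks, prompt the LLM.
    q_vec = embed([question])[0]
    sims = (vectors @ q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec))
    best = np.argsort(sims)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in best)
    return answer_from_context(question, context)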

RAG: challenging in practice

As with most things in computer science, RAG is a simple idea with copious challenges in implementation. Amongst the challenges developers actively face:

  • Data ingestion – converting and cleaning documents is hard (e.g. converting a PDF to text, then removing headers/footers, cleaning tables, etc.); audio, graphs, and images are harder still
  • Chunking strategy – deciding how to split large files is hard (do you chunk by section? By sentence? By word?), and how you chunk can drastically impact the accuracy of results later
  • Retrieval strategy – embedding/vectorization isn’t always the best strategy; there are a few different algorithms and a few different vector databases to choose from
  • Chaining queries together – open-source RAG frameworks like LlamaIndex (a basic usage sketch follows this list) let you easily chain queries together. For example, if a user enters a complex query that spans multiple documents (e.g. “How did 2022 and 2023 revenue compare?”), the framework will use an LLM to break the query into simpler ones and then aggregate the answers
  • User interface – most start-ups have defaulted to letting users interact with documents in a chat-style manner. But is this truly the best way?
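For reference, basic LlamaIndex usage looks roughly like the sketch below. This assumes a recent release where the core classes are importable from llama_index.core and an OpenAI API key is configured (the exact imports change between versions); the library handles chunking, embedding, storage, retrieval, and prompting behind these few lines. The “data” folder path and the query are placeholders:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder, then chunk and embed them
# into an in-memory vector index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the most relevant chunks and prompt the LLM in one call.
query_engine = index.as_query_engine()
print(query_engine.query("What is Microstrategy's business strategy?"))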

For a company looking to use RAG, there are also copious other questions (e.g. buy vs. build). I plan to tackle these sorts of questions, as well as a prototype, in future posts.

 

Appendix

Sample ChatGPT Prompt with Citations [4]

Context:

Please provide an answer based solely on the provided sources. When referencing information from a source, cite the appropriate source(s) using their corresponding numbers. Every answer should include at least one source citation. Only cite a source when you are explicitly referencing it. If none of the sources are helpful, you should indicate that.

For example:

Source 1: The sky is red in the evening and blue in the morning.

Source 2: Water is wet when the sky is red.

Query: When is water wet?

Answer: Water will be wet when the sky is red [2], which occurs in the evening [1].

Now it’s your turn. Below are several numbered sources of information:

------

Source 1: Born and raised in Arkansas, Clinton graduated from Georgetown University in 1968, and later from Yale Law School, where he met his future wife, Hillary Rodham.

Source 2: After graduating from law school, Clinton returned to Arkansas and won election as state attorney general, followed by two non-consecutive tenures as Arkansas governor.

------

 

Question: Which schools did Bill Clinton attend?

ChatGPT Response: Bill Clinton graduated from Georgetown University in 1968 and later from Yale Law School [1].

 



[1] Text below taken from Bill Clinton - Wikipedia

[2] One real-world example: the BM25 algorithm, which can be used in LlamaIndex

[3] See: Google: The Complete History and Strategy from the Acquired podcast
