Thursday, July 24, 2025

Dev Design: Investment Research "Agent"

PROBLEM SUMMARY [1]

Companies are awash in information – both public and private – but have a hard time interacting with it all effectively. RAG has been proposed as a way to organize and chat with private data, but there are a few key challenges:

Problem #1: User interface and integration

Most RAG systems’ user interface consists of a chat box, which “feels” natural and infinite for the end user. However, I think this only captures one use case, the “inquisitive” mode. It’s less effective when you have a more structured approach (e.g. a “research framework”), or in cases where you’re not sure what to ask. Sometimes you need a 45-minute lecture to jostle the questions out of your brain.

A bit of a tangent, but I think this is something start-ups today are not doing well. Start-ups are focused on specific pieces of the end-to-end chain (e.g. UnstructuredIO) or spinning up a generalized business model (e.g. RAG-as-a-service).

Problem #2: Ability to prioritize sources

Most knowledge management systems don’t have an easy way to prioritize some voices over others. From an investment research lens: I’d likely prioritize my own firm’s investment memo over another firm’s, I’d value the financials in the 10-K over those from another source, and I’d trust an article from the Financial Times over a clickbait-y Business Insider one. This level of discernment is a critical part of the knowledge aggregation process, but it isn’t an option in most software.

Problem #3: Ingesting data files effectively

I’ve alluded to this in my personal fight with ingesting PDFs, but taking “unstructured” documents (like PDFs that have tables, graphs, images, etc.) and converting them into “structured” text is a challenging task with a long tail of corner cases. Reddit is replete with start-ups trying to tackle this problem, without one clear winner. A couple common scenarios that are challenging: tables that don’t have borders (very common! [2]) and graphs.

Problem #4: Privacy and data sovereignty

Unsurprisingly, many companies have private data that they want to keep private – but start-up vendors want some sort of ownership of the data. For example, LLM vendors (like ChatGPT) have public APIs with promises not to use your private data (which is hard to believe coming from companies that (a) are running out of publicly available data and (b) have a profit motive to use your data). Most RAG vendors want you to house your documents on their servers (which can be architected to be virtually private). And industry-specific vendors (such as in investment research) have a vested interest in looking at your data, if only to aggregate the results later.

This problem seems identical to the one faced in healthcare software over who owns the patients’ records. Epic, one of the largest healthcare software companies, has stood firm that hospitals own the patient data, not Epic. It means one fewer line on the income statement (bad for short-term profit), but builds trust with the hospitals we’ve worked with (good for long-term revenue). Most healthcare start-ups today look for ways to monetize the data, a tendency I could see playing out in the LLM/RAG space, too.

Problem #5: Cost and vendor lock-in

My cynical take on the start-up software world is that it (a) finds product-market fit by solving a need, (b) makes itself “sticky” in some way so that (c) it can jack up the prices later. As a software vendor, “stickiness” is a key feature that can later justify price increases, but from a customer’s perspective, this looks more like “lock-in” that holds us hostage to a vendor that we might come to despise.

(Again, these issues parallel the healthcare software world. I know clinics that are locked into multi-year contracts with a health record system that they hate, and I work with hospital systems that spend hundreds of thousands of dollars – or more! – to switch from a legacy system to Epic.)

 

SOLUTION

Version 1.0

My plan is to create an online research web app that begins to address Problems #1 and #2. Initial features:

  • User interface
    • Investment framework: when you search for a company, it will use the available data to pre-populate an investment framework
    • Chat-to-learn-more: you’ll have the ability to ask deeper questions of the key sources (powered by AI)
    • This combination feels like it will be the most usable long-term – almost like a “lecture with Q&A”
  • Filtering: data sources will be tagged by company, source, etc.
  • Prioritization: data sources will be “marked” with a prioritization level (a rough sketch of this follows below)
(You can call this solution “AI-native,” “agentic,” “privacy-first,” and “user-centric” if you’d like.)
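
To make the filtering and prioritization bullets concrete, here’s a minimal sketch of how tagged sources might be represented. This is an illustration under my own assumptions – the class, field names, and priority scale are hypothetical, not a finished design:

```python
from dataclasses import dataclass, field

@dataclass
class SourceDocument:
    """Hypothetical metadata attached to each ingested document."""
    company: str                  # which company the document is about
    source_type: str              # e.g. "10-K", "internal memo", "news article"
    priority: int                 # 1 = most trusted (our memo, the 10-K), 3 = least (clickbait)
    text: str = ""
    tags: list[str] = field(default_factory=list)

def filter_and_rank(docs: list[SourceDocument], company: str) -> list[SourceDocument]:
    """Keep only the requested company's documents, most trusted sources first."""
    matches = [d for d in docs if d.company == company]
    return sorted(matches, key=lambda d: d.priority)
```

The point is simply that prioritization becomes a first-class field on every source, so the retrieval step can prefer (or exclusively use) the most trusted documents.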

This version will be built to be vendor-agnostic – i.e. no lock-in to any LLM (Problem #5). (LLMs are a commodity after all, aren’t they?) I’ll also only include a couple of companies, to reduce my own personal cost – each encoded document and each query costs fractions of a penny, which quickly add up. I plan to add a public company, a private company, and an investment fund. One of the biggest challenges here will be getting the RAG system to work well – and it will likely be imperfect!
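
What vendor-agnostic might look like in practice: a thin interface that the rest of the app talks to, so swapping one provider for another (or for a local model, per Problem #4) is a one-line change. A sketch under my own assumptions – the class names are hypothetical, and the OpenAI call assumes the official openai Python package with an API key in the environment:

```python
from typing import Protocol

class LLMClient(Protocol):
    """Any backend (OpenAI, Anthropic, a local model) only needs to provide this one method."""
    def complete(self, prompt: str) -> str: ...

class OpenAIClient:
    def __init__(self, model: str = "gpt-4o-mini"):  # model name is a placeholder
        from openai import OpenAI  # assumes the official openai package
        self._client = OpenAI()
        self._model = model

    def complete(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

class LocalLLMClient:
    """Placeholder for a locally hosted model – the Version 2.0 privacy experiment."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("swap in your local inference call here")

def answer(llm: LLMClient, question: str, context: str) -> str:
    # The app only ever depends on LLMClient, never on a specific vendor.
    return llm.complete(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```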

Configuration options

From what I’ve learned from the development design arc at Epic: less is more when it comes to configuration settings. The imperatives:

  • Input for company you’re looking at
  • Ability to ask follow-up questions
  • Maybe settings (or maybe these can just automagically work): ability to select data sources


Version 2.0

Version 2.0 will extend this:

  • Increase RAG accuracy
  • Add more companies
  • Add ways to retrieve data from and process outside sources
  • Add user ability to prioritize sources (or maybe delegate this task to an LLM?)
  • Test running this with a local LLM (Problem #4)

As we take in more data sources, the data ingestion problem (Problem #3) will become more of an issue. (In Version 1.0, I’ll manually clean the sources.) This problem will likely be delegated to a third-party vendor. There’s tons of them that are effective (I’ve been impressed by Morphik in a quick test) and cheap (fractions of a penny per page).

 

The ultimate goal

Ultimately, I’m hoping to build a prototype that is:

  • Usable for the long term – i.e. integrates well with existing workflows,
  • Accurate,
  • Explainable – i.e. as few black boxes as possible,
  • Capable of privacy, and
  • Relatively inexpensive.

I foresee larger companies being able to build this structure in-house (and having a vested interest in doing so!) – why trust outsiders with your own information? I hope to iron out some of the technological wrinkles and go into depth on them in the next few posts.



[1] Developments at Epic require a development design, which roughly follows this format (with a few more questions, and a few more technical details). I've spared you from too many details, but found this format useful for thinking about the whole point of a development: the problem and the solution.

[2] Pdfplumber is a commonly cited Python library that can handle tables… but it has a really hard time detecting tables that don’t have lines. See: the cited Masters thesis for why this is so tough.

Wednesday, July 23, 2025

The Pursuit of Grounded Truths: A Primer on RAG

 I’ve found that many concepts in computer science are simple ideas that are complex to implement. RAG (or “retrieval-augmented generation”) is a concept that is intuitive – but quickly gets shrouded by complex, domain-specific terminology. I hope to unravel this black box in this post.

The core issue that RAG solves is LLMs’ crisis of confidence. ChatGPT is liable to “hallucinate” answers, which makes the tool a non-starter in most business use cases. The underlying issue: LLMs are trying to probabilistically pick the best next word in a sentence, and they feel compelled to answer something, even if it’s wrong. Thus, a focus of computer science research has shifted to preventing the LLM from hallucinating – which is where RAG steps in.

Accuracy through atomicity (i.e. small, trivial tasks)

The entire framework of RAG relies on a simple idea: break large tasks into small, trivial ones. That’s it. The way RAG accomplishes this is with small, specific LLM calls. The user inputs (1) a question and (2) a small chunk of text, and asks the LLM to get the answer from the text. The idea is that by giving the LLM clear guidelines and a trusted text to search from, we can trust the response. This serves as the smallest indivisible unit that the rest of RAG is built upon.

Here’s an example of what this type of query looks like (you can copy the “context” and “question” into ChatGPT yourself to prove that this works!) [1]:

Context (which is included in the ChatGPT query):

You are a helpful assistant. Given the context below and no prior knowledge, answer the question. If you are not sure of the answer, respond “I’m not sure.”

Context: William Jefferson Clinton (né Blythe III; born August 19, 1946) is an American politician and lawyer who was the 42nd president of the United States from 1993 to 2001. A member of the Democratic Party, he previously served as the attorney general of Arkansas from 1977 to 1979 and as the governor of Arkansas from 1979 to 1981, and again from 1983 to 1992. Clinton, whose policies reflected a centrist "Third Way" political philosophy, became known as a New Democrat.


Question: What years did Bill Clinton serve as president?

ChatGPT Response: Bill Clinton served as president from 1993 to 2001.


Question: How many children did Bill Clinton have?

ChatGPT Response: I’m not sure.

These “low-level” queries can be modified to different use cases, but the core idea remains. For example, if we want ChatGPT to include citations, we can just ask; I’ve included an example in the Appendix.

There’s an elegance to how elementary-school simple this process is. These use cases feel trivial, but this near-guarantee of accuracy allows us to architect a system around it. For me, it also dispels the notion that the systems built on top of LLMs are magic. Instead, RAG systems are just a bundle of simple ideas, which together provide a real value.
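
For completeness, here’s what that same atomic call looks like programmatically. This is a sketch, not production code – it assumes the official openai Python package, an API key in the environment, and a placeholder model name:

```python
from openai import OpenAI  # assumes the official openai package; reads OPENAI_API_KEY from the environment

client = OpenAI()

SYSTEM = (
    "You are a helpful assistant. Given the context below and no prior knowledge, "
    "answer the question. If you are not sure of the answer, respond \"I'm not sure.\""
)

def grounded_answer(context: str, question: str, model: str = "gpt-4o-mini") -> str:
    """One atomic RAG call: a question plus a small, trusted chunk of text."""
    resp = client.chat.completions.create(
        model=model,  # model name is a placeholder; any chat model works
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
        temperature=0,  # keep answers as repeatable as possible
    )
    return resp.choices[0].message.content

# grounded_answer(clinton_paragraph, "How many children did Bill Clinton have?")
# should come back with "I'm not sure."
```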

Chunking: breaking large documents (or many documents) into ChatGPT-able pieces

Let’s say we’re given a 100-page piece of text (e.g. a public company’s 10-K). LLMs have a maximum number of words that they can take in at once – you can’t pass in a whole book to ChatGPT and expect it to have a good answer. (OpenAI’s GPT-4 has an ~8K token maximum for its APIs.)

Another similar example: let’s say we have 20 documents (5 pages each) for a given company. Again, we can’t pass in all 100 pages at once with a question – there are simply too many words for the LLM to take in at once.

The solution that RAG uses is again simple: (1) break the documents into pieces, then when a question is asked, (2) find the best chunks of data to look at and (3) pass them into the above ChatGPT prompt. A stylized example:

  • Input: 100-page document
  • Chunking:
    • Chunk 1: pages 1-5
    • Chunk 2: pages 6-10
    • …
    • Chunk 20: pages 96-100
  • Find the best chunks of data:
    • Chunk 7
    • Chunk 10
  • Pass them into the LLM:
    • “Answer the question below only using the data provided: {Chunk 7} {Chunk 10} …”
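
A naive chunker really is this simple. Here’s a rough sketch – the character counts, chunk size, and overlap are arbitrary choices of mine; real systems often chunk by tokens, sentences, or document sections instead:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking by character count, with overlap so ideas aren't cut cleanly in half."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 100-page document (very roughly 300,000 characters) becomes a few hundred overlapping chunks.
```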

Retrieval: finding the right chunk of text to look at

In practice, there could be hundreds or thousands of documents, each with hundreds of pages, leaving us with millions of chunks of text. The challenge then becomes: of these millions of chunks, which 10 chunks should we pass to ChatGPT with our question?

This is a classic “search” problem, almost identical to Googling for information. Given a bunch of keywords (e.g. “What is MicroStrategy’s business strategy?”) and a corpus of documents (e.g. 10-Ks, 10-Qs, earnings calls, analyst reports), which pieces are most relevant?

The naïve way to solve this problem is by counting how many times words from the input question match each chunk of data. [2] This is eerily reminiscent of search engines pre-Google [3] – and thus a signal that this may not be the optimal strategy.
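
In code, the naive word-matching approach is only a few lines. This is just an illustrative sketch – the scoring is deliberately crude, and a real keyword search would at least use something like BM25 (see footnote [2]):

```python
def keyword_score(question: str, chunk: str) -> int:
    """Count how many distinct words the question and the chunk share."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))

def top_chunks_by_keywords(question: str, chunks: list[str], k: int = 10) -> list[str]:
    """Return the k chunks that share the most words with the question."""
    return sorted(chunks, key=lambda c: keyword_score(question, c), reverse=True)[:k]
```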

Encoding and vectorization: a more complex, LLM-driven way to find relevant chunks of data

The more modern solution is to use the LLM’s own capabilities to find matching chunks – i.e. “embedding” or “vectorization.”

A layman’s summary: at a low level, LLMs convert every word into a series of numbers (“vectors”), and words that are similar in meaning (such as “dog” and “puppy”) have vectors that are close together. This is essentially what the super expensive LLM training is: use every conceivable source of text on the internet, and train the LLM to convert words to vectors. This same “vectorization” concept is applied to our input question and the chunks of data: the question and chunks of data are “embedded” into vectors, allowing us to find chunks that are most similar to – and thus most likely to answer – the input question.

One gnawing question I had: why bother with this vectorization stuff at all? I believe “vectorization” takes advantage of the massive amount of training that the LLM provider has already done and allows the RAG system to match semantically similar words (e.g. “dog” and “canine”). It’s also quick and easy to do – most LLM providers have an API that allows you to vectorize content for cheap.
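
Here’s a sketch of what embedding-based retrieval looks like, using OpenAI’s embeddings endpoint and plain cosine similarity. The model name is a placeholder, and in a real system the chunk vectors would be computed once and stored in a vector database rather than re-embedded on every question:

```python
import numpy as np
from openai import OpenAI  # assumes the official openai package and an API key in the environment

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Turn each piece of text into a vector via the provider's embedding API."""
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in resp.data])

def top_chunks_by_similarity(question: str, chunks: list[str], k: int = 10) -> list[str]:
    chunk_vecs = embed(chunks)           # in practice: precomputed and stored in a vector DB
    question_vec = embed([question])[0]
    # Cosine similarity: how "close in meaning" each chunk is to the question.
    sims = chunk_vecs @ question_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(question_vec)
    )
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]
```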

RAG, soup to nuts

The complete RAG process looks something like this:

  • Process the data
    • Find and ingest data
    • Chunk the data
    • Encode the chunks into “vectors”
    • Store the vectors into a database
  • Find relevant data chunks
    • Given a user question:
      • Embed the user question into a vector
      • Find the closest vectors in the database to the user’s question
  • Ask the LLM to answer the question
    • We now have (a) a user question and (b) relevant chunks of text
    • Prompt the LLM, and return the answer to the user
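
Open-source frameworks bundle this whole loop into a few lines. As a hedged illustration, here’s roughly what the soup-to-nuts flow looks like in LlamaIndex (mentioned below) – exact import paths and defaults vary by version, and the directory path and question are placeholders:

```python
# Assumes the llama-index package is installed and an OpenAI API key is set in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Process the data: ingest, chunk, embed, and store (handled by the framework's defaults).
documents = SimpleDirectoryReader("data/company_filings").load_data()
index = VectorStoreIndex.from_documents(documents)

# Find relevant chunks and ask the LLM, in a single call.
query_engine = index.as_query_engine(similarity_top_k=10)
response = query_engine.query("What is the company's business strategy?")
print(response)
```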

RAG: challenging in practice

As with most things in computer science, RAG is a simple idea with copious challenges in implementation. Amongst the challenges developers actively face:

  • Data ingestion – converting documents and cleaning them can be challenging (e.g. converting a PDF to text, then removing headers/footers, cleaning tables, etc.; converting audio, graphs, and images can also be challenging)
  • Chunking strategy – deciding on how to split large files up is challenging (e.g. do you chunk by section? By sentence? By word?), and how you chunk may drastically impact accuracy of results later
  • Retrieval strategy – embedding/vectorization isn’t always the best strategy; there are a few different algorithms and a few different vector databases that can be used.
  • Chaining queries together – open-source RAG frameworks like LlamaIndex allow you to easily chain queries together. For example, if a user enters a complex query that spans multiple documents (e.g. “How did 2022 and 2023 revenue compare?”), the framework will use an LLM to break the query into simpler ones that it then aggregates together. (A rough sketch of this idea follows this list.)
  • User interface – most start-ups have defaulted to letting users interact with documents in a chat-style manner. But is this truly the best way?
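
To demystify query chaining: it’s just more small LLM calls stacked on top of each other. A hedged, plain-Python sketch of the idea – the prompts are mine, and retrieve is assumed to be whatever chunk-retrieval function you already have (frameworks like LlamaIndex wrap this pattern up for you):

```python
from typing import Callable
from openai import OpenAI  # assumes the official openai package and an API key in the environment

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_complex_question(question: str, retrieve: Callable[[str], list[str]]) -> str:
    """Break a multi-part question into simple ones, answer each against retrieved chunks, then combine."""
    subs = ask(
        "Break this question into simple sub-questions, one per line, and output nothing else:\n"
        + question
    ).splitlines()
    partials = []
    for sub in (s.strip() for s in subs if s.strip()):
        context = "\n".join(retrieve(sub))
        partials.append(f"Q: {sub}\nA: " + ask(
            f"Answer using only this context:\n{context}\n\nQuestion: {sub}"
        ))
    return ask(
        f"Combine these partial answers into one response to '{question}':\n\n" + "\n\n".join(partials)
    )
```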

For a company looking to use RAG, there are also copious other questions (e.g. buy vs. build). I plan to tackle these sorts of questions, as well as a prototype, in future posts.

 

Appendix

Sample ChatGPT with Citations [4]

Context:

Please provide an answer based solely on the provided sources. When referencing information from a source, cite the appropriate source(s) using their corresponding numbers. Every answer should include at least one source citation. Only cite a source when you are explicitly referencing it. If none of the sources are helpful, you should indicate that.

For example:

Source 1: The sky is red in the evening and blue in the morning.

Source 2: Water is wet when the sky is red.

Query: When is water wet?

Answer: Water will be wet when the sky is red [2], which occurs in the evening [1].

Now it’s your turn. Below are several numbered sources of information:

------

Source 1: Born and raised in Arkansas, Clinton graduated from Georgetown University in 1968, and later from Yale Law School, where he met his future wife, Hillary Rodham.

Source 2: After graduating from law school, Clinton returned to Arkansas and won election as state attorney general, followed by two non-consecutive tenures as Arkansas governor.

------

 

Question: Which schools did Bill Clinton attend?

ChatGPT Response: Bill Clinton graduated from Georgetown University in 1968 and later from Yale Law School [1].

 



[1] Text below taken from Bill Clinton - Wikipedia

[2] One real-world example: the BM25 algorithm, which can be used in LlamaIndex

[3] See: Google: The Complete History and Strategy from the Acquired podcast

Tuesday, July 22, 2025

LLMs and Investment Research Agents

In the past month, I’ve spent a lot of time focused on raw data (see: SEC filings, reading PDFs), and so it’s a good time to zoom out and look at the bigger picture. I think there are a few big questions:

  1. What can LLMs do that couldn’t be done before?
  2. Where can LLMs provide real value in the investment world?

How humans glean insights from data

Almost any research project can be generalized into a few steps:

  1. Data gathering – find trusted sources of data
  2. Data extraction – process the data from the source documents
  3. Data storage – save the data somewhere, in case we need it again
  4. Data analysis – slice and dice the data until you find something interesting
  5. Conclusion – report on the results

For most things, we don’t even think about these being individual steps. For example, if you’re researching a public company for a personal portfolio, “data gathering” is Googling, “data extraction” is reading a few articles, and “data analysis” is simply “thinking.” Another example: if you’re deciding on what car to buy, “data gathering” may mean finding the articles (e.g. Consumer Reports) or online forums (e.g. Reddit) that you trust the most, and “data analysis” means weighting the trustworthiness of each data source in your own head. A broader example: a city policy maker may “gather data” through city statistics but also by schmoozing with city business leaders, “data analysis” is a complex synthesis of the two, and the “conclusion” might be a new policy.

Why bother breaking these steps out? While intuitive at a small scale, these steps become important as you scale up. Most tech companies now have separate teams for data extraction/storage (“data engineering”) and data analysis (“data analyst”). Similarly, as we look to delegate tasks to robotic “agents,” we can’t just tell them: “Do this complex task.” Software developers have to break down complex problems into bite-sized components, closely mirroring the way that we humans think. If (and when) the software fails, we can take a closer look at each step: why did this fail? And was the approach we took here the right one for the job?

Additive value of LLMs in the research process

I keep finding myself coming back to the fundamental question: what changed with LLMs? What problems do they solve that couldn’t be done before?

In my view, the primary differentiating benefit of LLMs is the ability to (a) interact with the computer using truly natural language and (b) extract and analyze qualitative data (i.e. text, images, etc.). ChatGPT made such a big splash because of its ability to string together words into “fluent” English, but we’re coming to realize that its true value comes from its ability to process text.

It helps to view LLMs through the lens of the larger research process above. LLMs are excellent at “data extraction” (ingesting text) and good at some aspects of “data analysis” (regurgitating text). For example, if you have a list of 50 articles, an LLM is great at “reading” them all and summarizing the key points. Likewise, if you have a public company’s 10-K or a long legal document, an LLM should be great at pulling out details from it. They are getting better at “data gathering,” too; for complex or ambiguous questions, ChatGPT outperforms an old-fashioned Google search.

However, I believe humans still hold the edge in the other aspects of the research process (for now). Experts know where the best data is, how much to trust each source, and how to tie the pieces together. LLMs can – and should! – help with each step of the process, but in an inherently unpredictable, non-deterministic world, we’ll continue to rely on humans to make the consequential final decisions (“data analysis” and “conclusion”).

Investment research agents

As with LLMs, I find myself wondering: is there anything new with the advent of “AI agents,” or is the term just a brilliant stroke of VC marketing? At their core, all agents do is add (a) automation and (b) action-oriented APIs, stringing together a sequence of steps that creates a mirage of computerized life. I don’t think there’s any truly novel technology here, though; instead, AI agents’ emergence is about ease, scale, and ubiquity. It’s become almost easy for companies to adopt useful “AI agents” that provide tangible value (by building them themselves or hiring a start-up).

Likewise, it’s easy to see how agents could add immediate value in an investment office. For example:

  • Company analysis – given primary sources (10-Ks, 10-Qs, earnings calls, etc.), extract key metrics and an understanding of the business
  • Financial data extraction from audited reports – given quarterly updates or audited financials, extract the text and data into an internal database
  • Investment fund/founder due diligence – given the reams of public data about a person or investment fund (e.g. podcasts), build a one-pager
  • Investment pitch deck analysis – given an inbound pitch deck, partially fill out an investment framework (leaving gaps where we still need to ask questions)

These types of agents are starting to gain traction. LlamaIndex, one of the leading open-source agentic frameworks, advertises “faster data extraction, enhanced accuracy and consistency, and actionable insights” for finance workflows. One of its testimonials comes from Carlyle, which built out an LBO agent to automatically fill structured outputs from unstructured 10-Ks and earnings decks.

A 5-year plan for LLM-infused investment offices

I recently read about a solo VC GP (Sarah Smith) who’s built an “AI-native firm that can deliver 10x value in 1/10 of the time.” If we take this at face value, it represents a new, tech-forward model of how smaller offices (such as endowments and foundations) can operate with a lean team.

I think the small investment office of the future will need a couple of key things:

  1. LLM-driven investment process. An AI agent should be able to start with a list of “most trusted sources” (e.g. Bloomberg, SEC EDGAR, Pitchbook, internal notes) and then branch out (e.g. via a self-driven Google search) to output a strong first pass of due diligence. It’s then up to the analyst team to review, think, and draw conclusions.
  2. Strong, well-structured internal database. For machine learning tools to work well, they rely on clean data (e.g. a singular data source for performance of companies and funds). LLMs (combined with data engineering) can help convert PDFs into well-structured data, which will then fuel future analyses. (A rough sketch of this kind of extraction follows this list.)
  3. Data-driven and process-focused governance. If we believe that LLMs and data will change the world, the challenge becomes integrating these new tools into everyday workflows. From my experience in healthcare, this integration step is the hardest part – adopting and trusting a new system is extremely difficult.
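
As a concrete (and hedged) sketch of thing #2: ask the LLM for a fixed set of fields as JSON, then load the result straight into an internal database. The field names and model are placeholders I made up for illustration:

```python
import json
from openai import OpenAI  # assumes the official openai package and an API key in the environment

client = OpenAI()

FIELDS = ["company_name", "fiscal_year", "total_revenue_usd", "net_income_usd"]  # hypothetical schema

def extract_record(document_text: str) -> dict:
    """Pull a fixed set of fields out of a filing as JSON, ready to insert into a database."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                        # placeholder model name
        response_format={"type": "json_object"},    # ask for machine-readable JSON back
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the fields {FIELDS} from the document below. "
                "Respond with a single JSON object and nothing else. "
                f"Use null for anything you cannot find.\n\n{document_text}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```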

In the past month, I’ve been most focused on thing #1 above, building an LLM-driven investment tool. It’s meant spending time with the minutiae of the data, learning how LLM agent systems are architected, building a (naïve) prototype, and seeing which start-ups and companies are worth using in this space. Many more posts to come on progress here (as well as investment deep dives, the end goal of all these tools!).

Thursday, July 3, 2025

An App for High-Quality SEC Filings

High-quality data as a differentiator

I’ve spent a good chunk of the last month working with data – data sources, data pipelines, data warehouses – and as a result, I’ve been forced to think about why I’ve been so drawn to it. I think it comes down to a thorough understanding of the ingredients. An analogy: a good home chef buys ingredients from the grocery store, but a great Michelin-star chef sources their ingredients directly from the farmers. This is what I’m after: sourcing the data, trusting its provenance, and cleaning it in the most effective way. It’s time-consuming and detail-oriented, but the general motivating idea is that the highest-quality meals can only be made with the highest-quality ingredients [1].

Pseudo-public data

I took a course this past spring whose theme was “harnessing data for the public good”: the idea that (a) we are surrounded by data (much of it public!) but (b) we struggle to draw good insights from it. I think of it as “pseudo-public” data, data that is technically public but used effectively by few. It’s an extremely attractive idea: all the information you need to solve the puzzle is at your disposal, and you just need to put the pieces together.

SEC filings as pseudo-public data

This brings me to my day-and-a-half side quest to pull SEC filing data, inspired by Joel Cohen's recent post on X asking for the best way to (a) download SEC filings and call transcripts and (b) analyze them. There were a few flavors of responses:

  • Start-up finance-focused AI companies: including Alphasense (Series F, finance data collection and analysis), Quartr (Series seed/A, finance research platform), ChatsheetAI (generalized AI-infused spreadsheet, extending into finance), Aiera (Series B, tracks investor events like earnings calls), finchat.io (now Fiscal.ai, a Series A investment research platform), fintool.com (Series seed, finance research platform), quasarmarkets.com (Angel-backed financial intelligence platform), Portrait (Accelerator, AI-powered investment research focused on screening ideas), askedgar.io (very early stage, AI-driven research of SEC filings)
  • Incumbent LLMs: Perplexity, Gemini Deep Research
  • Incumbent finance aggregators: Bloomberg, Factset
  • DIY: edgartools, a script written by Ben Brostoff

From an outsider’s perspective: the filings are publicly available (and with an API, too), yet people are still willing to buy them from a re-packager. Joel’s difficulty in getting this public information easily, though, reminded me of this pseudo-public data paradox, so I thought it was a worthwhile endeavor to add SEC filings to my data catalog.

App to pull high-quality SEC filing data to Markdown

Some technical notes:

  • SEC filings are accessible via an EDGAR API
  • The filings are available as HTML files, XBRL files, and a hybrid form (iXBRL)
    • HTML files give all the information filed, but HTML is very verbose. (Every piece of data is accompanied by its formatting, so you might see style=“font-size:10pt;font-weight:400;top-margin:10px;left-margin:10px” repeated many times. Some of the tables also have extra rows and columns to help with visual spacing.)
    • XBRL files contain table-like data

My stance is that SEC filing data can be converted to be “higher quality” with a few steps:

  • It’s easy to pull the HTML files from the EDGAR API, but it’s a little harder to know that the data should be (a) cleaned and (b) converted to Markdown.
    • (Note: Markdown is a format that is “extremely close to plain text, with minimal markup or formatting” and a format that “mainstream LLMs, such as OpenAI’s GPT-4o, natively ‘speak’” [2].)
  • I take a few extra steps – including tidying up the HTML tables to remove spacing columns [3] and contextualizing some of the text styles [4] – to ensure the output Markdown better reflects the original filing.

The resulting Markdown files are cleaner (i.e. contain minimal markup/formatting), smaller (important because LLMs charge by word), and most importantly more usable for downstream systems. My hypothesis is that it’s a higher-quality form of SEC filing data.
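
For the curious, here’s roughly what the pull-and-convert step looks like. This is a hedged sketch: the CIK is just an example, EDGAR expects a contact in the User-Agent header, and the real value-add (removing spacing rows/columns, using font styles to recover headers) is omitted here. I’m using the markdownify package for the HTML-to-Markdown step as a stand-in:

```python
import requests
from markdownify import markdownify  # assumes the markdownify package for the HTML -> Markdown step

HEADERS = {"User-Agent": "your-name your-email@example.com"}  # EDGAR asks for contact info in the User-Agent
CIK = "0000789019"  # example CIK, zero-padded to 10 digits

# 1. List the company's recent filings via the EDGAR submissions API.
subs = requests.get(f"https://data.sec.gov/submissions/CIK{CIK}.json", headers=HEADERS).json()
recent = subs["filings"]["recent"]
idx = recent["form"].index("10-K")  # most recent 10-K
accession = recent["accessionNumber"][idx].replace("-", "")
primary_doc = recent["primaryDocument"][idx]

# 2. Download the filing's primary HTML document.
url = f"https://www.sec.gov/Archives/edgar/data/{int(CIK)}/{accession}/{primary_doc}"
html = requests.get(url, headers=HEADERS).text

# 3. Convert to Markdown: minimal markup, much smaller, and friendlier for downstream LLMs.
with open("filing.md", "w") as f:
    f.write(markdownify(html))
```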

Gatekeeping this public data, though, feels a bit selfish, so I built an app to allow anyone to pull this data easily (and for free). It lets you download recent SEC filings for any company listed on EDGAR.

Extracting insights from SEC filings (an ongoing pursuit)

These Markdown SEC filings are small enough to be loaded into Google NotebookLM. (NotebookLM struggles with the full 10-K and 10-Q PDFs and HTML files.) I like NotebookLM for its ability to let you upload files and source answers from them, and based on some quick tests, it seems to do a good job. It also seems to be able to extract table data well from these Markdown files.


This “data analysis” piece is an ongoing project – it will be interesting to explore if the analysis can be differentiated, or if all solutions are small variations of big LLM players (ChatGPT, Gemini, etc.). If it's the former, look forward to more posts on it! If it’s the latter, then high-quality data (the ingredients that fuel the LLMs) may be a credible differentiator – and the focus may turn once again to sourcing the best data. 



[1] Digging into the data is also a worthwhile exercise to see which data sources are a commodity (and thus can be purchased off the shelf) and which can be a source of differentiation.

[3] The HTML tables in SEC filings contain “spacing columns” and “spacing rows,” which are strictly used to make the output look prettier to a human. Computers can be confused by these decorations.

[4] A little technical, but the HTML in SEC filings looks something like: <span style=“font-size:24pt;font-weight:700;top-margin:10px;left-margin:10px”>text</span>. Many of these styles are not useful to a computer, but things like the font size and font weight (i.e. bolding) give us contextual clues about where the headers, etc. are.
