Thursday, July 24, 2025

Dev Design: Investment Research "Agent"

PROBLEM SUMMARY [1]

Companies are awash in information – both public and private – but have a hard time interacting with it all effectively. RAG has been proposed as a way to organize and chat with private data, but there are a few key challenges:

Problem #1: User interface and integration

Most RAG systems’ user interface consists of a chat box, which “feels” natural and infinite for the end user. However, I think this only captures one use case: the “inquisitive” mode. It’s less effective when you take a more structured approach (e.g. you have a “research framework”) or when you’re not sure what to ask. Sometimes you need a 45-minute lecture to jostle the questions out of your brain.

A bit of a tangent, but I think this is something start-ups today are not doing well. Start-ups are focused on specific pieces of the end-to-end chain (e.g. UnstructuredIO) or spinning up a generalized business model (e.g. RAG-as-a-service).

Problem #2: Ability to prioritize sources

Most knowledge management systems don’t have an easy way to prioritize some voices over others. From an investment research lens: I’d likely prioritize my own firm’s investment memo over another firm’s, I’d value the financials in the 10-K over those from another source, and I’d trust an article from the Financial Times over a clickbait-y Business Insider one. This level of discernment is a critical part of the knowledge aggregation process, but it isn’t an option in most software.

Problem #3: Ingesting data files effectively

I’ve alluded to this in my personal fight with ingesting PDFs, but taking “unstructured” documents (like PDFs that have tables, graphs, images, etc.) and converting them into “structured” text is a challenging task with a long tail of corner cases. Reddit is replete with start-ups trying to tackle this problem, without one clear winner. A couple of common scenarios that are challenging: tables that don’t have borders (very common! [2]) and graphs.
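As a concrete illustration of the borderless-table pain point, here’s a small sketch with pdfplumber (the library mentioned in footnote [2]). The file name is a placeholder; the point is that the default line-based detection often misses borderless tables, and the text-alignment fallback is only a heuristic.

```python
# pip install pdfplumber
import pdfplumber

with pdfplumber.open("some_filing.pdf") as pdf:  # placeholder file name
    page = pdf.pages[0]

    # Default settings look for ruling lines – borderless tables often come back empty.
    line_tables = page.extract_tables()

    # Asking pdfplumber to infer cell boundaries from text alignment instead can
    # recover some borderless tables, but it's a heuristic with a long tail of failures.
    text_tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    })

    print(f"{len(line_tables)} tables found with line detection, "
          f"{len(text_tables)} with text-alignment detection")
```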

Problem #4: Privacy and data sovereignty

Unsurprisingly, many companies have private data that they want to keep private – but start-up vendors want some sort of ownership of the data. For example, LLM vendors (like ChatGPT) have public APIs with promises not to use your private data (which is hard to believe coming from companies that (a) are running out of publicly available data and (b) have a profit motive to use your data). Most RAG vendors want you to house your documents on their servers (which can be architected to be virtually private). And industry-specific vendors (such as in investment research) have a vested interest in looking at your data, if only to aggregate the results later.

This problem seems identical to the one faced in healthcare software over who owns the patients’ records. Epic, one of the largest healthcare software companies, has stood firm that hospitals own the patient data, not Epic. It means one fewer line on the income statement (bad for short-term profit), but builds trust with the hospitals it works with (good for long-term revenue). Most healthcare start-ups today look for ways to monetize the data, a tendency I could see playing out in the LLM/RAG space, too.

Problem #5: Cost and vendor lock-in

My cynical take on the start-up software world is that it (a) finds product-market fit by solving a need, (b) makes itself “sticky” in some way so that (c) it can jack up the prices later. As a software vendor, “stickiness” is a key feature that can later justify price increases, but from a customer’s perspective, this looks more like “lock-in” that holds us hostage to a vendor that we might come to despise.

(Again, these issues parallel the healthcare software world. I know clinics that are locked into multi-year contracts with a health record system that they hate, and I work with hospital systems that spend hundreds of thousands of dollars – or more! – to switch from a legacy system to Epic.)

 

SOLUTION

Version 1.0

My plan is to create an online research web app that begins to address Problems #1 and #2. Initial features:

  • User interface
    • Investment framework: when you search for a company, it will use the available data to pre-populate an investment framework
    • Chat-to-learn-more: you’ll have the ability to ask deeper questions of the key sources (powered by AI)
    • This combination feels like it will be the most usable long-term – almost like a “lecture with Q&A”
  • Filtering: data sources will be tagged by company, source, etc.
  • Prioritization: data sources will be “marked” with a prioritization level (see the sketch below)

(You can call this solution “AI-native,” “agentic,” “privacy-first,” and “user-centric” if you’d like.)
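To make the filtering and prioritization ideas concrete, here’s a minimal sketch of how sources might be tagged and ranked. The field names and the 1–5 priority scale are placeholders of my own, not a finalized schema.

```python
from dataclasses import dataclass

@dataclass
class SourceDocument:
    """A single research document, tagged so it can be filtered and ranked."""
    title: str
    company: str       # e.g. a ticker or fund name
    source_type: str   # e.g. "10-K", "internal memo", "news article"
    priority: int      # 1 = most trusted ... 5 = least trusted (placeholder scale)

# Hypothetical examples of how different sources might be ranked against each other.
library = [
    SourceDocument("FY2024 10-K", "ACME", "10-K", priority=1),
    SourceDocument("Our internal investment memo", "ACME", "internal memo", priority=1),
    SourceDocument("Financial Times article on ACME", "ACME", "news article", priority=2),
    SourceDocument("Clickbait-y blog post", "ACME", "news article", priority=4),
]

def filter_and_rank(docs, company, max_priority=3):
    """Keep only the company's sources that clear the trust bar, most trusted first."""
    kept = [d for d in docs if d.company == company and d.priority <= max_priority]
    return sorted(kept, key=lambda d: d.priority)

for doc in filter_and_rank(library, "ACME"):
    print(doc.priority, doc.source_type, "-", doc.title)
```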

This version will be built to be vendor-agnostic – i.e. no lock-in to any LLM (Problem #5). (LLMs are a commodity after all, aren’t they?) I’ll also only include a couple of companies, to reduce my own personal cost – each encoded document and each query costs fractions of a penny, which quickly add up. I plan to add a public company, a private company, and an investment fund. One of the biggest challenges here will be getting the RAG system to work well – and it will likely be imperfect!

Configuration options

One thing I learned from the development design arc at Epic: less is more when it comes to configuration settings. The imperatives:

  • Input for the company you’re looking at
  • Ability to ask follow-up questions
  • Maybe settings (or maybe these can just automagically work): ability to select data sources


Version 2.0

Version 2.0 will extend this:

  • Increase RAG accuracy
  • Add more companies
  • Add ways to retrieve data from and process outside sources
  • Add user ability to prioritize sources (or maybe delegate this task to an LLM?)
  • Test running this with a local LLM (Problem #4) – see the sketch below
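To get a feel for the local-LLM test (Problem #4), here’s a minimal sketch that sends a grounded question to a model served locally by Ollama over its HTTP API. The model name is a placeholder, and any locally hosted model with an API would work just as well.

```python
# Assumes Ollama is running locally (and a model has been pulled, e.g. `ollama pull llama3`).
import requests

def ask_local_llm(question: str, context: str) -> str:
    """Send a grounded question to a locally hosted model, so no data leaves the machine."""
    prompt = (
        "Given the context below and no prior knowledge, answer the question. "
        'If you are not sure, respond "I\'m not sure."\n\n'
        f"Context: {context}\n\nQuestion: {question}"
    )
    response = requests.post(
        "http://localhost:11434/api/generate",  # Ollama's default local endpoint
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]

print(ask_local_llm(
    "What years did Bill Clinton serve as president?",
    "Bill Clinton was the 42nd president of the United States from 1993 to 2001.",
))
```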

As we take in more data sources, the data ingestion problem (Problem #3) will become more of an issue. (In Version 1.0, I’ll manually clean the sources.) This problem will likely be delegated to a third-party vendor. There are plenty of vendors that are effective (I’ve been impressed by Morphik in a quick test) and cheap (fractions of a penny per page).

 

The ultimate goal

Ultimately, I’m hoping to build a prototype that is:

  • Usable for the long term – i.e. integrates well with existing workflows,
  • Accurate,
  • Explainable – i.e. as few black boxes as possible,
  • Capable of privacy, and
  • Relatively inexpensive.

I foresee larger companies being able to build this structure in-house (and having a vested interest in doing so!) – why trust outsiders with your own information? I hope to iron out some of the technological wrinkles and go into depth on them in the next few posts.



[1] Developments at Epic require a development design, which roughly follows this format (with a few more questions and a few more technical details). I've spared you the finer details, but I found this format useful for thinking about the whole point of a development: the problem and the solution.

[2] Pdfplumber is a commonly cited Python library that can handle tables… but it has a really hard time detecting tables that don’t have lines. See the cited Master’s thesis for why this is so tough.

Wednesday, July 23, 2025

The Pursuit of Grounded Truths: A Primer on RAG

I’ve found that many concepts in computer science are simple ideas that are complex to implement. RAG (or “retrieval-augmented generation”) is a concept that is intuitive – but it quickly gets shrouded by complex, domain-specific terminology. I hope to unravel this black box in this post.

The core issue that RAG solves is LLMs’ crisis of confidence. ChatGPT is liable to “hallucinate” answers, which makes the tool a non-starter in most business use cases. The underlying issue: LLMs are trying to probabilistically pick the best next word in a sentence, and they feel compelled to answer something, even if it’s wrong. Thus, a focus of computer science research has shifted to preventing the LLM from hallucinating – which is where RAG steps in.

Accuracy through atomicity (i.e. small, trivial tasks)

The entire framework of RAG relies on a simple idea: break large tasks into small, trivial ones. That’s it. The way RAG accomplishes this is with small, specific LLM calls. The user inputs (1) a question and (2) a small chunk of text, and asks the LLM to get the answer from the text. The idea is that by giving the LLM clear guidelines and a trusted text to search from, we can trust the response. This serves as the smallest indivisible unit that the rest of RAG is built upon.

Here’s an example of what this type of query looks like (you can copy the “context” and “question” into ChatGPT yourself to prove that this works!) [1]:

Context (which is included in the ChatGPT query):

You are a helpful assistant. Given the context below and no prior knowledge, answer the question. If you are not sure of the answer, respond “I’m not sure.”

Context: William Jefferson Clinton (né Blythe III; born August 19, 1946) is an American politician and lawyer who was the 42nd president of the United States from 1993 to 2001. A member of the Democratic Party, he previously served as the attorney general of Arkansas from 1977 to 1979 and as the governor of Arkansas from 1979 to 1981, and again from 1983 to 1992. Clinton, whose policies reflected a centrist "Third Way" political philosophy, became known as a New Democrat.


Question: What years did Bill Clinton serve as president?

ChatGPT Response: Bill Clinton served as president from 1993 to 2001.


Question: How many children did Bill Clinton have?

ChatGPT Response: I’m not sure.

These “low-level” queries can be modified to different use cases, but the core idea remains. For example, if we want ChatGPT to include citations, we can just ask; I’ve included an example in the Appendix.

There’s an elegance to how elementary-school simple this process is. These use cases feel trivial, but this near-guarantee of accuracy allows us to architect a system around it. For me, it also dispels the notion that the systems built on top of LLMs are magic. Instead, RAG systems are just a bundle of simple ideas, which together provide real value.
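For the curious, here’s what this smallest indivisible unit looks like as code rather than a chat window. It’s a minimal sketch using OpenAI’s Python client; the model name is an assumption, and any chat-capable LLM could be swapped in.

```python
# pip install openai  (and set OPENAI_API_KEY in your environment)
from openai import OpenAI

client = OpenAI()

def ask_with_context(question: str, context: str) -> str:
    """Ask the model to answer only from the supplied context, or admit uncertainty."""
    prompt = (
        "You are a helpful assistant. Given the context below and no prior knowledge, "
        'answer the question. If you are not sure of the answer, respond "I\'m not sure."\n\n'
        f"Context: {context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # keep the answer as deterministic as possible
    )
    return response.choices[0].message.content

clinton_bio = "William Jefferson Clinton (born August 19, 1946) was the 42nd president of the United States from 1993 to 2001."
print(ask_with_context("What years did Bill Clinton serve as president?", clinton_bio))
print(ask_with_context("How many children did Bill Clinton have?", clinton_bio))
```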

Chunking: breaking large documents (or many documents) into ChatGPT-able pieces

Let’s say we’re given a 100-page piece of text (e.g. a public company’s 10-K). LLMs have a maximum number of words that they can take in at once – you can’t pass in a whole book to ChatGPT and expect it to have a good answer. (OpenAI’s GPT-4 has an ~8K token maximum for its API.)

Another similar example: let’s say we have 20 documents (5 pages each) for a given company. Again, we can’t pass in all 100 pages at once with a question – there are simply too many words for the LLM to take in at once.

The solution that RAG uses is again simple: (1) break the documents into pieces, then when a question is asked, (2) find the best chunks of data to look at and (3) pass them into the above ChatGPT prompt. A stylized example:

Input: 100-page document

Chunking:
    Chunk 1: pages 1-5
    Chunk 2: pages 6-10
    …
    Chunk 20: pages 96-100

Find the best chunks of data:
    Chunk 7
    Chunk 10

Pass them into the LLM:
    Answer the question below only using the data provided: {Chunk 7} {Chunk 10} …
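Here’s a minimal sketch of the chunking step. Production systems usually chunk by tokens, sentences, or document sections; this version splits on a fixed word count with a small overlap just to illustrate the idea.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a long document into word-count chunks with a small overlap,
    so a sentence cut in half by one chunk still appears whole in the next."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start = end - overlap  # step back a little to create the overlap
    return chunks

# Example: a "document" of 2,000 words becomes a handful of overlapping chunks.
document = "word " * 2000
pieces = chunk_text(document)
print(len(pieces), "chunks;", len(pieces[0].split()), "words in the first chunk")
```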

Retrieval: finding the right chunk of text to look at

In practice, there could be hundreds or thousands of documents, each with hundreds of pages, leaving us with millions of chunks of text. The challenge then becomes: of these millions of chunks, which 10 chunks should we pass to ChatGPT with our question?

This is a classic “search” problem, almost identical to Googling for information. Given a bunch of keywords (e.g. “What is Microstrategy’s business strategy?”) and a corpus of documents (e.g. 10-Ks, 10-Qs, earnings calls, analyst reports), which pieces are most relevant?

The naïve way to solve this problem is by counting how many times words from the input question match each chunk of data. [2] This is eerily reminiscent of search engines pre-Google [3] – and thus a signal that this may not be the optimal strategy.
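A minimal sketch of that naive approach: score each chunk by how many distinct question words it contains, then keep the top few. (A real system would at least use something like BM25, per footnote [2].)

```python
import re

def keyword_score(question: str, chunk: str) -> int:
    """Count how many distinct question words also appear in the chunk."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", chunk.lower()))
    return len(q_words & c_words)

def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest keyword overlap."""
    return sorted(chunks, key=lambda c: keyword_score(question, c), reverse=True)[:k]

chunks = [
    "MicroStrategy's business strategy centers on enterprise analytics software and bitcoin holdings.",
    "The weather in Virginia was mild during the third quarter.",
    "Revenue grew 4% year over year, driven by subscription services.",
]
print(top_chunks("What is MicroStrategy's business strategy?", chunks, k=1))
```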

Encoding and vectorization: a more complex, LLM-driven way to find relevant chunks of data

The more modern solution is to use the LLM’s own capabilities to find matching chunks – i.e. “embedding” or “vectorization.”

A layman’s summary: at a low level, LLMs convert every word into a series of numbers (“vectors”), and words that are similar in meaning (such as “dog” and “puppy”) have vectors that are close together. This is essentially what the super expensive LLM training is: use every conceivable source of text on the internet, and train the LLM to convert words to vectors. This same “vectorization” concept is applied to our input question and the chunks of data: the question and chunks of data are “embedded” into vectors, allowing us to find chunks that are most similar to – and thus most likely to answer – the input question.

One gnawing question I had: why bother with this vectorization stuff at all? I believe “vectorization” takes advantage of the massive amount of training that the LLM provider has already done and allows the RAG system to match semantically similar words (e.g. “dog” and “canine”). It’s also quick and easy to do – most LLM providers have an API that allows you to vectorize content cheaply.
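Here’s a minimal sketch of embedding-based retrieval: embed the chunks and the question, then rank chunks by cosine similarity. It uses OpenAI’s embeddings endpoint and numpy; the embedding model name is an assumption, and any embedding model would do.

```python
# pip install openai numpy  (and set OPENAI_API_KEY)
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Turn each piece of text into a vector using the provider's embedding API."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumption: pick whichever embedding model you use
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

def most_similar(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question and chunks, then return the k closest chunks by cosine similarity."""
    chunk_vectors = embed(chunks)
    question_vector = embed([question])[0]
    # Cosine similarity = dot product of unit-normalized vectors
    chunk_unit = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    question_unit = question_vector / np.linalg.norm(question_vector)
    scores = chunk_unit @ question_unit
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

chunks = ["The dog chased the ball.", "Revenue grew 4% last year.", "A puppy slept on the porch."]
print(most_similar("Tell me about canines.", chunks, k=2))
```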

RAG, soup to nuts

The complete RAG process looks something like this:

  • Process the data
    • Find and ingest data
    • Chunk the data
    • Encode the chunks into “vectors”
    • Store the vectors into a database
  • Find relevant data chunks
    • Given a user question:
      • Embed the user question into a vector
      • Find the closest vectors in the database to the user’s question
  • Ask the LLM to answer the question
    • We now have (a) a user question and (b) relevant chunks of text
    • Prompt the LLM, and return the answer to the user
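For completeness, here’s roughly what that whole pipeline collapses to when you lean on an off-the-shelf framework like LlamaIndex, which handles the chunking, embedding, storage, and prompting behind the scenes. Import paths have shifted between LlamaIndex versions and an OpenAI API key is assumed, so treat this as a sketch rather than copy-paste-ready code.

```python
# pip install llama-index  (and set OPENAI_API_KEY)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Process the data: load documents from a folder, chunk them, embed them, store the vectors
documents = SimpleDirectoryReader("data/").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2 & 3. Find relevant chunks and ask the LLM, in one call
query_engine = index.as_query_engine()
response = query_engine.query("What years did Bill Clinton serve as president?")
print(response)
```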

RAG: challenging in practice

As with most things in computer science, RAG is a simple idea with copious challenges in implementation. Amongst the challenges developers actively face:

  • Data ingestion – converting documents and cleaning them can be challenging (e.g. converting a PDF to text, then removing headers/footers, cleaning tables, etc.; converting audio, graphs, and images can also be challenging)
  • Chunking strategy – deciding on how to split large files up is challenging (e.g. do you chunk by section? By sentence? By word?), and how you chunk may drastically impact accuracy of results later
  • Retrieval strategy – embedding/vectorization isn’t always the best strategy; there’s a few different algorithms and a few different vector databases that can be used.
  • Chaining queries together – open-source RAG frameworks like LlamaIndex allow you to easily chain queries together. For example, if a user enters a complex query that spans multiple documents (e.g. “How did 2022 and 2023 revenue compare?”), the framework will use an LLM to break the query into simpler ones that it then aggregates together.
  • User interface – most start-ups have defaulted to letting users interact with documents in a chat-style manner. But is this truly the best way?

For a company looking to use RAG, there are also copious other questions (e.g. buy vs. build). I plan to tackle these sorts of questions, as well as a prototype, in future posts.

 

Appendix

Sample ChatGPT with Citations [4]

Context:

Please provide an answer based solely on the provided sources. When referencing information from a source, cite the appropriate source(s) using their corresponding numbers. Every answer should include at least one source citation. Only cite a source when you are explicitly referencing it. If none of the sources are helpful, you should indicate that.

For example:

Source 1: The sky is red in the evening and blue in the morning.

Source 2: Water is wet when the sky is red.

Query: When is water wet?

Answer: Water will be wet when the sky is red [2], which occurs in the evening [1].

Now it’s your turn. Below are several numbered sources of information:

------

Source 1: Born and raised in Arkansas, Clinton graduated from Georgetown University in 1968, and later from Yale Law School, where he met his future wife, Hillary Rodham.

Source 2: After graduating from law school, Clinton returned to Arkansas and won election as state attorney general, followed by two non-consecutive tenures as Arkansas governor.

------

 

Question: Which schools did Bill Clinton attend?

ChatGPT Response: Bill Clinton graduated from Georgetown University in 1968 and later from Yale Law School [1].

 



[1] Text below taken from Bill Clinton - Wikipedia

[2] One real-world example: the BM25 algorithm, which can be used in LlamaIndex

[3] See: Google: The Complete History and Strategy from the Acquired podcast

Tuesday, July 22, 2025

LLMs and Investment Research Agents

In the past month, I’ve spent a lot of time focused on raw data (see: SEC filings, reading PDFs), and so it’s a good time to zoom out and look at the bigger picture. I think there are a few big questions:

  1. What can LLMs do that couldn’t be done before?
  2. Where can LLMs provide real value in the investment world?

How humans glean insights from data

Almost any research project can be generalized into a few steps:

  1. Data gathering – find trusted sources of data
  2. Data extraction – process the data from the source documents
  3. Data storage – save the data somewhere, in case we need it again
  4. Data analysis – slice and dice the data until you find something interesting
  5. Conclusion – report on the results

For most things, we don’t even think about these being individual steps. For example, if you’re researching a public company for a personal portfolio, “data gathering” is Googling, “data extraction” is reading a few articles, and “data analysis” is simply “thinking.” Another example: if you’re deciding on what car to buy, “data gathering” may mean finding the articles (e.g. Consumer Reports) or online forums (e.g. Reddit) that you trust the most, and “data analysis” means weighting the trustworthiness of each data source in your own head. A broader example: a city policy maker may “gather data” through city statistics but also by schmoozing with city business leaders, “data analysis” is a complex synthesis of the two, and the “conclusion” might be a new policy.

Why bother breaking these steps out? While intuitive at a small scale, these steps become important as you scale up. Most tech companies now have separate teams for data extraction/storage (“data engineering”) and data analysis (“data analysts”). Similarly, as we look to delegate tasks to robotic “agents,” we can’t just tell them: “Do this complex task.” Software developers have to break down complex problems into bite-sized components, closely mirroring the way that we humans think. If (and when) the software fails, we can take a closer look at each step: why did this fail? And was the approach we took here the right one for the job?

Additive value of LLMs in the research process

I keep finding myself coming back to the fundamental question: what changed with LLMs? What problems do they solve that couldn’t be done before?

In my view, the primary differentiating benefit of LLMs is the ability to (a) interact with the computer using truly natural language and (b) extract and analyze qualitative data (i.e. text, images, etc.). ChatGPT made such a big splash because of its ability to string together words into “fluent” English, but we’re coming to realize that its true value comes from its ability to process text.

It helps to view LLMs through the lens of the larger research process above. LLMs are excellent at “data extraction” (ingesting text) and good at some aspects of “data analysis” (regurgitating text). For example, if you have a list of 50 articles, an LLM is great at “reading” them all and summarizing the key points. Likewise, if you have a public company’s 10-K or a long legal document, an LLM should be great at pulling out details from it. They are getting better at “data gathering,” too; for complex or ambiguous questions, ChatGPT outperforms an old-fashioned Google search.

However, I believe humans still hold the edge in the other aspects of the research process (for now). Experts know where the best data is, how much to trust each source, and how to tie the pieces together. LLMs can – and should! – help with each step of the process, but in an inherently unpredictable, non-deterministic world, we’ll continue to rely on humans to make the consequential final decisions (“data analysis” and “conclusion”).

Investment research agents

As with LLMs, I find myself wondering: is there anything new with the advent of “AI agents,” or is the term just a brilliant stroke of VC marketing? At their core, all agents do is add (a) automation and (b) action-oriented APIs, creating a sequence of steps that gives the mirage of computerized life. I don’t think there’s any truly novel technology here, though; instead, AI agents’ emergence is about ease, scale, and ubiquity. It’s become almost easy for companies to adopt useful “AI agents” that provide tangible value (by building them themselves or hiring a start-up).

Likewise, it’s easy to see how agents could add immediate value in an investment office. For example:

  • Company analysis – given primary sources (10-Ks, 10-Qs, earnings calls, etc.), extract key metrics and an understanding of the business
  • Financial data extraction from audited reports – given quarterly updates or audited financials, extract the text and data into an internal database
  • Investment fund/founder due diligence – given the reams of public data about a person or investment fund (e.g. podcasts), build a one-pager
  • Investment pitch deck analysis – given an inbound pitch deck, partially fill out an investment framework (leaving gaps where we still need to ask questions)

These types of agents are starting to gain traction. LlamaIndex, one of the leading open-source agentic frameworks, advertises “faster data extraction, enhanced accuracy and consistency, and actionable insights” for finance workflows. One of its testimonials comes from Carlyle, who built out an LBO agent to “automatically fill structured [data] from unstructured 10-Ks and earnings decks.”

A 5-year plan for LLM-infused investment offices

I recently read about a solo VC GP (Sarah Smith) who’s built an “AI-native firm that can deliver 10x value in 1/10 of the time.” If we take this at face value, it represents a new, tech-forward model of how smaller offices (such as endowments and foundations) can operate with a lean team.

I think the small investment office of the future will need a couple of key things:

  1. LLM-driven investment process. An AI agent should be able to start with a list of “most trusted sources” (e.g. Bloomberg, SEC EDGAR, Pitchbook, internal notes) and then branch out (e.g. via a self-driven Google search) to output a strong first pass of due diligence. It’s then up to the analyst team to review, think, and draw conclusions.
  2. Strong, well-structured internal database. Machine learning works best with clean data (e.g. a singular data source for the performance of companies and funds). LLMs (combined with data engineering) can help convert PDFs into well-structured data, which will then fuel future analyses.
  3. Data-driven and process-focused governance. If we believe that LLMs and data will change the world, the challenge becomes integrating these new tools into everyday workflows. From my experience in healthcare, this integration step is the hardest part – adopting and trusting a new system is extremely difficult.

In the past month, I’ve been most focused on thing #1 above, building an LLM-driven investment tool. It’s meant spending time with the minutiae of the data, learning how LLM agent systems are architected, building a (naïve) prototype, and seeing which start-ups and companies are worth using in this space. Many more posts to come on progress here (as well as investment deep dives, the end goal of all these tools!).

Thursday, July 3, 2025

An App for High-Quality SEC Filings

High-quality data as a differentiator

I’ve spent a good chunk of the last month working with data – data sources, data pipelines, data warehouses – and as a result, I’ve been forced to think about why I’ve been so drawn to it. I think it comes down to a thorough understanding of the ingredients. An analogy: a good home chef buys ingredients from the grocery store, but a great Michelin-star chef sources their ingredients directly from the farmers. This is what I’m after: sourcing the data, trusting its provenance, and cleaning it in the most effective way. It’s time-consuming and detail-oriented, but the general motivating idea is that the highest-quality meals can only be made with the highest-quality ingredients [1].

Pseudo-public data

I took a course this past spring whose theme was “harnessing data for the public good” – the idea being that (a) we are surrounded by data (much of it public!) but (b) we struggle to draw good insights from it. I think of it as “pseudo-public” data: data that is technically public but used effectively by few. It’s an extremely attractive idea: all the information you need to solve the puzzle is at your disposal, and you just need to put the pieces together.

SEC filings as pseudo-public data

This brings me to my day-and-a-half side quest to pull SEC filing data, inspired by Joel Cohen's recent post on X asking for the best way to (a) download SEC filings and call transcripts and (b) analyze them. There were a few flavors of responses:

  • Start-up finance-focused AI companies: including Alphasense (Series F, finance data collection and analysis), Quartr (Series seed/A, finance research platform), ChatsheetAI (generalized AI-infused spreadsheet, extending into finance), Aiera (Series B, tracks investor events like earnings calls), finchat.io (now Fiscal.ai, a Series A investment research platform), fintool.com (Series seed, finance research platform), quasarmarkets.com (Angel-backed financial intelligence platform), Portrait (Accelerator, AI-powered investment research focused on screening ideas), askedgar.io (very early stage, AI-driven research of SEC filings)
  • Incumbent LLMs: Perplexity, Gemini Deep Research
  • Incumbent finance aggregators: Bloomberg, Factset
  • DIY: edgartools, a script written by Ben Brostoff

From an outsider’s perspective: the filings are publicly available (and with an API, too), yet people are still willing to buy them from a re-packager. Joel’s difficulty in getting this public information easily, though, reminded me of this pseudo-public data paradox, so I thought it was a worthwhile endeavor to add SEC filings to my data catalog.

App to pull high-quality SEC filing data to Markdown

Some technical notes:

  • SEC filings are accessible via an EDGAR API
  • The filings are available as HTML files, XBRL files, and a hybrid form (iXBRL)
    • HTML files give all the information filed, but HTML is very verbose. (Every piece of data is accompanied by its formatting, so you might see style=“font-size:10pt;font-weight:400;top-margin:10px;left-margin:10px” repeated many times. Some of the tables also have extra rows and columns to help with visual spacing.)
    • XBRL files contain table-like data

My stance is that SEC filing data can be converted to be “higher quality” with a few steps:

  • It’s easy to pull the HTML files from the EDGAR API, but it’s a little harder to know that the data should be (a) cleaned and (b) converted to Markdown.
    • (Note: Markdown is a format that is “extremely close to plain text, with minimal markup or formatting” and a format that “mainstream LLMs, such as OpenAI’s GPT-4o, natively ‘speak’” [2].)
  • I take a few extra steps – including tidying up the HTML tables to remove spacing columns [3] and contextualizing some of the text styles [4] – to ensure the output Markdown better reflects the original filing.

The resulting Markdown files are cleaner (i.e. contain minimal markup/formatting), smaller (important because LLMs charge by word), and most importantly more usable for downstream systems. My hypothesis is that it’s a higher-quality form of SEC filing data.
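To make the cleanup step concrete, here’s a heavily simplified sketch of the kind of transformation involved, using BeautifulSoup to strip inline styles and spacing cells and emit a Markdown table. This is not the app’s actual code – real filings need much more care around nested tags, footnotes, and iXBRL markup.

```python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

def html_table_to_markdown(html: str) -> str:
    """Strip presentation-only markup from a filing snippet and emit a Markdown table."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop inline styles – they balloon the file size and carry no meaning for an LLM.
    for tag in soup.find_all(True):
        tag.attrs.pop("style", None)

    lines = []
    for row in soup.find_all("tr"):
        # Keep only cells with real content; skip the empty "spacing" cells used for layout.
        cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
        cells = [c for c in cells if c]
        if cells:
            lines.append("| " + " | ".join(cells) + " |")

    if len(lines) > 1:
        # Insert the Markdown header separator after the first row.
        num_cols = lines[0].count("|") - 1
        lines.insert(1, "|" + " --- |" * num_cols)
    return "\n".join(lines)

snippet = """
<table>
  <tr><td style="font-weight:700">Revenue</td><td></td><td style="text-align:right">$1,000</td></tr>
  <tr><td>Net income</td><td></td><td>$200</td></tr>
</table>
"""
print(html_table_to_markdown(snippet))
```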

Gatekeeping this public data, though, feels a bit selfish. I built an app to allow anyone to pull this data easily (and for free). It allows you to download recent SEC filings for any company listed on EDGAR.

Extracting insights from SEC filings (an ongoing pursuit)

These Markdown SEC filings are small enough to be loaded into Google NotebookLM. (NotebookLM struggles with the full 10-K and 10-Q PDFs and HTML files.) I like NotebookLM for its ability to let you upload files and source answers from them; based on some quick tests, it seems to do a good job, and it also extracts table data well from these Markdown files.


This “data analysis” piece is an ongoing project – it will be interesting to explore whether the analysis can be differentiated, or whether all solutions are small variations on the big LLM players (ChatGPT, Gemini, etc.). If it's the former, look forward to more posts on it! If it’s the latter, then high-quality data (the ingredients that fuel the LLMs) may be a credible differentiator – and the focus may turn once again to sourcing the best data.



[1] Digging into the data is also a worthwhile exercise to see which data sources are a commodity (and thus can be purchased off the shelf) and which can be a source of differentiation.

[3] The HTML tables in SEC filings contain “spacing columns” and “spacing rows,” which are strictly used to make the output look prettier to a human. Computers can be confused by these decorations.

[4] A little technical, but the HTML in SEC filings looks something like: <span style=“font-size:24pt;font-weight:700;top-margin:10px;left-margin:10px”>text</span>. Many of these styles are not useful to a computer, but things like the font size and font weight (i.e. bolding) give us contextual clues about where the headers, etc. are.

Thursday, June 26, 2025

Can Computers Read PDFs?: Overview

LLM recap

VCs readily acknowledge that we’re in an AI bubble [1], which naturally evokes questions about where LLMs do well – and where they will continue to struggle. I wrote about my perspective, and my main takeaways:

  • LLMs are a powerful but overutilized tool. Their responses feel like magic, leading to their meteoric rise. As a result, people try to use LLMs for nearly everything.
  • However, LLMs still struggle with accuracy and reliability. My (hyperbolic) take: these are features, not bugs. The non-deterministic nature of LLMs that makes them so good at conversation also makes them bad at being 100% accurate.
  • We will discover that we still need older but reliable technologies. LLMs will need to co-exist with – not supplant – strong data pipelines, traditional relational databases, etc.

I don’t think these statements are terribly controversial, but it is worth saying explicitly because we’ve grown up with the idea that a new technology must wipe out the old. Our collective imagination has become enamored with sexy stories of “disruptive technologies” – if the iPhone is to succeed, the Blackberry must fail; if Netflix is to succeed, cable television must disappear; if Uber is to succeed, taxis will become an artifact. We overimagine the dominance of the new technology, leading to a bubble that eventually deflates when reality sets in.

This instinct – that new technology completely replaces the old – is misguided. What cemented this idea for me was learning more about GPUs (from Chris Miller’s Chip War) and the nascent quantum computing industry. Chip War gives you the general impression that GPUs are all that matter moving forward. In reality, GPUs handle specialized work (e.g. high-end computer graphics, LLM inference); computers will still rely on CPUs for the bulk of their processing. Likewise, quantum experts do not believe that quantum computing will replace CPUs; instead, quantum computers will handle only the hardest problems that CPUs and GPUs cannot solve today [2]. A future state-of-the-art computer will feature a central CPU, further enhanced by a GPU and QPU (quantum processing unit). In other words, newer technology will still rely on – and build on – the strengths of older technologies.

The core tension with LLMs: 100% reliability at scale

This was a long way to get to the core tension with LLMs (and more broadly, “AI agents”) today:

  • LLMs are great at getting information from unstructured data sources (e.g. PDFs, Excel files, Word documents),
  • However, LLMs are hard to trust at scale, and
  • There is no easy fix to making LLMs more trustworthy.

Kunle Omojola is tackling this same problem from the healthcare billing world, and summarizes it well:

The more I see and the more I dig in, the clearer it is that the deterministic approach utilizing models for edge cases and self healing/auto healing beats the pure nondeterministic at enterprise scale … if there’s a model free alternative that is completely deterministic and the customer/end user expects a deterministic result, enterprise customers would choose that 10 times out of 10.

Through this lens, the LLM/non-LLM discussion is nothing more than the Pareto principle: LLMs do 80% of a task well, but in some cases, the last 20% (or last 5%) is what provides true value. And, as Kunle states, a standalone LLM may not be the best solution if it’s the last 20% that matters most.

I’m approaching the problem from a slightly different lens, tackling the opportunities that investment offices face today. Some use cases I’ve heard about: (a) analyzing start-up or investment fund pitch decks and doing an initial pass of due diligence, (b) automatically extracting financials (e.g. investment returns) from quarterly reports from start-ups/funds, and (c) searching through past start-up/fund documentation for patterns or changes. These are all flavors of the same problem: given a PDF, can the LLM extract insights? It feels like a foundational question in finance (or finance operations) today: we are surrounded by an excess of high-quality data but an inability to use most of it well.

Tackling the reliability gap

The patience and virtue required to solve challenging problems with sustainable solutions is one of the things I've taken away from my eight years at Epic Systems. Over my tenure, I spent over 80 days onsite doing "floor support" (at-the-elbow training help for a hospital just starting to use our software) and over 110 "command center" days (centralized support making broad system changes) – hands-on knowledge that I would later learn is rare in the tech world. I'd always keep a list of pain points users faced, which would generally fall into a few buckets:

  1. Training issues. The system is "working as designed," but either (a) the user wasn't trained well or (b) the system's layout is unintuitive (or a combination of both). 
  2. Quick-wins. A software as large and complex as Epic always has a few simple developments that could save people a few clicks or make information available exactly when it's needed. Tens of papercuts can ruin the first impressions of an otherwise great core product, so I'd bring these back to our development team (and sometimes just tackle a few of them myself). 
  3. Customer requests that don't address the core need. One underlying philosophy at Epic is that the customer is not always right: they're asking for X, but that's really addressing Y, and so they really need Z. One example: pharmacies would ask for a couple of reporting columns to hack together a jury-rigged med sync program, and to their disappointment, we would tell them to hold out for the full-blown med sync development project. It was always hard for end-users (i.e. pharmacists) to see the best solution, because they didn't have a strong sense of the technical underpinnings.
  4. Simple requests that are extremely challenging to do well. The last bucket was projects that seem simple on the surface but were actually extremely difficult to develop well (i.e. make them function well and architect them in a way that could be built upon easily). One example that fit this bucket was the ability to schedule a future fill (e.g. a patient calls and wants their prescription filled next week). It would've been easy to throw together a quick stop-gap solution, but we put our strongest developer on a hundreds-of-hours project to come up with a complete solution. It meant delaying the feature by a few releases (and undoubtedly frustrated a few customers), but I think this intentionality meant the code would have a better chance of standing the test of time.

It's through lenses 3 and 4 that I tackle the PDF/LLM problem. Lens 3: it's relatively easy to see that "we have piles of data that we're not fully using," but it's much harder for an investing professional to see the steps you might need to take to address it. Lens 4: taking a slower, more deliberate technical approach will be much more painful now but will create a strong technical foundation (i.e. the complete antithesis of "vibe coding").

Building blocks for LLM-driven PDF analysis

With that, I'd break the generalized problem (PDF-driven analysis) into a few discrete steps:

  1. Extract text/tables from unstructured data,
  2. Store/retrieve the data, then
  3. Analyze the unstructured data.

Naïve LLM implementations today try to use LLMs to tackle all three steps in one fell swoop. I’ve tried this before with middling success: give ChatGPT a PDF and ask it basic questions. It doesn’t do a great job. There are a couple of issues: (a) it’s asking too much of the LLM to do all three steps at once, and (b) even if it could, it would be hard to scale it reliably across a large organization. The result: when some organizations test out LLMs, they find that LLMs add little value.

As I’ve not-so-subtly alluded to, the process needs to be broken into its core components and tackled individually. I’ll break this down into a few separate posts to get into the details of each step, but I’ll spoil the big takeaway: LLMs in their current iteration are not well-suited for all the tasks above. Indeed, the best version of the future will layer LLMs on top of deterministic algorithms and well-built databases – exactly what we’ll try to build up to in the next few write-ups.



[1] Gartner – of the famous Gartner Hype Cycle – put “prompt engineering,” “AI engineering,” and “foundation models” near the “peak of inflated expectations” in November 2024.

[2] Before learning more about quantum computing, I had originally thought that quantum computers would completely replace modern computers.

Wednesday, June 18, 2025

LLMs: Technological Savior or the Next Bygone Trend?

A brief, incomplete history of LLMs

While machine learning is not new to the 2020s, the modern era of LLMs kicked off with the release of ChatGPT in November 2022. I learned about ChatGPT from my professor in an Advanced Computer Security and Privacy class; he was floored by what it could accomplish. The first use cases were novelty trinkets that were built to go viral on social media; I remember a poem about a toaster, then more from OpenAI and then the New York Times. Yet LLMs and generative AI still felt like they were in their exploratory phase, an easy-to-use technology looking for its footing in the real world.

ChatGPT was just the catalyst for the new LLM bubble, though. It was built on the landmark paper, “Attention Is All You Need,” published in 2017 by Google researchers – a full five years before OpenAI released its app. Recounting the history raises more questions than answers – e.g. how did Google squander its lead in AI? Why did it not release an LLM before OpenAI?

Instead, I’d like to focus on the bigger picture: why did ChatGPT become so popular, what problems did it excel at, and where might it struggle? As builders begin to integrate LLMs, where’s the hype and where’s the real value?

ChatGPT: realized use cases 

At a fundamental level, all that generative models like ChatGPT try to do is predict the next word in a sequence. (For example: predict what word comes next: “The cat sat on the …”) ChatGPT’s transformer architecture differed from prior models in two significant ways: (1) it was able to process more words at once and decide which words to pay more “attention” to, and (2) training the models – which was previously a bottleneck – could be sped up significantly by GPUs, allowing for more efficient training on larger datasets. In short, ChatGPT had been trained on vastly more data with a better algorithm, leading to its efficacy and near-immediate popularity on release.

To me, what made ChatGPT (and subsequent LLMs) so powerful:

  1. Ease of use. You interact with it using normal English, with an interface no more complicated than Google’s homepage.
  2. Broad (and novel) use cases. It could accomplish tasks previously hard for a computer, such as drafting good emails, composing poems, writing prose in the style of a famous author, and developing well-written code.
  3. Mostly accurate. It did pretty well on most things, but occasionally would say something overtly wrong, or would forget that it knew how to write a sonnet.

From these origins extended an unbridled optimism about LLMs. After all, if the LLMs could reason (or at least appear to reason [1]), at what point could AI build stronger AI than human computer scientists could, leading to superintelligence [2]? More straightforwardly: which jobs could a good LLM replace? Certainly, call center reps and chatbots, but could poets, journalists, writers, lawyers, consultants, and doctors be next?

Where LLMs struggle

It’s been 2.5 years since ChatGPT was released, and enthusiasm for LLMs has not waned. Instead, the environment has only become frothier – authentic innovators (people who understand a problem and solve it using the best technology for the job) are mixed in with “follow-on” entrepreneurs (people who prioritize adding “LLM” or “AI Agent” into their start-up pitch over solving a real problem).

Nevertheless, we’ve seen LLM adoption stall – there seems to be some invisible barrier to everyday adoption in the workplace. In a simplified tech adoption world, there are only a few potential root causes: (a) the product is bad, (b) the product is good but doesn’t meet a customer need, or (c) the customer is poorly trained. Which is it?

To try to answer this, it’s important to see where and why LLMs still struggle:

  1. Accuracy. Models are sometimes slightly wrong, other times very wrong. For everyday use cases (e.g. planning a road trip, looking up a recipe), this isn’t a big deal, but for work cases, you can’t have a computer program that makes mistakes like a summer intern.
  2. Reliability. Tied in with “accuracy,” most work use cases require something you can absolutely trust to work 99.9% of the time.
  3. False confidence. Whether the model is wrong or right, it may project the same level of confidence.
  4. Black box algorithm. The LLM’s algorithm is a black box: you feed it input, it gives you an output, but it’s hard to troubleshoot if something goes wrong.
  5. Data privacy concerns. Sending sensitive data (like confidential documents) to model providers like OpenAI may be against contractual agreements (especially if OpenAI decides to use the data later for model training).

To me, the answer is (b) – the product is fine, but the customer expects it to do more (be more accurate, be more reliable) than it currently is. These issues stem from the very nature of LLMs and machine learning in general. Traditional algorithms are deterministic: for any input, they return a single output. It’s predictable and reliable by design. Machine learning, however, is probabilistic: a single input can result in a range of outputs, some more probable than others. It’s what allows ChatGPT to create (what I consider to be) truly creative poems – but it also means it will (and does!) struggle with problems that have a single correct solution. The probabilistic nature of LLMs not only harms accuracy/reliability, but makes it almost impossible to know where the model went wrong when it inevitably does.

A side story on where/how models can go wrong

Can LLMs replace grunt work? One task I wanted help on was finding the Employer Identification Number (EIN) for a list of private universities. The reason: you can look up a non-profit’s 990 filings with the EIN, so it serves as a critical bridge from a university to its federal financial filings.

This seemed like the perfect task for ChatGPT: I give it a spreadsheet of 600 universities and a simple task (“find the EIN for each university”), and let it get to work. Immediately it ran into problems. Some EINs were correct, but others were “understandably” inaccurate (same city/state, completely different entity), while others were wildly incorrect (a completely made up EIN). It also refused to do more than about 40 universities at a time, and would try to “deceive” me, showing me 10 EINs it processed, asking if I wanted to see the rest, then failing to provide them. Or, it would let me download a file, but only 20 lines would be filled out. I tried different tricks to get it to give me all the information at once, but in the end, I had to manually feed it all 600 universities in chunks of 40 to get it to complete the task. (LLMs often have a “token” budget, only allowing them to process so many words at once; my theory was that the LLM had run out of budget for the larger queries.) I did finally get my list of 600 EINs – and only later discovered that most of them were wrong.

Working with ChatGPT felt more like working with an insolent intern than a superintelligent being. Ultimately, I had zero trust in the work being accurate and ended up going through the list of universities one-by-one myself to ensure 100% accuracy. (So, like an intern, it proved to be less time-efficient than doing it myself!)

The race to bridge the accuracy/reliability gap

I’m sure anyone who’s used ChatGPT or its competitors has had similar experiences – LLMs are mostly correct, sometimes wrong but in a cute way, and sometimes extraordinarily wrong. They can save time on steps where a human verifies the outputs (e.g. writing emails or poems), but can they be trusted to act independently for “business-critical” needs? Can the probabilistic nature of LLMs be coerced into solving deterministic problems well? Can they reliably and accurately synthesize answers to questions? [3]

This is what I see as the next frontier in LLMs playing out today: making LLMs predictable and reliable.

The first wave was prompt engineering. Perhaps, the thinking went, the problem is not the LLM but us, the users. If only we asked the right questions in the right ways, we could convince the LLM to give us the right answers. Thus emerged the field of “prompt engineering,” with the idea that well-written prompts can elicit better answers. There is a truth to this – more detailed, context-laden questions get better answers – but it is not a panacea to the reliability crisis. Great prompt engineering alone cannot make LLMs perfect, but it’s a great (and easy to deploy) tool to have in the toolkit.

Another idea emerged, targeting the data that LLMs were trained on. Base LLMs (like the original GPT-3 model) are trained on reams of data trawled from the entire internet, but what if the internet didn’t have enough contextual information for our use case? For example, if we want a healthcare-focused LLM, perhaps the base model didn’t have enough exposure to realistic healthcare data. LLM developers could take a base model (e.g. GPT-3) and fine-tune it with domain-specific data, increasing accuracy. It takes a lot of work – the data must be labeled and accurate, and it takes time and compute to train the model – but it could result in a stronger underlying LLM. This approach seems to work well for domain-specific contexts, such as healthcare (Nuance/Microsoft, HippocraticAI), finance (BloombergGPT), and legal (Harvey).

Re-training a model, though, can be costly, and while fine-tuned models might be more accurate, it was hard to guarantee a certain threshold of reliability. The line of thinking shifted: if we can’t get the LLM to reliably give us a correct answer, perhaps it could just point us to trusted documentation. We could provide the LLM with a source of truth – say, a folder of proprietary documents – and the LLM would respond with referenced evidence. This approach – retrieval-augmented generation (RAG) – required far less (expensive) training, while also grounding the model in some truths.

The RAG approach seems to be a useful way to close the reliability gap. Implementation-wise, it’s lightweight and easy to update (although there is plenty of complexity in implementing it well). Usability-wise, it’s simple and explainable and replicates how we humans have been taught to think. It still has its shortcomings, such as reliably fetching data stored in tables and providing completely hallucination-free responses, but RAG seems like a simple and useful architecture that will stand the test of time.

I’ve also seen more technical areas of computer science research looking to dissect the black box powering the LLM. One area of research I found fascinating was explainable AI, a pursuit to understand the weights and algorithms behind LLMs. For example, are there certain words that strongly influence outputs? How consistent are an LLM’s responses, subject to slight perturbations in the input? And can you force the LLM to explain its reasoning through “chain of thought” (or understand how it reasons by looking at the internal weights)? This research feels largely theoretical and impractical today, but provides direction on what the future of LLMs might look like.

This brings us to today’s AI agents (a term that is a brilliant piece of marketing). At its core, the key insight was that LLMs could perform better on complex tasks if we help them break the problem into simpler components. For example, asking an LLM for a quantitative analysis of a stock might be too complex and open-ended for it to do well, whereas asking it to (1) write SQL code to retrieve certain financial data and then (2) analyze it has a much higher probability of success. This approach also allows us to better understand where the LLM went wrong – i.e. better explainability, higher reliability. I think this approach also has staying power – as long as the core components behind each step are reliable. As with all new tech, reliability is paramount.
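As a toy sketch of that decomposition: one LLM call writes SQL against a known schema, deterministic code executes it, and a second call interprets only the returned rows. The table, column names, and model are made up for illustration, so treat this as a sketch of the pattern rather than a working agent.

```python
# pip install openai  (sqlite3 is in the standard library; set OPENAI_API_KEY)
import sqlite3
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption: any capable chat model

def llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return response.choices[0].message.content.strip()

# A tiny, made-up financials table standing in for a real internal database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revenue (year INTEGER, amount REAL)")
db.executemany("INSERT INTO revenue VALUES (?, ?)", [(2022, 100.0), (2023, 120.0)])

question = "How did 2022 and 2023 revenue compare?"

# Step 1: ask the LLM for SQL only – a small, checkable task.
# (A production version would validate or strip the returned SQL before running it.)
sql = llm(
    "Write a single SQLite query (plain SQL only, no markdown) against the table "
    f"revenue(year INTEGER, amount REAL) to answer: {question}"
)

# Step 2: deterministic execution – if the generated SQL is wrong, it fails loudly here.
rows = db.execute(sql).fetchall()

# Step 3: ask the LLM to interpret only the returned rows.
print(llm(f"Question: {question}\nQuery result rows: {rows}\nAnswer in one sentence."))
```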

The compounding effect of technology

My latest hands-on exploration of LLMs has made me see them for what they are: a powerful tool that shines at some things but falls short of the reliability threshold needed for many real-world applications. There is simply too much noise, unpredictability, and “hallucination” to be 100% reliable where 100% reliability is needed. This reliability gap will narrow, but will it ever fully close? I don’t think so. To me, it comes back to the probabilistic foundation that LLMs were built on. They were not built to be 100% reliable, and trying to cajole them into being something they’re not is a fool’s errand.

Instead, for me, the latest technology shines a light on older, weather-tested technologies. In the finance and investing world, this means structured databases with automated data pipelines filled with data you can 100% trust. An investing office that has a strong, well-structured database can then benefit from building LLM applications on top of it to query it. Without it, LLMs can quickly become unmoored.

Through this lens, building a technology stack mirrors the principles of long-term value investing. The core portfolio – or the core technology infrastructure – should be high-conviction, something we’re willing to hold onto for the next decade. Newer technologies (like LLMs) are the venture-like investments in the technology stack; they have the potential to grossly outperform, but can also fizzle out quickly. Nevertheless, the core technology stack must be built on a strong foundation – a strong internal database, strong data pipeline capabilities. Only with this can the newer tools (like LLMs) truly shine.


[1] This also raises existential questions about what it means to be human, in general. We’ve all met enough people who merely appear to reason, so are “dumb” LLMs really any different than humans?

[2] There’s a whole separate strain of research around the risks of AI superintelligence; see Superintelligence: Paths, Dangers, Strategies by Nick Bostrom (2014) for one foundational apocalyptic take on AI.

[3] In this section, I focus on the “answer retrieval” use case for LLMs. Another excellent use case that I ignore here is summarizing text. Some of the use cases that LLMs were originally trained on are summarizing large bodies of text. (Essentially, the LLM is “translating” a larger block of text into a smaller block of text.) One example I remember is early LLMs summarizing an entire book by first summarizing each chapter, then summarizing all the chapter summaries – in a way that matched human-generated summaries. All this to say: summarizing is an area I think LLMs excel at (and in most use cases, summaries can have slight imperfections).

Sunday, January 5, 2025

Raison D’Etre

In my prior life at Epic Systems, now a healthcare software goliath, I spent the majority of my time hyper-focused on how to keep customers happy on our outpatient pharmacy product. My core mandate was long-term customer support, but this meant different things on different days: walking customers through an upcoming upgrade, diving deep into the code to pinpoint the root cause of a patient safety issue, developing complex code for a future release, and critiquing developers’ upcoming designs could comprise a single day. In retrospect, what made the role so unique was that 95% of time and energy was spent on the customer; rarely did I think about sales, revenue, hitting KPIs, industry trends, or third-party relationships. The only focus was customer happiness, by whatever means necessary.

The past two years have reaffirmed my decision to leave Epic after 8 great years. Even though few days were boring and I really felt the patient-focused mission, I had a nagging feeling that growth at the company was an inward growth – that “new opportunities” would build my internal expertise, doubling down on my Epic specialization. I felt that if I stayed another five years, my work would be largely the same. Ten years later, would I regret not trying something new and challenging?

Seldom is the career risk not worth taking; I’ve enjoyed the opportunity to explore the endless number of career paths. While still in Wisconsin, I finished up a Master’s of Computer Science just as ChatGPT was coming out – an exciting time to be studying machine learning. I’d never felt at home as a full-time software developer, though, so when I came to business school, I endeavored to explore everything from the standard paths (i.e. management consulting, investment banking, tech product management) to the less-trodden paths (e.g. general management, investment management, PE/VC, entrepreneurship through acquisition).

In hindsight, the way I gauged various career paths was: did the skills, expertise, and values of tenured employees match what I wanted? What future version of myself would I be willing to work hard for right now? What stuck out most was a career in investments, which, at its best, seems like a mix of art and science, intellect and relationships, statistics and history. It’s an industry where every piece of news and history seems to be relevant.

I’ve spent the past couple years learning the foundations of investing; two years ago, I hadn’t heard of the “time value of money” or the concept of private equity. My goal for this blog is to take the leap from passive learning to actively synthesizing information about the wider investing universe. I’m certain that I’ll get some things wrong, but I’m hopeful that this will serve as (a) a form of accountability to myself and (b) a ledger-like record of my current thoughts. I hope it will help me straighten out the vast landscape of financial markets and, at some point, have a clear and interesting voice in the discussion.

A few things I’m hoping to dig into:

Technical (i.e. coding) analysis of markets. At Epic, I prided myself on being able to write code quickly to prove or disprove an idea; translating idea to code felt as easy as “breathing out” code. Unfortunately, Epic used a bespoke language (MUMPS) with a proprietary library, and so I’ve felt the pain of having to learn a new language (Python). The best solution: repetition, practice, and personal projects. In a past life, I built my reputation on understanding technical details better than anyone (and being able to translate into customer-consumable content). Being able to effortlessly produce quantitative evidence will be invaluable to my long-term edge as an investor.

Some projects here include:

  • Collecting market data. At Epic, all of the data was centrally stored and easy to access – not the case in the real world! I’ve found I need to learn not only where to find financial data but which to trust, as well as how to access it via code.
  • Testing market ideas. My hope is to be able to quickly and easily test market ideas, be it company vs. macroeconomic correlations, backtesting investment ideas, or screening public companies.
  • Running cross-sectional regressions. After taking a course on quantitative investing, I’ve learned that the gold-standard in (academic) factor testing is cross-sectional regression. I’d love to be able to quickly test market factor ideas and recreate interesting paper ideas (such as Verdad Cap’s recent poor man’s pod shop replication idea).

Investment ideas, company valuations, and industry trends. I also hope to practice (a) thesis building and (b) fundamental valuations of public companies, especially for companies that I might want to invest in. My hope: have a paper trail of my own investment ideas and theses, so that I can see where I went right/wrong later.  

One thing I’ve been interested in lately is how/if investments in speculative industries can successfully be a part of an investment strategy. It’s not a novel strategy in the least; I’m reading Devil Takes the Hindmost and Bill Janeway’s Doing Capitalism in the Innovation Economy, which both touch on the historic hype/bust cycles of bubble markets (referred to as the “innovation economy” by Bill Janeway). But there have been a slew of public stocks that have “rocket-shipped” in the past couple months, including bitcoin and quantum. Are these performing well because of social media buzz (e.g. Reddit, Robinhood), market catalysts (e.g. Trump’s election), fundamental/structural changes in the internet economy, or something else? Could you construct a VC-like public equity portfolio (i.e. one or two big winners in 10 stocks) to capture these ultra-short-term momentum stocks? How would you hedge away risk (maybe hedge against an unexpected market dip with the VIX)?

Random finance news, history, and questions (especially from an asset allocator perspective). It’s fascinating how finance, like most industries, has been cobbled together into its current state through a series of scandals, accidents, and course corrections.

  • One topic I’m particularly interested now is index investing. Modern wisdom for personal investors is to index everything (see: Buffett), but most large allocators have a large active portfolio. At what point (or under what conditions) does active management add value?
  • If index investing is increasing, how does that impact price discovery? Does an increase in index investing buffer against price fluctuations or act as a catalyst for more volatility?
  • There is a laundry list of other similar finance and finance-adjacent topics, including: Ireland and its tax impact, a quantitative understanding of risk, and which types of companies go public (and what macro factors drive big IPO years).
