Thursday, June 26, 2025

Can Computers Read PDFs?: Overview

LLM recap

VCs readily acknowledge that we’re in an AI bubble [1], which naturally evokes questions about where LLMs do well – and where they will continue to struggle. I wrote about my perspective; my main takeaways:

  • LLMs are a powerful but overutilized tool. Their responses feel like magic, leading to their meteoric rise. As a result, people try to use LLMs for nearly everything.
  • However, LLMs still struggle with accuracy and reliability. My (hyperbolic) take: these are features, not bugs. The same non-deterministic nature that makes LLMs so good at conversation makes them bad at being 100% accurate.
  • We will discover that we still need older but reliable technologies. LLMs will need to co-exist with – not supplant – strong data pipelines, traditional relational databases, etc.

I don’t think these statements are terribly controversial, but they are worth saying explicitly because we’ve grown up with the idea that a new technology must wipe out the old. Our collective imagination has become enamored with sexy stories of “disruptive technologies” – if the iPhone is to succeed, the Blackberry must fail; if Netflix is to succeed, cable television must disappear; if Uber is to succeed, taxis must become an artifact. We overestimate the dominance of the new technology, leading to a bubble that eventually deflates when reality sets in.

This instinct – that new technology completely replaces the old – is misguided. What cemented this idea for me was learning more about GPUs (from Chris Miller’s Chip War) and the nascent quantum computing industry. Chip War gives you the general impression that GPUs are all that matter moving forward. In reality, GPUs handle specialized work (e.g. high-end computer graphics, LLM inference); computers will still rely on CPUs for the bulk of their processing. Likewise, quantum experts do not believe that quantum computing will replace CPUs; instead, quantum computers will handle only the hardest problems that CPUs and GPUs cannot solve today [2]. A future state-of-the-art computer will feature a central CPU, further enhanced by a GPU and a QPU (quantum processing unit). In other words, newer technology will still rely on – and build on – the strengths of older technologies.

The core tension with LLMs: 100% reliability at scale

This was a long way to get to the core tension with LLMs (and more broadly, “AI agents”) today:

  • LLMs are great at getting information from unstructured data sources (e.g. PDFs, Excel files, Word documents),
  • However, LLMs are hard to trust at scale, and
  • There is no easy fix to making LLMs more trustworthy.

Kunle Omojola is tackling this same problem from the healthcare billing world, and summarizes it well:

The more I see and the more I dig in, the clearer it is that the deterministic approach utilizing models for edge cases and self healing/auto healing beats the pure nondeterministic at enterprise scale … if there’s a model free alternative that is completely deterministic and the customer/end user expects a deterministic result, enterprise customers would choose that 10 times out of 10.

Through this lens, the LLM/non-LLM discussion is nothing more than the Pareto principle: LLMs do 80% of a task well, but in some cases, the last 20% (or last 5%) is what provides true value. And, as Kunle states, a standalone LLM may not be the best solution if the last 20% is what matters most.
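
To make Kunle’s point concrete, here is a minimal sketch (in Python, with hypothetical helper names like parse_with_rules and ask_llm) of the “deterministic first, model for edge cases” pattern: a rules-based parser handles every document it recognizes, and only the leftovers get routed to an LLM, whose answers are flagged for human review.

```python
# Sketch of "deterministic first, LLM only for edge cases."
# parse_with_rules() and ask_llm() are hypothetical stand-ins for
# whatever deterministic parser and model client you actually use.
from dataclasses import dataclass

@dataclass
class Extraction:
    value: float | None
    source: str          # "rules" or "llm"
    needs_review: bool   # LLM-sourced answers always get a human check

def parse_with_rules(text: str) -> float | None:
    """Deterministic path: a simple pattern we trust completely."""
    for line in text.splitlines():
        if line.lower().startswith("net irr:"):
            return float(line.split(":", 1)[1].strip().rstrip("%"))
    return None  # edge case: the layout we expected isn't there

def ask_llm(text: str) -> float | None:
    """Non-deterministic fallback (stubbed out here)."""
    return None  # placeholder: call your model of choice and parse its answer

def extract_net_irr(text: str) -> Extraction:
    value = parse_with_rules(text)
    if value is not None:
        return Extraction(value, source="rules", needs_review=False)
    return Extraction(ask_llm(text), source="llm", needs_review=True)
```

The design choice mirrors the quote above: the deterministic path is the default, and the model is an escape hatch rather than the engine.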

I’m approaching the problem from a slightly different lens, tackling the opportunities that investment offices face today. Some use cases I’ve heard about: (a) analyzing start-up or investment fund pitch decks and doing an initial pass of due diligence, (b) automatically extracting financials (e.g. investment returns) from start-ups’ and funds’ quarterly reports, and (c) searching through past start-up/fund documentation for patterns or changes. These are all flavors of the same problem: given a PDF, can the LLM extract insights? It feels like a foundational question in finance (or finance operations) today: we are surrounded by an excess of high-quality data but an inability to use most of it well.
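
To put some code behind use case (b), here is a minimal sketch using pdfplumber, a common open-source extraction library; the file name report.pdf and the downstream handling are assumptions. The point is that pulling the raw tables out of a PDF is a deterministic step that doesn’t need an LLM at all.

```python
# Minimal sketch: deterministically pull tables out of a quarterly
# report PDF before any LLM is involved. "report.pdf" is an assumed
# file name; pdfplumber is a real, commonly used extraction library.
import pdfplumber

tables = []
with pdfplumber.open("report.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            # Each table comes back as a list of rows (lists of cell strings).
            tables.append({"page": page_number, "rows": table})

for t in tables:
    print(f"page {t['page']}: {len(t['rows'])} rows, first row: {t['rows'][0]}")
```

What comes back is still messy (merged cells, footnotes, stray headers), which is exactly why the later steps – storage and analysis – deserve their own treatment.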

Tackling the reliability gap

The patience and discipline required to solve challenging problems with sustainable solutions is one of the things I've taken away from my eight years at Epic Systems. Over my tenure, I spent over 80 days onsite doing "floor support" (at-the-elbow training help for a hospital just starting to use our software) and over 110 "command center" days (centralized support making broad system changes) – hands-on experience that I would later learn is rare in the tech world. I'd always keep a list of the pain points users faced, which generally fell into a few buckets:

  1. Training issues. The system is "working as designed," but either (a) the user wasn't trained well or (b) the system's layout is unintuitive (or a combination of both). 
  2. Quick wins. Software as large and complex as Epic's always has a few simple developments that could save people a few clicks or make information available exactly when it's needed. Tens of papercuts can ruin the first impression of an otherwise great core product, so I'd bring these back to our development team (and sometimes just tackle a few of them myself).
  3. Customer requests that don't address the core need. One underlying philosophy at Epic is that the customer is not always right: they're asking for X, but that's really addressing Y, and so they really need Z. One example: pharmacies would ask for a couple of reporting columns to hack together a jury-rigged med sync program, and to their disappointment, we would tell them to hold out for the full-blown med sync development project. It was always hard for end-users (i.e. pharmacists) to see the best solution, because they didn't have a strong sense of the technical underpinnings.
  4. Simple requests that are extremely challenging to do well. The last bucket was projects that seemed simple on the surface but were actually extremely difficult to develop well (i.e. make it function well and architect it in a way that could be built upon easily). One example that fit this bucket was the ability to schedule a future fill (e.g. a patient calls and wants their prescription filled next week). It would've been easy to throw together a quick stop-gap solution, but we put our strongest developer on a hundreds-of-hours project to come up with a complete solution. It meant delaying the feature by a few releases (and undoubtedly frustrated a few customers), but I think this intentionality meant the code would have a better chance of standing the test of time.

It's through lenses 3 and 4 that I tackle the PDF/LLM problem. Lens 3: it's relatively easy to see that "we have piles of data that we're not fully using," but it's much harder for an investing professional to see the steps needed to address it. Lens 4: taking a slower, more deliberate technical approach will be much more painful now but will create a strong technical foundation (i.e. the complete antithesis of "vibe coding").

Building blocks for LLM-driven PDF analysis

With that, I'd break the generalized problem (PDF-driven analysis) into a few discrete steps:

  1. Extract text/tables from unstructured data,
  2. Store/retrieve the data, then
  3. Analyze the unstructured data.
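
Sketched as (heavily hedged) Python, the decomposition might look like the skeleton below. Every name here – extract_tables, the SQLite schema, the analyze stub – is a placeholder rather than a prescription; the point is the shape: deterministic extraction and a plain relational store do the heavy lifting, and the LLM only sees a small, well-scoped slice at the end.

```python
# Skeleton of the three-step pipeline. Everything is a placeholder:
# extract_tables() could wrap a tool like pdfplumber, and analyze()
# could wrap whatever LLM client you use. The separation of steps is
# the point, not any particular library.
import sqlite3

def extract_tables(pdf_path: str) -> list[dict]:
    """Step 1: deterministically extract text/tables from the PDF."""
    return []  # e.g. via pdfplumber, camelot, or a vendor API

def store(conn: sqlite3.Connection, doc_id: str, rows: list[dict]) -> None:
    """Step 2: store structured rows in an ordinary relational database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS financials "
        "(doc_id TEXT, metric TEXT, period TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO financials VALUES (?, ?, ?, ?)",
        [(doc_id, r["metric"], r["period"], r["value"]) for r in rows],
    )
    conn.commit()

def analyze(conn: sqlite3.Connection, question: str) -> str:
    """Step 3: only now involve the LLM, with a small, scoped context."""
    rows = conn.execute("SELECT metric, period, value FROM financials").fetchall()
    context = "\n".join(f"{m} ({p}): {v}" for m, p, v in rows)
    return f"{question}\n\nContext:\n{context}"  # the prompt you'd hand to a model

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store(conn, "fund-q1", extract_tables("q1_report.pdf"))
    print(analyze(conn, "What was net IRR this quarter?"))
```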

Naïve LLM implementations today try to use LLMs to tackle all three steps in one fell swoop. I’ve tried this before with middling success: give ChatGPT a PDF and ask it basic questions about it. It doesn’t do a great job. There are a couple of issues: (a) it’s asking too much of the LLM to do all three steps at once, and (b) even if it could, it would be hard to scale reliably across a large organization. The result: when some organizations test out LLMs, they find that LLMs add little value.

As I’ve not-so-subtly alluded to, the process needs to be broken into its core components and tackled individually. I’ll break this down into a few separate posts to get into the details of each step, but I’ll spoil the big takeaway: LLMs in their current iteration are not well-suited for all the tasks above. Indeed, the best version of the future will layer LLMs on top of deterministic algorithms and well-built databases – exactly what we’ll try to build up to in the next few write-ups.



[1] Gartner – of the famous Gartner Hype Cycle – put “prompt engineering,” “AI engineering,” and “foundation models” near the “peak of inflated expectations” in November 2024.

[2] Before learning more about quantum computing, I had originally thought that quantum computers would completely replace modern computers.
