Thursday, June 26, 2025

Can Computers Read PDFs?: Overview

LLM recap

VCs readily acknowledge that we’re in an AI bubble [1], which naturally evokes questions about where LLMs do well – and where they will continue to struggle. I wrote about my perspective; my main takeaways:

  • LLMs are a powerful but overutilized tool. Their responses feel like magic, leading to their meteoric rise. As a result, people try to use LLMs for nearly everything.
  • However, LLMs still struggle with accuracy and reliability. My (hyperbolic) take: these are features, not bugs. The non-deterministic nature of LLMs that makes them so good at conversation also makes them bad at being 100% accurate.
  • We will discover that we still need older but reliable technologies. LLMs will need to co-exist with – not supplant – strong data pipelines, traditional relational databases, etc.

I don’t think these statements are terribly controversial, but they are worth saying explicitly because we’ve grown up with the idea that a new technology must wipe out the old. Our collective imagination has become enamored with sexy stories of “disruptive technologies” – if the iPhone is to succeed, the Blackberry must fail; if Netflix is to succeed, cable television must disappear; if Uber is to succeed, taxis must become an artifact. We overestimate the dominance of the new technology, leading to a bubble that eventually deflates when reality sets in.

This instinct – that new technology completely replaces the old – is misguided. What cemented this idea for me was learning more about GPUs (from Chris Miller’s Chip War) and the nascent quantum computing industry. Chip War gives you the general impression that GPUs are all that matter moving forward. In reality, GPUs handle specialized work (e.g. high-end computer graphics, LLM inference); computers will still rely on CPUs for the bulk of their processing. Likewise, quantum experts do not believe that quantum computing will replace CPUs; instead, quantum computers will handle only the hardest problems that CPUs and GPUs cannot solve today [2]. A future state-of-the-art computer will feature a central CPU, further enhanced by a GPU and a QPU (quantum processing unit). In other words, newer technology will still rely on – and build on – the strengths of older technologies.

The core tension with LLMs: 100% reliability at scale

This was a long way to get to the core tension with LLMs (and more broadly, “AI agents”) today:

  • LLMs are great at getting information from unstructured data sources (e.g. PDFs, Excel files, Word documents),
  • However, LLMs are hard to trust at scale, and
  • There is no easy fix to making LLMs more trustworthy.

Kunle Omojola is tackling this same problem from the healthcare billing world, and summarizes it well:

The more I see and the more I dig in, the clearer it is that the deterministic approach utilizing models for edge cases and self healing/auto healing beats the pure nondeterministic at enterprise scale … if there’s a model free alternative that is completely deterministic and the customer/end user expects a deterministic result, enterprise customers would choose that 10 times out of 10.

Through this lens, the LLM/non-LLM discussion is nothing more than the Pareto principle: LLMs do 80% of a task well, but in some cases, the last 20% (or last 5%) is what provides true value. And, as Kunle states, a standalone LLM may not be the best solution if the last 20% is what matters most.

I’m approaching the problem through a slightly different lens, tackling the opportunities that investment offices face today. Some use cases I’ve heard about: (a) analyzing start-up or investment fund pitch decks and doing an initial pass of due diligence, (b) automatically extracting financials (e.g. investment returns) from start-up/fund quarterly reports, and (c) searching through past start-up/fund documentation for patterns or changes. These are all flavors of the same problem: given a PDF, can the LLM extract insights? It feels like a foundational question in finance (or finance operations) today: we are surrounded by an excess of high-quality data but an inability to use most of it well.

Tackling the reliability gap

The patience and virtue required to solve challenging problems with sustainable solutions is one of the things I've taken away from my eight years at Epic Systems. Over my tenure, I spent over 80 days onsite doing "floor support" (at-the-elbow training help for a hospital just starting to use our software) and over 110 "command center" days (centralized support making broad system changes) – hands-on experience that I would later learn is rare in the tech world. I'd always keep a list of pain points users faced, which would generally fall into a few buckets:

  1. Training issues. The system is "working as designed," but either (a) the user wasn't trained well or (b) the system's layout is unintuitive (or a combination of both). 
  2. Quick wins. Software as large and complex as Epic's always has a few simple developments that could save people a few clicks or make information available exactly when it's needed. Dozens of papercuts can ruin the first impressions of an otherwise great core product, so I'd bring these back to our development team (and sometimes just tackle a few of them myself).
  3. Customer requests that don't address the core need. One underlying philosophy at Epic is that the customer is not always right: they're asking for X, but the underlying problem is Y, so what they really need is Z. One example: pharmacies would ask for a couple of reporting columns to hack together a jury-rigged med sync program, and to their disappointment, we would tell them to hold out for the full-blown med sync development project. It was always hard for end users (i.e. pharmacists) to see the best solution, because they didn't have a strong sense of the technical underpinnings.
  4. Simple requests that are extremely challenging to do well. The last bucket was projects that seem simple on the surface but are actually extremely difficult to develop well (i.e. to make them function well and architect them in a way that can be built upon easily). One example that fit this bucket was the ability to schedule a future fill (e.g. a patient calls and wants their prescription filled next week). It would've been easy to throw together a quick stop-gap solution, but we put our strongest developer on a hundreds-of-hours project to come up with a complete solution. It meant delaying the feature by a few releases (and undoubtedly frustrated a few customers), but I think this intentionality meant the code would have a better chance of standing the test of time.

It's through lenses 3 and 4 that I tackle the PDF/LLM problem. Lens 3: it's relatively easy to see that "we have piles of data that we're not fully using," but it's much harder for an investing professional to see the steps needed to address it. Lens 4: taking a slower, more deliberate technical approach will be much more painful now but will create a strong technical foundation (i.e. the complete antithesis of "vibe coding").

Building blocks for LLM-driven PDF analysis

With that, I'd break the generalized problem (PDF-driven analysis) into a few discrete steps:

  1. Extract text/tables from unstructured data,
  2. Store/retrieve the data, then
  3. Analyze the unstructured data.

Naïve LLM implementations today try to use LLMs to tackle all three steps in one fell swoop. I’ve tried this before with middling success: give ChatGPT a PDF and ask it basic questions. It doesn’t do a great job. There are a couple of issues: (a) it’s asking too much of the LLM to do all three steps at once, and (b) even if it could, it would be hard to scale reliably across a large organization. The result: when some organizations test out LLMs, they find that LLMs add little value.

As I’ve not-so-subtly alluded to, the process needs to be broken into its core components and tackled individually. I’ll break this down into a few separate posts to get into the details of each step, but I’ll spoil the big takeaway: LLMs in their current iteration are not well-suited for all the tasks above. Indeed, the best version of the future will layer LLMs on top of deterministic algorithms and well-built databases – exactly what we’ll try to build up to in the next few write-ups.
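
To make that concrete, here’s a rough sketch of what the three steps might look like as separate, testable components. It assumes the pdfplumber library for extraction and SQLite for storage, and uses a placeholder call_llm() function standing in for whatever model provider you prefer – the point is the separation of concerns, not the specific tools.

    # A sketch of the three-step decomposition: deterministic extraction and
    # storage first, with the LLM involved only at the final analysis step.
    import sqlite3
    import pdfplumber  # assumed PDF-extraction library

    def extract_text(pdf_path: str) -> str:
        # Step 1: pull raw text out of the PDF deterministically.
        with pdfplumber.open(pdf_path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    def store_document(conn: sqlite3.Connection, name: str, text: str) -> None:
        # Step 2: persist the extracted text in a traditional relational database.
        conn.execute("CREATE TABLE IF NOT EXISTS documents (name TEXT, body TEXT)")
        conn.execute("INSERT INTO documents VALUES (?, ?)", (name, text))
        conn.commit()

    def analyze(conn: sqlite3.Connection, name: str, question: str) -> str:
        # Step 3: hand only the stored, verifiable text to the LLM.
        row = conn.execute("SELECT body FROM documents WHERE name = ?", (name,)).fetchone()
        return call_llm(f"Using only this document:\n{row[0]}\n\nAnswer: {question}")

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("placeholder for your LLM provider of choice")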



[1] Gartner – of the famous Gartner Hype Cycle – placed “prompt engineering,” “AI engineering,” and “foundation models” near the “peak of inflated expectations” in November 2024.

[2] Before learning more about quantum computing, I had originally thought that quantum computers would completely replace modern computers.

Wednesday, June 18, 2025

LLMs: Technological Savior or the Next Bygone Trend?

A brief, incomplete history of LLMs

While machine learning is not new to the 2020s, the modern era of LLMs kicked off with the release of ChatGPT in November 2022. I learned about ChatGPT from my professor in an Advanced Computer Security and Privacy class; he was floored by what it could accomplish. The first use cases were novelty trinkets that were built to go viral on social media; I remember a poem about a toaster, then more from OpenAI and then the New York Times. Yet LLMs and generative AI still felt like they were in their exploratory phase, an easy-to-use technology looking for its footing in the real world.

ChatGPT was just the catalyst for the new LLM bubble, though. It was built on the landmark paper, “Attention Is All You Need,” published in 2017 by Google researchers – a full five years before OpenAI released its app. Recounting the history raises more questions than it answers – e.g. how did Google squander its lead in AI? Why did it not release an LLM before OpenAI?

Instead, I’d like to focus on the bigger picture: why did ChatGPT become so popular, what problems did it excel at, and where might it struggle? As builders begin to integrate LLMs, where’s the hype and where’s the real value?

ChatGPT: realized use cases 

At a fundamental level, all that generative models like ChatGPT try to do is predict the next word in a sequence. (For example: what word comes next in “The cat sat on the …”?) ChatGPT’s transformer architecture differed from prior models in two significant ways: (1) it was able to process more words at once and decide which words to pay more “attention” to, and (2) training the models – which was previously a bottleneck – could be sped up significantly by GPUs, allowing for more efficient training on larger datasets. In short, ChatGPT had been trained on vastly more data with a better algorithm, leading to its efficacy and near-immediate popularity on release.
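
As a toy illustration of the “predict the next word” objective (not how ChatGPT itself is served – just the underlying idea), here’s a short sketch that uses the small, open-source GPT-2 model from the Hugging Face transformers library to pick the most likely next token for that same sentence:

    # Toy next-token prediction with GPT-2 (a small, older model used here
    # purely to illustrate the "predict the next word" objective).
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("The cat sat on the", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # scores over the whole vocabulary

    next_token_id = int(logits[0, -1].argmax())  # greedy: take the single most likely token
    print(tokenizer.decode([next_token_id]))     # prints a plausible continuation, e.g. " floor"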

To me, what made ChatGPT (and subsequent LLMs) so powerful:

  1. Ease of use. You interact with it using normal English, with an interface no more complicated than Google’s homepage.
  2. Broad (and novel) use cases. It could accomplish tasks previously hard for a computer, such as drafting good emails, composing poems, writing prose in the style of a famous author, and developing well-written code.
  3. Mostly accurate. It did pretty well on most things, but occasionally would say something overtly wrong, or would forget that it knew how to write a sonnet.

From these origins extended an unbridled optimism about LLMs. After all, if the LLMs could reason (or at least appear to reason [1]), at what point could AI build stronger AI than human computer scientists could, leading to superintelligence [2]? More straightforwardly: which jobs could a good LLM replace? Certainly, call center reps and chatbots, but could poets, journalists, writers, lawyers, consultants, and doctors be next?

Where LLMs struggle

It’s been 2.5 years since ChatGPT was released, and enthusiasm for LLMs has not waned. Instead, the environment has only become frothier – authentic innovators (people who understand a problem and solve it using the best technology for the job) are mixed in with “follow-on” entrepreneurs (people who prioritize adding “LLM” or “AI Agent” into their start-up pitch over solving a real problem).

Nevertheless, we’ve seen LLM adoption stall – there seems to be some invisible barrier to everyday adoption in the workplace. In a simplified tech adoption world, there are only a few potential root causes: (a) the product is bad, (b) the product is good but doesn’t meet a customer need, or (c) the customer is poorly trained. Which is it?

To try to answer this, it’s important to see where and why LLMs still struggle:

  1. Accuracy. Models are sometimes slightly wrong, other times very wrong. For everyday use cases (e.g. planning a road trip, looking up a recipe), this isn’t a big deal, but for work use cases, you can’t have a computer program that makes mistakes like a summer intern.
  2. Reliability. Tied in with “accuracy,” most work use cases require something you can absolutely trust to work 99.9% of the time.
  3. False confidence. Whether the model is wrong or right, it may project the same level of confidence.
  4. Black box algorithm. The LLM’s algorithm is a black box: you feed it input, it gives you an output, but it’s hard to troubleshoot if something goes wrong.
  5. Data privacy concerns. Sending sensitive data (like confidential documents) to model providers like OpenAI may be against contractual agreements (especially if OpenAI decides to use the data later for model training).

To me, the answer is (b) – the product is fine, but the customer expects it to do more (be more accurate, be more reliable) than it currently can. These issues stem from the very nature of LLMs and machine learning in general. Traditional algorithms are deterministic: for any given input, they return a single output. They’re predictable and reliable by design. Machine learning, however, is probabilistic: a single input can result in a range of outputs, some more probable than others. It’s what allows ChatGPT to create (what I consider to be) truly creative poems – but it also means it will (and does!) struggle with problems that have a single correct solution. The probabilistic nature of LLMs not only harms accuracy and reliability, but makes it almost impossible to know where the model went wrong when it inevitably does.
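
The contrast is easy to see in a toy example: a deterministic function returns the same answer every single time, while a sampling-based model – here, a made-up next-word distribution standing in for an LLM – can return different answers on different runs:

    import random

    # Deterministic: the same input always maps to the same output.
    def add_tax(amount: float, rate: float = 0.05) -> float:
        return round(amount * (1 + rate), 2)

    assert add_tax(100.0) == add_tax(100.0)  # always 105.0

    # Probabilistic (a toy stand-in for an LLM): sampling over next-word
    # probabilities means repeated runs can disagree.
    next_word_probs = {"mat": 0.6, "floor": 0.3, "moon": 0.1}  # made-up numbers

    def sample_next_word() -> str:
        words, weights = zip(*next_word_probs.items())
        return random.choices(words, weights=weights, k=1)[0]

    print([sample_next_word() for _ in range(5)])  # e.g. ['mat', 'mat', 'floor', 'mat', 'moon']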

A side story on where/how models can go wrong

Can LLMs replace grunt work? One task I wanted help on was finding the Employer Identification Number (EIN) for a list of private universities. The reason: you can look up a non-profit’s 990 filings with the EIN, so it serves as a critical bridge from a university to its federal financial filings.

This seemed like the perfect task for ChatGPT: I give it a spreadsheet of 600 universities and a simple task (“find the EIN for each university”), and let it get to work. Immediately it ran into problems. Some EINs were correct, but others were “understandably” inaccurate (same city/state, completely different entity), while others were wildly incorrect (a completely made-up EIN). It also refused to do more than about 40 universities at a time, and would try to “deceive” me – showing me 10 EINs it had processed, asking if I wanted to see the rest, then failing to provide them. Or it would let me download a file, but only 20 lines would be filled out. I tried different tricks to get it to give me all the information at once, but in the end, I had to manually feed it all 600 universities in chunks of 40 to get it to complete the task. (LLMs often have a “token” budget, only allowing them to process so many words at once; my theory was that the LLM had run out of budget for the larger queries.) I did finally get my list of 600 EINs – and only later discovered that most of them were wrong.

Working with ChatGPT felt like working with an insolent intern more than a superintelligent being. Ultimately, I had zero trust in the work being accurate and ended up going through the list of universities one by one myself to ensure 100% accuracy. (So, like an intern, it proved to be less time-efficient than doing the work myself!)

The race to bridge the accuracy/reliability gap

I’m sure anyone who’s used ChatGPT or its competitors has had similar experiences – LLMs are mostly correct, sometimes wrong but in a cute way, and sometimes extraordinarily wrong. They can save time on steps where a human verifies the outputs (e.g. writing emails or poems), but can they be trusted to act independently for “business-critical” needs? Can the probabilistic nature of LLMs be coerced into solving deterministic problems well? Can they reliably and accurately synthesize answers to questions? [3]

This is what I see as the next frontier in LLMs playing out today: making LLMs predictable and reliable.

The first wave was prompt engineering. Perhaps, the thinking went, the problem is not the LLM but us, the users. If only we asked the right questions in the right ways, we could convince the LLM to give us the right answers. Thus emerged the field of “prompt engineering,” with the idea that well-written prompts can elicit better answers. There is truth to this – more detailed, context-laden questions get better answers – but it is not a panacea for the reliability crisis. Prompt engineering alone cannot make LLMs perfect, but it’s a useful (and easy-to-deploy) tool to have in the toolkit.
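
For illustration, here’s the same (hypothetical) request phrased two ways; the second adds a role, context, constraints, and an explicit instruction not to guess – the kind of detail that tends to improve answers without guaranteeing them:

    # Illustrative only: a vague prompt versus a context-laden one.
    vague_prompt = "Summarize this quarterly report."

    detailed_prompt = (
        "You are an analyst at an investment office reviewing a private fund. "
        "Summarize the attached Q3 quarterly report in five bullet points, covering: "
        "(1) reported net returns, (2) new investments, and (3) notable write-downs or risks. "
        "If a figure is not stated in the report, write 'not reported' rather than guessing."
    )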

Another idea emerged, targeting the data that LLMs were trained on. Base models (like the original GPT-3) are trained on reams of data trawled from the entire internet, but what if the internet didn’t have enough contextual information for our use case? For example, if we want a healthcare-focused LLM, perhaps the base model didn’t have enough exposure to realistic healthcare data. LLM developers could take a base model (e.g. GPT-3) and fine-tune it with domain-specific data, increasing accuracy. It takes a lot of work – the data must be labeled and accurate, and it takes time and compute to train the model – but it can result in a stronger underlying LLM. This approach seems to work well for domain-specific contexts, such as healthcare (Nuance/Microsoft, HippocraticAI), finance (BloombergGPT), and legal (Harvey).
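
In code, a bare-bones fine-tuning pass might look something like the sketch below, which continues training a small open-source model (GPT-2, used here only as a stand-in) on a couple of made-up domain snippets via the Hugging Face Trainer; a real effort would involve far more curated data, evaluation, and compute:

    # Minimal causal-LM fine-tuning sketch with Hugging Face transformers/datasets.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    # Hypothetical domain-specific snippets; a real project needs thousands of
    # curated, de-identified documents.
    texts = ["Patient presents with shortness of breath ...",
             "Assessment and plan: continue current therapy ..."]

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    dataset = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="domain-gpt2", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()  # checkpoints are written to ./domain-gpt2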

Re-training a model, though, can be costly, and while fine-tuned models might be more accurate, it is still hard to guarantee a certain threshold of reliability. The line of thinking shifted: if we can’t get the LLM to reliably give us a correct answer, perhaps it could just point us to trusted documentation. We could provide the LLM with a source of truth – say, a folder of proprietary documents – and the LLM would respond with referenced evidence. This approach – retrieval-augmented generation (RAG) – requires far less (expensive) training, while also grounding the model in some truths.
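
A bare-bones version of the idea – assuming scikit-learn for retrieval and a placeholder call_llm() for generation; production systems typically use embedding models and a vector database rather than TF-IDF – might look like this:

    # Minimal retrieval-augmented generation sketch: retrieve the most relevant
    # snippets first, then ask the LLM to answer only from those snippets.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [  # stand-in corpus; in practice, text extracted from your own files
        "Q3 letter: the fund made two new investments and one partial exit.",
        "Capital call notice: the next call is due in mid-July.",
        "Annual report: the management fee stepped down this year.",
    ]

    def retrieve(question: str, k: int = 2) -> list[str]:
        vectorizer = TfidfVectorizer()
        doc_vectors = vectorizer.fit_transform(documents)
        scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
        return [documents[i] for i in scores.argsort()[::-1][:k]]

    def answer(question: str) -> str:
        context = "\n".join(retrieve(question))
        prompt = ("Answer the question using ONLY the sources below, and cite "
                  f"the source you used.\n\nSources:\n{context}\n\nQuestion: {question}")
        return call_llm(prompt)

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("placeholder for your LLM provider of choice")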

The RAG approach seems to be a useful way to close the reliability gap. Implementation-wise, it’s lightweight and easy to update (although there is plenty of complexity in implementing it well). Usability-wise, it’s simple and explainable and replicates how we humans have been taught to think. It still has its shortcomings, such as reliably fetching data stored in tables and providing completely hallucination-free responses, but RAG seems like a simple and useful architecture that will stand the test of time.

I’ve also seen more technical areas of computer science research looking to dissect the black box powering the LLM. One area of research I found fascinating was explainable AI, a pursuit to understand the weights and algorithms behind LLMs. For example, are there certain words that strongly influence outputs? How consistent are an LLM’s responses when the input is slightly perturbed? And can you force the LLM to explain its reasoning through “chain of thought” (or understand how it reasons by looking at the internal weights)? This research feels largely theoretical and impractical today, but it provides direction on what the future of LLMs might look like.
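
Even without peering inside the model, you can run a crude version of these experiments yourself – for example, asking the same question with slight perturbations and checking whether the answers agree (call_llm() is again a placeholder):

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("placeholder for your LLM provider of choice")

    # A crude consistency probe: slight rewordings of the same question should,
    # ideally, produce the same answer.
    questions = [
        "What was the fund's net return last quarter?",
        "What was the fund's net return last quarter ?",   # extra space
        "Last quarter, what was the fund's net return?",   # reordered
    ]

    answers = [call_llm(q) for q in questions]
    print("consistent:", len(set(answers)) == 1)  # a perfectly consistent model prints True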

This brings us to today’s AI agents (a term that is a brilliant piece of marketing). At its core, the key insight was that LLMs could perform better on complex tasks if we help them break the problem into simpler components. For example, asking an LLM for a quantitative analysis of a stock might be too complex and open-ended for it to do well, whereas asking it to (1) write SQL code to retrieve certain financial data and then (2) analyze that data has a much higher probability of success. This approach also lets us better understand at which steps the LLM went wrong – i.e. better explainability, higher reliability. I think this approach has staying power, too – as long as the core components behind each step are reliable. As with all new tech, reliability is paramount.
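
A stripped-down sketch of that two-step pattern – with a placeholder call_llm() and an assumed prices table – might look like the following; the LLM proposes the SQL, but the database (not the LLM) does the arithmetic:

    # Decomposed "agent" sketch: the LLM writes SQL, a deterministic database runs
    # it, and the LLM only interprets the verified results. In a real system you
    # would validate the generated SQL before executing it.
    import sqlite3

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("placeholder for your LLM provider of choice")

    def quantitative_question(conn: sqlite3.Connection, question: str) -> str:
        # Step 1: ask the LLM for a query against a known, documented schema.
        sql = call_llm(
            "Write one SQLite query over the table "
            "prices(ticker TEXT, date TEXT, close REAL) that answers: " + question)
        # Step 2: the database does the retrieval and arithmetic deterministically.
        rows = conn.execute(sql).fetchall()
        # Step 3: the LLM turns the verified rows into prose.
        return call_llm(f"Question: {question}\nQuery results: {rows}\nSummarize the answer.")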

The compounding effect of technology

My latest hands-on explorations of LLMs have made me see LLMs for what they are: powerful tools that shine at some things but fall short of the reliability threshold needed for many real-world applications. There is simply too much noise, unpredictability, and “hallucination” to be 100% reliable where 100% reliability is needed. This reliability gap will narrow, but will it ever fully close? I don’t think so. To me, it comes back to the probabilistic foundation that LLMs were built on. They were not built to be 100% reliable, and trying to cajole them into being something they’re not is a fool’s errand.

Instead, for me, the latest technology shines a light on older, time-tested technologies. In the finance and investing world, this means structured databases fed by automated data pipelines, filled with data you can trust 100%. An investing office with a strong, well-structured database can then benefit from building LLM applications on top of it. Without that foundation, LLMs can quickly become unmoored.

Through this lens, building a technology stack mirrors the principles of long-term value investing. The core portfolio – or the core technology infrastructure – should be high-conviction, something we’re willing to hold onto for the next decade. Newer technologies (like LLMs) are the venture-like investments in the technology stack; they have the potential to grossly outperform, but can also fizzle out quickly. Nevertheless, the core technology stack must be built on a strong foundation – a strong internal database, strong data pipeline capabilities. Only with this can the newer tools (like LLMs) truly shine.


[1] This also raises existential questions about what it means to be human, in general. We’ve all met enough people who merely appear to reason, so are “dumb” LLMs really any different than humans?

[2] There’s a whole separate strain of research around the risks of AI superintelligence; see Superintelligence: Paths, Dangers, Strategies by Nick Bostrom (2014) for one foundational apocalyptic take on AI.

[3] In this section, I focus on the “answer retrieval” use case for LLMs. Another excellent use case that I ignore here is summarizing text. Some of the use cases that LLMs were originally trained on are summarizing large bodies of text. (Essentially, the LLM is “translating” a larger block of text into a smaller block of text.) One example I remember is early LLMs summarizing an entire book by first summarizing each chapter, then summarizing all the chapter summaries – in a way that matched human-generated summaries. All this to say: summarizing is an area I think LLMs excel at (and in most use cases, summaries can have slight imperfections).
