A brief, incomplete history of LLMs
While machine learning is not new to the 2020s, the modern era of LLMs kicked off with the release of ChatGPT in November 2022. I learned about ChatGPT from my professor in an Advanced Computer Security and Privacy class; he was floored by what it could accomplish. The first use cases were novelty trinkets that were built to go viral on social media; I remember a poem about a toaster, then more from OpenAI and then the New York Times. Yet LLMs and generative AI still felt like they were in their exploratory phase, an easy-to-use technology looking for its footing in the real world.
ChatGPT was just the catalyst for the new LLM bubble, though. It was built on the transformer architecture introduced in the landmark paper "Attention Is All You Need," published in 2017 by Google researchers, a full five years before OpenAI released their app. Recounting the history raises more questions than answers – e.g. how did Google squander their lead in AI? Why did they not release an LLM before OpenAI?
Instead, I’d like to focus on the bigger picture: why did ChatGPT become so popular, what problems did it excel at, and where might it struggle? As builders begin to integrate LLMs, where’s the hype and where’s the real value?
ChatGPT: realized use cases
At a fundamental level, all generative models like ChatGPT try to do is predict the next word in a sequence. (For example: predict what word comes next: "The cat sat on the …") ChatGPT's transformer architecture differed from prior models in two significant ways: (1) it could process more words at once and decide which words to pay more "attention" to, and (2) its training – previously a bottleneck, because older architectures processed text sequentially – could be parallelized across GPUs, allowing for more efficient training on larger datasets. In short, ChatGPT had been trained on vastly more data with a better algorithm, leading to its efficacy and near-immediate popularity on release.
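To make the "next word" framing concrete, here is a minimal sketch (my own illustration using the small open-source GPT-2 model via the Hugging Face transformers library, not anything from ChatGPT itself) that scores candidate next tokens for the example prompt:

```python
# Toy illustration of next-token prediction (assumes `transformers` and `torch` are installed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Turn the logits at the final position into a probability distribution over the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")
```

Everything a chat model does, from answering questions to writing sonnets, comes from repeatedly sampling from a distribution like this one.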
To me, what made ChatGPT (and subsequent LLMs) so powerful:
- Ease of use. You interact with it using normal English, with an interface no more complicated than Google’s homepage.
- Broad (and novel) use cases. It could accomplish tasks previously hard for a computer, such as drafting good emails, composing poems, writing prose in the style of a famous author, and developing well-written code.
- Mostly accurate. It did pretty well on most things, but occasionally would say something overtly wrong, or would forget that it knew how to write a sonnet.
From these origins extended an unbridled optimism about LLMs. After all, if the LLMs could reason (or at least appear to reason [1]), at what point could AI build stronger AI than human computer scientists could, leading to superintelligence [2]? More straightforwardly: which jobs could a good LLM replace? Certainly, call center reps and chatbots, but could poets, journalists, writers, lawyers, consultants, and doctors be next?
Where LLMs struggle
It’s been 2.5 years since ChatGPT was released, and enthusiasm for LLMs has not waned. Instead, the environment has only become frothier – authentic innovators (people who understand a problem and solve it using the best technology for the job) are mixed in with “follow-on” entrepreneurs (people who prioritize adding “LLM” or “AI Agent” into their start-up pitch over solving a real problem).
Nevertheless, we’ve seen LLM adoption stall – there seems to be some invisible barrier to everyday adoption in the workplace. In a simplified tech-adoption world, there are only a few potential root causes: (a) the product is bad, (b) the product is good but doesn’t meet a customer need, or (c) the customer is poorly trained. Which is it?
To try to answer this, it’s important to see where and why LLMs still struggle:
- Accuracy. Models are sometimes slightly wrong, other times very wrong. For everyday use cases (e.g. planning a road trip, looking up a recipe), this isn’t a big deal, but for workplace use cases, you can’t have a computer program that makes mistakes like a summer intern.
- Reliability. Tied in with “accuracy,” most work use cases require something you can absolutely trust to work 99.9% of the time.
- False confidence. Whether the model is wrong or right, it may project the same level of confidence.
- Black box algorithm. The LLM’s algorithm is a black box: you feed it input, it gives you an output, but it’s hard to troubleshoot if something goes wrong.
- Data privacy concerns. Sending sensitive data (like confidential documents) to model providers like OpenAI may be against contractual agreements (especially if OpenAI decides to use the data later for model training).
To me, the answer is (b) – the product is fine, but the customer expects it to do more (be more accurate, be more reliable) than it currently can. These issues stem from the very nature of LLMs and machine learning in general. Traditional algorithms are deterministic: for any input, they return a single output. They’re predictable and reliable by design. Machine learning, however, is probabilistic: a single input can result in a range of outputs, some more probable than others. It’s what allows ChatGPT to create (what I consider to be) truly creative poems – but it also means it will (and does!) struggle with problems that have a single correct solution. The probabilistic nature of LLMs not only harms accuracy and reliability, but makes it almost impossible to know where the model went wrong when it inevitably does.
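A toy contrast (entirely my own illustration, not how any real model is implemented) makes the difference concrete:

```python
# Illustrative contrast: deterministic lookup vs. probabilistic generation
import random

def deterministic_lookup(table, key):
    # Same input always yields the same output -- predictable by design
    return table[key]

def toy_llm_next_word(context):
    # Stand-in for an LLM: sample the next word from a probability distribution,
    # so the same input can yield different outputs on different runs
    candidates = ["mat", "floor", "windowsill", "moon"]
    weights = [0.60, 0.25, 0.10, 0.05]
    return random.choices(candidates, weights=weights)[0]

capitals = {"France": "Paris", "Japan": "Tokyo"}
print(deterministic_lookup(capitals, "France"))                      # always "Paris"
print([toy_llm_next_word("The cat sat on the") for _ in range(5)])   # varies run to run
```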
A side story on where/how models can go wrong
Can LLMs replace grunt work? One task I wanted help on was finding the Employer Identification Number (EIN) for a list of private universities. The reason: you can look up a non-profit’s 990 filings with the EIN, so it serves as a critical bridge from a university to its federal financial filings.
This seemed like the perfect task for ChatGPT: I give it a spreadsheet of 600 universities and a simple task (“find the EIN for each university”), and let it get to work. Immediately it ran into problems. Some EINs were correct, but others were “understandably” inaccurate (same city/state, completely different entity), while others were wildly incorrect (a completely made-up EIN). It also refused to do more than about 40 universities at a time, and would try to “deceive” me, showing me 10 EINs it processed, asking if I wanted to see the rest, then failing to provide them. Or, it would let me download a file, but only 20 lines would be filled out. I tried different tricks to get it to give me all the information at once, but in the end, I had to manually feed it all 600 universities in chunks of 40 to get it to complete the task. (LLMs often have a “token” budget, allowing them to process only so many words at once; my theory was that the LLM had run out of budget for the larger queries.) I did finally get my list of 600 EINs – and only later discovered that most of them were wrong.
Working with ChatGPT felt more like working with an insolent intern than with a superintelligent being. Ultimately, I had zero trust in the work being accurate and ended up going through the list of universities one by one myself to ensure 100% accuracy. (So, like an intern, it proved to be less time-efficient than doing the work myself!)
The race to bridge the accuracy/reliability gap
I’m sure anyone who’s used ChatGPT or its competitors has had similar experiences – LLMs are mostly correct, sometimes wrong in a cute way, and sometimes extraordinarily wrong. They can save time on steps where a human verifies the output (e.g. writing emails or poems), but can they be trusted to act independently for “business-critical” needs? Can the probabilistic nature of LLMs be coerced into solving deterministic problems well? Can they reliably and accurately synthesize answers to questions?[3]
This is what I see as the next frontier in LLMs playing out today: making LLMs predictable and reliable.
The first wave was prompt engineering. Perhaps, the thinking went, the problem is not the LLM but us, the users. If only we asked the right questions in the right ways, we could convince the LLM to give us the right answers. Thus emerged the field of “prompt engineering,” built on the idea that well-written prompts elicit better answers. There is truth to this – more detailed, context-laden questions do get better answers – but it is not a panacea for the reliability crisis. Great prompt engineering alone cannot make LLMs perfect, but it’s a great (and easy-to-deploy) tool to have in the toolkit.
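As an illustration of what this looks like in practice (the OpenAI Python client, the model name, and "Example University" below are all stand-ins of my own choosing), compare a bare question with one that carries context, a required format, and an escape hatch for uncertainty:

```python
# Illustrative only: the same question asked with and without context/constraints
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vague_prompt = "Find the EIN for Example University."

engineered_prompt = (
    "You are helping reconcile IRS filings. Find the Employer Identification Number "
    "(EIN) for Example University, a private non-profit university in Springfield, MA. "
    "Answer in the format XX-XXXXXXX. If you are not certain, say 'UNKNOWN' instead of "
    "guessing, and explain what source you would need to confirm it."
)

for prompt in (vague_prompt, engineered_prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```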
Another idea emerged, targeting the data that LLMs were trained on. Base models (like the original GPT-3) are trained on reams of data trawled from the entire internet, but what if the internet doesn’t have enough contextual information for our use case? For example, if we want a healthcare-focused LLM, the base model may not have had enough exposure to realistic healthcare data. LLM developers could take a base model (e.g. GPT-3) and fine-tune it with domain-specific data, increasing accuracy. It takes a lot of work – the data must be labeled and accurate, and it takes time and compute to train the model – but it can result in a stronger underlying LLM. This approach seems to work well for domain-specific contexts, such as healthcare (Nuance/Microsoft, HippocraticAI), finance (BloombergGPT), and legal (Harvey).
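For a rough sense of the mechanics only (the small model, the hypothetical clinical_notes.txt file, and the hyperparameters are stand-ins, not what any of these companies actually does), a fine-tuning pass with Hugging Face Transformers might look like this sketch:

```python
# Minimal fine-tuning sketch (assumes `transformers`, `datasets`, and `torch` are installed)
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # small stand-in for a much larger base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# "clinical_notes.txt" is a hypothetical file of curated, domain-specific text
dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-llm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the fine-tuned weights land in ./domain-llm
```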
Re-training a model, though, can be costly, and while fine-tuned models might be more accurate, it was hard to guarantee a certain threshold of reliability. The line of thinking shifted: if we can’t get the LLM to reliably give us a correct answer, perhaps it could just point us to trusted documentation. We could provide the LLM with a source of truth – say, a folder of proprietary documents – and the LLM would respond with referenced evidence. This approach – retrieval-augmented generation (RAG) – required way less (expensive) training, while also grounding the model in some truths.
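Here is a bare-bones sketch of the idea, using my own toy example: three in-memory "policy documents," TF-IDF similarity standing in for a real embedding index, and the final LLM call left as a stub:

```python
# Minimal RAG sketch (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Policy 12: Confidential documents may not leave the internal network.",
    "Policy 47: Travel expenses above $500 require VP approval.",
    "Policy 63: All vendor contracts are reviewed annually in Q3.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question, k=2):
    # Rank documents by similarity to the question and return the top k
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(question):
    # Ground the model: only the retrieved sources may be used in the answer
    context = "\n".join(retrieve(question))
    return (f"Answer using ONLY the sources below; cite the policy number.\n"
            f"Sources:\n{context}\n\nQuestion: {question}")

# In practice this prompt would be sent to an LLM; here we just print it
print(build_prompt("Who needs to approve a $900 flight?"))
```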
The RAG approach seems to be a useful way to close the reliability gap. Implementation-wise, it’s lightweight and easy to update (although there is plenty of complexity in implementing it well). Usability-wise, it’s simple and explainable and replicates how we humans have been taught to think. It still has its shortcomings, such as reliably fetching data stored in tables and producing completely hallucination-free responses, but RAG seems like a simple and useful architecture that will stand the test of time.
I’ve also seen more technical areas of computer science research looking to dissect the black box powering the LLM. One area of research I found fascinating is explainable AI, a pursuit to understand the weights and algorithms behind LLMs. For example, are there certain words that strongly influence outputs? How consistent are an LLM’s responses, subject to slight perturbations in the input? And can you force the LLM to explain its reasoning through “chain of thought” (or understand how it reasons by looking at the internal weights)? This research feels largely theoretical and impractical today, but it provides direction on what the future of LLMs might look like.
This brings us to today’s AI agents (a term that is a brilliant piece of marketing). At its core, the key insight is that LLMs can perform better on complex tasks if the problem is broken into simpler components. For example, asking an LLM for a quantitative analysis of a stock might be too complex and open-ended for it to do well, whereas asking it to (1) write SQL code to retrieve certain financial data and then (2) analyze that data has a much higher probability of success. This approach also lets us see at which step the LLM went wrong – i.e. better explainability, higher reliability. I think this approach has staying power too – as long as the core components behind each step are reliable. As with all new tech, reliability is paramount.
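A rough sketch of that decomposition (the ask_llm and run_sql helpers, the table schema, and the prompts are hypothetical placeholders, not a real agent framework):

```python
# Illustrative decomposition of "analyze this stock" into narrower, checkable steps
def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API; returns a canned string here."""
    return f"[LLM response to: {prompt[:40]}...]"

def run_sql(query: str) -> list:
    """Placeholder for executing SQL against a trusted financial database."""
    return [("2024-Q4", 1_250_000, 210_000)]  # dummy (quarter, revenue, net_income) row

def analyze_stock(ticker: str) -> str:
    # Step 1: a narrow task with a verifiable output -- write the SQL
    sql = ask_llm(
        f"Write a SQL query against quarterly_financials(ticker, quarter, revenue, "
        f"net_income) returning the last 8 quarters for {ticker}."
    )
    # Step 2: deterministic execution against data we already trust
    rows = run_sql(sql)
    # Step 3: another narrow task -- interpret the retrieved numbers
    return ask_llm(f"Summarize the revenue and margin trends in this data: {rows}")

print(analyze_stock("XYZ"))
```

If step 2 fails, we know exactly which link in the chain broke, which is the explainability benefit described above.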
The compounding effect of technology
My latest hands-on exploration of LLMs has made me see them for what they are: a powerful tool that shines at some things but falls short of the reliability threshold needed for many real-world applications. There is simply too much noise, unpredictability, and “hallucination” for them to be 100% reliable where 100% reliability is needed. This reliability gap will narrow, but will it ever fully close? I don’t think so. To me, it comes back to the probabilistic foundation that LLMs were built on. They were not built to be 100% reliable, and trying to cajole them into being something they’re not is a fool’s errand.
Instead, for me, the latest technology shines a light on older, battle-tested technologies. In the finance and investing world, this means structured databases with automated data pipelines, filled with data you can trust 100%. An investing office with a strong, well-structured database can then benefit from building LLM applications on top of it to query that data. Without that foundation, LLMs can quickly become unmoored.
Through this lens, building a technology stack mirrors the principles of long-term value investing. The core portfolio – or the core technology infrastructure – should be high-conviction, something we’re willing to hold onto for the next decade. Newer technologies (like LLMs) are the venture-like investments in the technology stack; they have the potential to grossly outperform, but can also fizzle out quickly. Nevertheless, the core technology stack must be built on a strong foundation – a strong internal database, strong data pipeline capabilities. Only with this can the newer tools (like LLMs) truly shine.
[1] This also raises existential questions about what it means to be human, in general. We’ve all met enough people who merely appear to reason, so are “dumb” LLMs really any different than humans?
[2] There’s a whole separate strain of research around the risks of AI superintelligence; see Superintelligence: Paths, Dangers, Strategies by Nick Bostrom (2014) for one foundational apocalyptic take on AI.
[3] In this section, I focus on the “answer retrieval” use case for LLMs. Another excellent use case that I ignore here is summarizing text. Summarizing large bodies of text is one of the tasks LLMs were trained on early on. (Essentially, the LLM is “translating” a larger block of text into a smaller block of text.) One example I remember is early LLMs summarizing an entire book by first summarizing each chapter, then summarizing all the chapter summaries – in a way that matched human-generated summaries. All this to say: summarization is an area I think LLMs excel at (and in most use cases, summaries can tolerate slight imperfections).