PROBLEM SUMMARY [1]
Companies are awash in information – both public and private – but have a hard time interacting with it all effectively. Retrieval-augmented generation (RAG) has been proposed as a way to organize and chat with private data, but there are a few key challenges:
Problem #1: User interface and integration
Most RAG systems’ user interface consists of a chat box, which “feels” natural and infinite for the end user. However, I think this only captures one use case, the “inquisitive” mode. It’s less effective with a more structured approach (e.g. you have a “research framework”), or in cases where you’re not sure what to ask. Sometimes you need a 45-minute lecture to jostle the questions out of your brain.
A bit of a tangent, but I think this is something start-ups today are not doing well: most are focused on specific pieces of the end-to-end chain (e.g. UnstructuredIO) or on spinning up a generalized business model (e.g. RAG-as-a-service).
Problem #2: Ability to prioritize sources
Most knowledge management systems don’t have an easy way to prioritize some voices over others. From an investment research lens: I’d likely prioritize my own firm’s investment memo over another firm’s, I’d value the financials in the 10-K over those from another source, and I’d trust an article from the Financial Times over a clickbait-y Business Insider one. This level of discernment is a critical part of the knowledge aggregation process, but it isn’t an option in most software.
Problem #3: Ingesting data files effectively
I’ve alluded to this in my personal fight with ingesting PDFs, but taking “unstructured” documents (like PDFs that have tables, graphs, images, etc.) and converting them into “structured” text is a challenging task with a long tail of corner cases. Reddit is replete with start-ups trying to tackle this problem, without one clear winner. A couple of common scenarios that are challenging: tables that don’t have borders (very common! [2]) and graphs.
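To make the borderless-table case concrete, here is a minimal sketch using pdfplumber (the library from footnote [2]). By default pdfplumber looks for drawn lines; the settings below ask it to infer columns and rows from text alignment instead. The function name `extract_borderless_tables` and the `page_number` parameter are my own placeholders, and in practice the text-based strategy tends to need tuning per document, so treat this as a starting point rather than a fix.

```python
# Table-finder settings that infer cell boundaries from text alignment
# instead of drawn lines (pdfplumber's default strategy is "lines").
TEXT_STRATEGY = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_borderless_tables(pdf_path, page_number=0):
    """Return the tables pdfplumber finds on one page using text alignment.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Each table comes back as a list of rows, each row a list of cell strings.
        return page.extract_tables(table_settings=TEXT_STRATEGY)
```

Calling `extract_borderless_tables("some_report.pdf")` (with a path of your own) returns a list of tables; comparing that against the default line-based settings on the same page is a quick way to see how much the missing borders matter.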
Problem #4: Privacy and data sovereignty
Unsurprisingly, many companies have private data that they want to keep private – but start-up vendors want some sort of ownership of the data. For example, LLM vendors (like OpenAI, the maker of ChatGPT) have public APIs with promises not to use your private data (which is hard to believe coming from companies that (a) are running out of publicly available data and (b) have a profit motive to use your data). Most RAG vendors want you to house your documents on their servers (which can be architected to be virtually private). And industry-specific vendors (such as in investment research) have a vested interest in looking at your data, if only to aggregate the results later.
This problem seems identical to the one faced in healthcare software over who owns the patients’ records. Epic, one of the largest healthcare software companies, has stood firm that hospitals own the patient data, not Epic. It means one fewer line on the income statement (bad for short-term profit), but builds trust with the hospitals we’ve worked with (good for long-term revenue). Most healthcare start-ups today look for ways to monetize the data, a tendency I could see playing out in the LLM/RAG space, too.
Problem #5: Cost and vendor lock-in
My cynical take on the start-up software world is that it (a) finds product-market fit by solving a need, (b) makes itself “sticky” in some way so that (c) it can jack up the prices later. As a software vendor, “stickiness” is a key feature that can later justify price increases, but from a customer’s perspective, this looks more like “lock-in” that holds us hostage to a vendor that we might come to despise.
(Again, these issues parallel the healthcare software world. I know clinics that are locked into multi-year contracts with a health record system that they hate, and I work with hospital systems that spend hundreds of thousands of dollars – or more! – to switch from a legacy system to Epic.)
SOLUTION
Version 1.0
My plan is to create an online research web app that begins to address Problems #1 and #2. Initial features:
- User interface
  - Investment framework: when you search for a company, it will use the available data to pre-populate an investment framework
  - Chat-to-learn-more: you’ll have the ability to ask deeper questions of the key sources (powered by AI)
  - This combination feels like it will be the most usable long-term – almost like a “lecture with Q&A”
- Filtering: data sources will be tagged by company, source, etc.
- Prioritization: data sources will be “marked” with a prioritization level
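As a rough sketch of how the filtering and prioritization features could fit together: each source carries tags and a priority level, and retrieved passages are re-ranked by scaling the retriever’s similarity score by the source’s priority weight. Everything here is an assumption for illustration (the `Source` and `Hit` classes, the three priority levels, and the weights), not a description of how the app is actually built.

```python
from dataclasses import dataclass

# Hypothetical priority levels; the weights are illustrative, not tuned.
PRIORITY_WEIGHTS = {"high": 1.0, "medium": 0.7, "low": 0.4}


@dataclass(frozen=True)
class Source:
    """A tagged document, e.g. an internal memo or a news article."""
    title: str
    company: str
    source_type: str  # e.g. "internal_memo", "10-K", "news"
    priority: str     # one of the keys in PRIORITY_WEIGHTS


@dataclass(frozen=True)
class Hit:
    """A retrieved passage plus the similarity score from the retriever."""
    source: Source
    similarity: float  # assumed to be in [0, 1]


def rank_hits(hits, company=None, source_types=None):
    """Filter hits by tag, then order by similarity scaled by source priority."""
    kept = [
        h for h in hits
        if (company is None or h.source.company == company)
        and (source_types is None or h.source.source_type in source_types)
    ]
    return sorted(
        kept,
        key=lambda h: h.similarity * PRIORITY_WEIGHTS[h.source.priority],
        reverse=True,
    )
```

With this scheme, a passage from my own firm’s memo (priority "high", similarity 0.6) outranks a closer-matching passage from a clickbait article (priority "low", similarity 0.9), which is the kind of discernment described in Problem #2. A multiplicative weight is only one option; a hard cutoff or a tie-breaker rule would be equally defensible.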
This version will be built to be vendor-agnostic – i.e. no lock-in to any LLM (Problem #5). (LLMs are a commodity after all, aren’t they?) I’ll also only include a couple of companies, to reduce my own personal cost – each encoded document and each query costs fractions of a penny, which quickly add up. I plan to add a public company, a private company, and an investment fund. One of the biggest challenges here will be getting the RAG system to work well – it will likely be imperfect!
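One common way to stay vendor-agnostic is to have the app depend on a small interface and write a thin adapter per LLM backend. The sketch below assumes that approach; `LLMClient`, `EchoLLM`, and `answer_question` are hypothetical names, and `EchoLLM` is just a stand-in so the example runs without any vendor account.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Anything with a complete() method can serve as the app's LLM."""

    def complete(self, prompt: str) -> str: ...


class EchoLLM:
    """Stand-in backend for testing; a real adapter would call a vendor or local model."""

    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"


def answer_question(llm: LLMClient, question: str, passages: list[str]) -> str:
    """Build a prompt from retrieved passages and delegate to whichever backend was passed in."""
    context = "\n".join(f"- {p}" for p in passages)
    prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {question}"
    return llm.complete(prompt)


print(answer_question(EchoLLM(), "What drove revenue growth?", ["Revenue grew on new contracts."]))
```

Swapping vendors (or moving to a local model later) then means writing one new class with a `complete` method, while the rest of the app stays untouched.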
Configuration options
From what I’ve learned from the development design arc at Epic, less is more when it comes to configuration settings. The imperatives:
- Input for the company you’re looking at
- Ability to ask follow-up questions
- Maybe settings (or maybe these can just automagically work): ability to select data sources
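In code, that short list could be as small as the sketch below. The class name `ResearchSession` and its fields are hypothetical; the one design choice worth noting is that leaving `data_sources` unset means "let the app pick automatically", so the optional setting never gets in the way.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResearchSession:
    """Hypothetical container for the few settings a user controls."""
    company: str                                     # required: the company to research
    follow_ups: list = field(default_factory=list)   # follow-up questions asked so far
    data_sources: Optional[list] = None              # None means "pick sources automatically"

    def ask(self, question: str) -> None:
        """Record a follow-up question for this company."""
        self.follow_ups.append(question)

    def uses_automatic_sources(self) -> bool:
        """True when the user has not restricted the data sources."""
        return self.data_sources is None
```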
Version 2.0
Version 2.0 will extend this:
- Increase RAG accuracy
- Add more companies
- Add ways to retrieve data from and process outside sources
- Add user ability to prioritize sources (or maybe delegate this task to an LLM?)
- Test running this with a local LLM (Problem #4)
As we take in more data sources, the data ingestion problem (Problem #3) will become more of an issue. (In Version 1.0, I’ll manually clean the sources.) This problem will likely be delegated to a third-party vendor. There are tons of them that are effective (I’ve been impressed by Morphik in a quick test) and cheap (fractions of a penny per page).
The ultimate goal
Ultimately, I’m hoping to build a prototype that is:
- Usable for the long term – i.e. integrates well with existing workflows,
- Accurate,
- Explainable – i.e. as few black boxes as possible,
- Capable of privacy, and
- Relatively inexpensive.
I foresee larger companies being able to build this structure in-house (and having a vested interest in doing so!) – why trust outsiders with your own information? I hope to iron out some of the technological wrinkles and go into depth on them in the next few posts.
[1] Developments at Epic require a development design, which roughly follows this format (with a few more questions, and a few more technical details). I've spared you from too many details, but found this format useful for thinking about the whole point of a development: the problem and the solution.
[2] Pdfplumber is a commonly cited Python library that can handle tables… but it has a really hard time detecting tables that don’t have lines. See: the cited Master’s thesis for why this is so tough.