High-quality data as a differentiator
I’ve spent a good chunk of the last month working with data – data sources, data pipelines, data warehouses – and as a result, I’ve been forced to think about why I’ve been so drawn to it. I think it comes down to a thorough understanding of the ingredients. An analogy: a good home chef buys ingredients from the grocery store, but a great Michelin-star chef sources their ingredients directly from the farmers. This is what I’m after: sourcing the data, trusting its provenance, and cleaning it in the most effective way. It’s time-consuming and detail-oriented, but the general motivating idea is that the highest-quality meals can only be made with the highest-quality ingredients [1].
Pseudo-public data
I took a course this past spring whose theme was “harnessing data for the public good”: the premise being that (a) we are surrounded by data (much of it public!) but (b) we struggle to draw good insights from it. I think of it as “pseudo-public” data – data that is technically public but used effectively by few. It’s an extremely attractive idea: all the information you need to solve the puzzle is at your disposal; you just need to put the pieces together.
SEC filings as pseudo-public data
This brings me to my day-and-a-half side quest to pull SEC filing data, inspired by Joel Cohen's recent post on X asking for the best way to (a) download SEC filings and call transcripts and (b) analyze them. There were a few flavors of responses:
- Start-up finance-focused AI companies: Alphasense (Series F, finance data collection and analysis), Quartr (Seed/Series A, finance research platform), ChatsheetAI (generalized AI-infused spreadsheet, extending into finance), Aiera (Series B, tracks investor events like earnings calls), finchat.io (now Fiscal.ai, a Series A investment research platform), fintool.com (Seed, finance research platform), quasarmarkets.com (angel-backed financial intelligence platform), Portrait (accelerator-backed, AI-powered investment research focused on screening ideas), askedgar.io (very early stage, AI-driven research of SEC filings)
- Incumbent LLMs: Perplexity, Gemini Deep Research
- Incumbent finance aggregators: Bloomberg, FactSet
- DIY: edgartools, a script written by Ben Brostoff
From an outsider’s perspective: the filings are publicly available (with an API, too), yet people are still willing to buy them from a re-packager. Joel’s difficulty in getting this public information easily reminded me of this pseudo-public data paradox, so I thought it was a worthwhile endeavor to add SEC filings to my data catalog.
App to pull high-quality SEC filing data to Markdown
Some technical notes:
- SEC filings are accessible via an EDGAR API
- The filings are available as HTML files, XBRL files, and a hybrid form (iXBRL)
- HTML files give all the information filed, but HTML is very verbose. (Every piece of data is accompanied by its formatting, so you might see style="font-size:10pt;font-weight:400;top-margin:10px;left-margin:10px" repeated many times. Some of the tables also have extra rows and columns to help with visual spacing.)
- XBRL files contain table-like data
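As a sketch of the pull step: EDGAR exposes a submissions feed per company, keyed by a zero-padded CIK, and asks for a descriptive User-Agent header on requests. The helper names, form filter, and sample CIK below are my own illustrative choices, not from the app:

```python
# Sketch: listing a company's recent filings from the SEC EDGAR
# submissions API, using only the standard library.
import json
import urllib.request

def submissions_url(cik: str) -> str:
    # EDGAR keys the submissions feed by a 10-digit, zero-padded CIK.
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def recent_filings(cik: str, form_type: str = "10-K"):
    # The SEC asks for a descriptive User-Agent; replace with your own.
    req = urllib.request.Request(
        submissions_url(cik),
        headers={"User-Agent": "you@example.com"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # "recent" holds parallel arrays: one entry per filing, same index.
    recent = data["filings"]["recent"]
    return [
        {"accession": acc, "date": date}
        for acc, date, form in zip(
            recent["accessionNumber"], recent["filingDate"], recent["form"]
        )
        if form == form_type
    ]

# e.g. recent_filings("320193")  # Apple's CIK
```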
My stance is that SEC filing data can be converted to be “higher quality” with a few steps:
- It’s easy to pull the HTML files from the EDGAR API, but it’s a little harder to know that the data should be (a) cleaned and (b) converted to Markdown.
- (Note: Markdown is a format that is “extremely close to plain text, with minimal markup or formatting” and a format that “mainstream LLMs, such as OpenAI’s GPT-4o, natively ‘speak’” [2].)
- I take a few extra steps – including tidying up the HTML tables to remove spacing columns [3] and contextualizing some of the text styles [4] – to ensure the output Markdown better reflects the original filing.
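A minimal sketch of those cleaning steps, using regexes purely for illustration (a real pipeline would want a proper HTML parser, and the patterns below are simplified assumptions about how the inline styles and spacing cells look):

```python
# Sketch: stripping inline formatting and empty "spacing" cells
# from SEC filing HTML before converting it to Markdown.
import re

# Inline style attributes carry formatting, not content.
SEC_STYLE = re.compile(r'\s+style="[^"]*"')
# Cells holding only whitespace or non-breaking spaces are visual spacers.
SPACER_CELL = re.compile(r'<td[^>]*>(?:\s|&#160;|&nbsp;)*</td>')

def clean_filing_html(html: str) -> str:
    html = SEC_STYLE.sub("", html)    # drop inline formatting
    html = SPACER_CELL.sub("", html)  # drop empty spacing cells
    return html

# e.g. clean_filing_html('<td style="width:1pt">&#160;</td><td>Revenue</td>')
```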
The resulting Markdown files are cleaner (i.e. they contain minimal markup/formatting), smaller (important because LLM APIs charge by the token), and, most importantly, more usable for downstream systems. My hypothesis is that this is a higher-quality form of SEC filing data.
Gatekeeping this public data, though, feels a bit selfish. I built an app to allow anyone to pull this data easily (and for free). It lets you download recent SEC filings for any company the SEC lists.
Extracting insights from SEC filings (an ongoing pursuit)
These Markdown SEC filings are small enough to be loaded into Google NotebookLM. (NotebookLM struggles with the full 10-K and 10-Q PDFs and HTML files.) I like NotebookLM for its ability to let you upload files and source answers from them, and based on some quick tests, it seems to do a good job. It also seems to extract table data well from these Markdown files.
This “data analysis” piece is an ongoing project – it will be interesting to explore whether the analysis can be differentiated, or whether all solutions are small variations on the big LLM players (ChatGPT, Gemini, etc.). If it’s the former, look forward to more posts on it! If it’s the latter, then high-quality data (the ingredients that fuel the LLMs) may be a credible differentiator – and the focus may turn once again to sourcing the best data.
[1] Digging into the data is also a worthwhile exercise to see which data sources are a commodity (and thus can be purchased off the shelf) and which can be a source of differentiation.
[3] The HTML tables in SEC filings contain “spacing columns” and “spacing rows,” which are strictly used to make the output look prettier to a human. Computers can be confused by these decorations.
[4] A little technical, but the HTML in SEC filings looks something like: <span style="font-size:24pt;font-weight:700;top-margin:10px;left-margin:10px">text</span>. Many of these styles are not useful to a computer, but things like the font size and font weight (i.e. bolding) give us contextual clues about where the headers, etc. are.