High-quality data as a differentiator
I’ve spent a good chunk of the last month working with data – data sources, data pipelines, data warehouses – and as a result, I’ve been forced to think about why I’ve been so drawn to it. I think it comes down to a thorough understanding of the ingredients. An analogy: a good home chef buys ingredients from the grocery store, but a great Michelin-star chef sources their ingredients directly from the farmers. This is what I’m after: sourcing the data, trusting its provenance, and cleaning it in the most effective way. It’s time-consuming and detail-oriented, but the general motivating idea is that the highest-quality meals can only be made with the highest-quality ingredients [1].
Pseudo-public data
I took a course this past spring whose theme was “harnessing data for the public good”: the premise being that (a) we are surrounded by data (much of it public!) but (b) we struggle to draw good insights from it. I think of it as “pseudo-public” data – data that is technically public but used effectively by few. It’s an extremely attractive idea: all the information you need to solve the puzzle is at your disposal; you just need to put the pieces together.
SEC filings as pseudo-public data
This brings me to my day-and-a-half side quest to pull SEC filing data, inspired by Joel Cohen's recent post on X asking for the best way to (a) download SEC filings and call transcripts and (b) analyze them. There were a few flavors of responses:
- Start-up finance-focused AI companies: Alphasense (Series F, finance data collection and analysis), Quartr (Seed/Series A, finance research platform), ChatsheetAI (generalized AI-infused spreadsheet, extending into finance), Aiera (Series B, tracks investor events like earnings calls), finchat.io (now Fiscal.ai, a Series A investment research platform), fintool.com (Seed, finance research platform), quasarmarkets.com (angel-backed financial intelligence platform), Portrait (accelerator-backed, AI-powered investment research focused on screening ideas), askedgar.io (very early stage, AI-driven research of SEC filings)
- Incumbent LLMs: Perplexity, Gemini Deep Research
- Incumbent finance aggregators: Bloomberg, FactSet
- DIY: edgartools, a script written by Ben Brostoff
From an outsider’s perspective: the filings are publicly available (with an API, too), yet people are still willing to buy them from a re-packager. Joel’s difficulty in getting this public information easily reminded me of this pseudo-public data paradox, so I thought it was a worthwhile endeavor to add SEC filings to my data catalog.
App to pull high-quality SEC filing data to Markdown
Some technical notes:
- SEC filings are accessible via an EDGAR API
- The filings are available as HTML files, XBRL files, and a hybrid form (iXBRL)
- HTML files give all the information filed, but HTML is very verbose. (Every piece of data is accompanied by its formatting, so you might see style="font-size:10pt;font-weight:400;top-margin:10px;left-margin:10px" repeated many times. Some of the tables also have extra rows and columns to help with visual spacing.)
- XBRL files contain table-like data
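As a sketch of the pull step: EDGAR exposes a submissions feed per company, keyed by a zero-padded CIK, and asks for a descriptive User-Agent header on requests. The helper names, form filter, and sample CIK below are my own illustrative choices, not from the app:

```python
# Sketch: listing a company's recent filings from the SEC EDGAR
# submissions API, using only the standard library.
import json
import urllib.request

def submissions_url(cik: str) -> str:
    # EDGAR keys the submissions feed by a 10-digit, zero-padded CIK.
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def recent_filings(cik: str, form_type: str = "10-K"):
    # The SEC asks for a descriptive User-Agent; replace with your own.
    req = urllib.request.Request(
        submissions_url(cik),
        headers={"User-Agent": "you@example.com"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # "recent" holds parallel arrays: one entry per filing, same index.
    recent = data["filings"]["recent"]
    return [
        {"accession": acc, "date": date}
        for acc, date, form in zip(
            recent["accessionNumber"], recent["filingDate"], recent["form"]
        )
        if form == form_type
    ]

# e.g. recent_filings("320193")  # Apple's CIK
```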
My stance is that SEC filing data can be converted to be “higher quality” with a few steps:
- It’s easy to pull the HTML files from the EDGAR API, but it’s a little harder to know that the data should be (a) cleaned and (b) converted to Markdown.
- (Note: Markdown is a format that is “extremely close to plain text, with minimal markup or formatting” and a format that “mainstream LLMs, such as OpenAI’s GPT-4o, natively ‘speak’” [2].)
- I take a few extra steps – including tidying up the HTML tables to remove spacing columns [3] and contextualizing some of the text styles [4] – to ensure the output Markdown better reflects the original filing.
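A minimal sketch of those cleaning steps, using regexes purely for illustration (a real pipeline would want a proper HTML parser, and the patterns below are simplified assumptions about how the inline styles and spacing cells look):

```python
# Sketch: stripping inline formatting and empty "spacing" cells
# from SEC filing HTML before converting it to Markdown.
import re

# Inline style attributes carry formatting, not content.
SEC_STYLE = re.compile(r'\s+style="[^"]*"')
# Cells holding only whitespace or non-breaking spaces are visual spacers.
SPACER_CELL = re.compile(r'<td[^>]*>(?:\s|&#160;|&nbsp;)*</td>')

def clean_filing_html(html: str) -> str:
    html = SEC_STYLE.sub("", html)    # drop inline formatting
    html = SPACER_CELL.sub("", html)  # drop empty spacing cells
    return html

# e.g. clean_filing_html('<td style="width:1pt">&#160;</td><td>Revenue</td>')
```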
The resulting Markdown files are cleaner (i.e. they contain minimal markup/formatting), smaller (important because LLM APIs charge by the token), and, most importantly, more usable for downstream systems. My hypothesis is that this is a higher-quality form of SEC filing data.
Gatekeeping this public data, though, feels a bit selfish. I built an app to allow anyone to pull this data easily (and for free). It lets you download recent SEC filings for any company the SEC lists.
Extracting insights from SEC filings (an ongoing pursuit)
These Markdown SEC filings are small enough to be loaded into Google NotebookLM. (NotebookLM struggles with the full 10-K and 10-Q PDFs and HTML files.) I like NotebookLM for its ability to let you upload files and source answers from them, and based on some quick tests, it seems to do a good job. It also seems to extract table data well from these Markdown files.
This “data analysis” piece is an ongoing project – it will be interesting to explore whether the analysis can be differentiated, or whether all solutions are small variations on the big LLM players (ChatGPT, Gemini, etc.). If it’s the former, look forward to more posts on it! If it’s the latter, then high-quality data (the ingredients that fuel the LLMs) may be a credible differentiator – and the focus may turn once again to sourcing the best data.
[1] Digging into the data is also a worthwhile exercise to see which data sources are a commodity (and thus can be purchased off the shelf) and which can be a source of differentiation.
[3] The HTML tables in SEC filings contain “spacing columns” and “spacing rows,” which are strictly used to make the output look prettier to a human. Computers can be confused by these decorations.
[4] A little technical, but the HTML in SEC filings looks something like: <span style="font-size:24pt;font-weight:700;top-margin:10px;left-margin:10px">text</span>. Many of these styles are not useful to a computer, but things like the font size and font weight (i.e. bolding) give us contextual clues about where the headers, etc. are.