Comparing 13 Rust Crates for Extracting Text from HTML
Applications that run documents through LLMs or embedding models need to clean the text before feeding it into the model. I'm building a personalized content feed called Scour and was looking for a Rust crate to extract text from scraped HTML. I started off using a library that's used by a couple of LLM-related projects. However, while hunting a phantom memory leak, I built a little tool (emschwartz/html-to-text-comparison) to compare 13 Rust crates for extracting text from HTML and found that the results varied widely.
TL;DR: lol_html is a very impressive HTML rewriting crate from Cloudflare, and fast_html2md is a newer HTML-to-Markdown crate that makes use of it. If you're doing web scraping or working with LLMs in Rust, you should take a look at both of those.
Approaches
At a high level, there are 3 categories of approaches we might use for cleaning HTML:
- HTML-to-text - as the name suggests, these crates convert whole HTML documents to plain text and were mostly developed for use cases like rendering HTML emails in terminals.
- HTML-to-markdown - these crates convert the HTML document to markdown and were built for a variety of uses, ranging from displaying web pages in terminals to general web scraping and LLM applications.
- Readability - the final set of crates are ports of the mozilla/readability library, which is used for the Firefox Reader View. These attempt to extract only the main content from the page by scoring DOM elements using a variety of heuristics.
Any of these should work for an LLM application, because we mostly care about stripping away HTML tags and extraneous content like scripts and CSS. I say "should" because some of these crates definitely do not work as well as you might expect.
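To make the readability approach a little more concrete, here is a toy sketch of the idea (my own illustration, not mozilla/readability's actual algorithm): score candidate blocks of text by how much prose they contain and how link-heavy they are, then keep the highest-scoring one.

```rust
// A toy content scorer in the spirit of readability-style heuristics. Real
// implementations score DOM nodes, propagate scores to ancestor elements, and
// use many more signals; this only looks at length, commas, and link density.
struct Block {
    text: String,
    link_text_len: usize, // number of characters that sit inside <a> tags
}

fn score(block: &Block) -> f64 {
    let len = block.text.len() as f64;
    let commas = block.text.matches(',').count() as f64;
    let link_density = if len > 0.0 {
        block.link_text_len as f64 / len
    } else {
        0.0
    };
    // Long, comma-rich blocks with few links look like article body text;
    // link-dense blocks look like navigation, sidebars, or comment headers.
    (len / 100.0 + commas) * (1.0 - link_density)
}

fn main_content(blocks: &[Block]) -> Option<&Block> {
    blocks
        .iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).expect("scores are finite"))
}
```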
Parsers
While there are a variety of different crates for extracting text from HTML, 10 out of the 13 I'm testing use the same underlying library for parsing the HTML: html5ever. This crate was developed as part of the Servo project and, as the download count suggests, it is used by many different libraries and applications.
The catch when using html5ever, however, is that it does not ship with a DOM tree implementation. The Servo project does have a simple tree implementation using reference-counted pointers that is used for their tests. It comes with this warning, though:
This crate is built for the express purpose of writing automated tests for the html5ever and xml5ever crates. It is not intended to be a production-quality DOM implementation, and has not been fuzzed or tested against arbitrary, malicious, or nontrivial inputs. No maintenance or support for any such issues will be provided. If you use this DOM implementation in a production, user-facing system, you do so at your own risk.
Despite the scary disclaimer, the markup5ever_rcdom crate is used by plenty of libraries, including 7 out of the 10 crates I'm testing that use html5ever. The other 3 use DOM tree implementations from scraper, dom_query, and kuchiki (note that kuchiki is archived and unmaintained, but Brave maintains a fork of it called kuchikiki).
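To give a sense of what these crates build on, here is a minimal sketch (my own, not code from any of the crates above) that parses a document with html5ever into markup5ever_rcdom's reference-counted tree and then walks it to collect text, skipping script and style elements:

```rust
use html5ever::parse_document;
use html5ever::tendril::TendrilSink;
use markup5ever_rcdom::{Handle, NodeData, RcDom};

// Recursively walk the Rc-based tree, collecting text nodes and skipping
// the contents of script and style elements.
fn text_content(handle: &Handle, out: &mut String) {
    match &handle.data {
        NodeData::Text { contents } => out.push_str(&contents.borrow()),
        NodeData::Element { name, .. } if matches!(&*name.local, "script" | "style") => return,
        _ => {}
    }
    for child in handle.children.borrow().iter() {
        text_content(child, out);
    }
}

fn main() {
    let html = "<html><body><p>Hello, <b>world</b>!</p><script>ignored()</script></body></html>";
    let dom = parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .read_from(&mut html.as_bytes())
        .unwrap();

    let mut text = String::new();
    text_content(&dom.document, &mut text);
    println!("{}", text.trim());
}
```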
Of the 3 remaining crates that do not use html5ever, two use custom HTML parsers and the third uses Cloudflare's lol_html streaming HTML rewriter. We'll talk more about lol_html below.
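lol_html takes a different approach: instead of building a tree, it streams the document past handlers registered against CSS selectors. As a rough illustration (my own sketch, not how fast_html2md actually uses it), collecting the text of paragraph elements looks something like this:

```rust
use lol_html::{rewrite_str, text, RewriteStrSettings};

// Collect the text of paragraph elements as the document streams through the
// rewriter, without building a DOM tree. A real extractor would cover far more
// elements (headings, list items, etc.) and handle whitespace.
fn extract_paragraph_text(html: &str) -> Result<String, lol_html::errors::RewritingError> {
    let mut extracted = String::new();
    rewrite_str(
        html,
        RewriteStrSettings {
            element_content_handlers: vec![text!("p", |chunk| {
                extracted.push_str(chunk.as_str());
                Ok(())
            })],
            ..RewriteStrSettings::default()
        },
    )?;
    Ok(extracted)
}
```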
The Competitors
| Crate | Output | Parser | Tree | Notable Users | License |
|---|---|---|---|---|---|
| august | Text | html5ever | markup5ever_rcdom | | MIT |
| boilerpipe | Text | html5ever | scraper::Html | | MIT |
| dom_smoothie | Readability | html5ever | dom_query::Tree | | MIT |
| fast_html2md | Markdown | lol_html | N/A | Spider | MIT |
| htmd | Markdown | html5ever | markup5ever_rcdom | Swiftide | Apache-2.0 |
| html2md | Markdown | html5ever | markup5ever_rcdom | Atomic Data, ollama-rs, Lemmy | GPL-3.0+ |
| html2md-rs | Markdown | Custom | Custom | | MIT |
| html2text | Text | html5ever | markup5ever_rcdom | Lemmy, various terminal apps | MIT |
| llm_readability | Readability | html5ever | markup5ever_rcdom | Spider | MIT |
| mdka | Markdown | html5ever | markup5ever_rcdom | | Apache-2.0 |
| nanohtml2text | Text | Custom | Custom | | MIT |
| readability | Readability | html5ever | markup5ever_rcdom | langchain-rust, Kalosm, llm_utils | MIT |
| readable-readability | Readability | html5ever | kuchiki::Node | hackernews_tui | MIT |
Test Criteria
Some of the criteria to care about when selecting an HTML extraction library are:
- Correct Content - whether the output contains the text you care about for any given website (this is a key criterion, and not one to take for granted, as we'll see in the results).
- Text Size - the total size of the output -- though of course what you really care about is how much extraneous content is included on top of the main content.
- Speed or Throughput - how fast it processes a given input file size. Note that with scraping, the processing time will be dwarfed by the latency of the actual network request.
- Memory Usage - depending on your application and how many pages you are scraping, you may care more or less about the total memory usage (see the sketch after this list for one way to measure it).
- Format - if you are using the cleaned text for an LLM application, you may not care too much about the correctness of the markdown or text formatting. For other types of applications, this obviously matters more.
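For the peak memory numbers in particular, one common technique is to wrap the global allocator and track a high-water mark. Here is a generic sketch of that approach (my own illustration; the comparison tool may measure memory differently):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wrap the system allocator and record the highest number of live heap bytes
// seen so far. Reading PEAK after a run gives the peak memory of that run.
struct TrackingAllocator;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: TrackingAllocator = TrackingAllocator;

fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```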
Unlike when I was testing bitwise Hamming Distance implementations, I am not using Criterion for benchmarking this time. The output of these crates is not expected to be exactly equivalent, and speed is not the only criterion I wanted to compare.
Test Results
The test tool, emschwartz/html-to-text-comparison, is set up so that you can point it at any website and it will dump the output from each crate into a text file while printing various stats about each crate's run.
cargo install --locked --git https://github.com/emschwartz/html-to-text-comparison
html-to-text-comparison https://example.com
I would encourage you to try it yourself but here are the results from a couple different types of websites:
Hacker News Front Page
| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 2015 | 70809 | 191.40% | 6411 | 82.67% | out/august.txt |
| ~~boilerpipe~~ | 1830 | 125587 | 339.46% | 66 | 99.82% 🤐 | out/boilerpipe.txt |
| dom_smoothie | 6458 | 200729 | 542.57% | 5950 | 83.92% | out/dom_smoothie.txt |
| fast_html2md | 1406 | 4806 | 12.99% | 11093 | 70.02% | out/fast_html2md.txt |
| htmd | 1789 | 38549 | 104.20% | 11097 | 70.00% | out/htmd.txt |
| ~~html2md~~ | 14312 | 918503 | 2482.71% | 3823657 | -10235.33% 🤯 | out/html2md.txt |
| html2md-rs | 1472 | 85923 | 232.25% | 16792 | 54.61% | out/html2md-rs.txt |
| html2text | 3028 | 100981 | 272.95% | 268567 | -625.94% | out/html2text.txt |
| ~~llm_readability~~ | 3852 | 72949 | 197.18% | 0 | 100.00% 🤐 | out/llm_readability.txt |
| ~~mdka~~ | 1291 | 35315 | 95.46% | 1 | 100.00% 🤐 | out/mdka.txt |
| nanohtml2text | 606 | 6975 | 18.85% | 10648 | 71.22% | out/nanohtml2text.txt |
| ~~readability~~ | 4129 | 67139 | 181.48% | 11 | 99.97% 🤐 | out/readability.txt |
| readable-readability | 1820 | 131031 | 354.18% | 3750 | 89.86% | out/readable-readability.txt |
Some of these are crossed out because the output is completely wrong. For example, llm_readability and mdka produced empty strings, readability produced only the string "Hacker News", and boilerpipe produced "195 points by recvonline 4 hours ago | hide | 181 comments\n15.". html2md exploded and output a file that was 100x larger than the original HTML, mostly filled with whitespace.
mozilla/readability GitHub Repo
| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 6546 | 214932 | 62.55% | 12916 | 96.24% | august.txt |
| ~~boilerpipe~~ | 6574 | 340102 | 98.97% | 266 | 99.92% | boilerpipe.txt |
| dom_smoothie | 12428 | 498327 | 145.02% | 6446 | 98.12% | dom_smoothie.txt |
| fast_html2md | 3649 | 6317 | 1.84% | 14607 | 95.75% | fast_html2md.txt |
| htmd | 6388 | 160433 | 46.69% | 14071 | 95.91% | htmd.txt |
| html2md | 7368 | 200740 | 58.42% | 89019 | 74.09% | html2md.txt |
| html2md-rs | 4355 | 242241 | 70.50% | 17650 | 94.86% | html2md-rs.txt |
| html2text | 7548 | 244119 | 71.04% | 28699 | 91.65% | html2text.txt |
| ~~llm_readability~~ | 5039 | 144964 | 42.19% | 19 | 99.99% | llm_readability.txt |
| ~~mdka~~ | 6172 | 206179 | 60.00% | 6948 | 97.98% | mdka.txt |
| nanohtml2text | 2660 | 85684 | 24.94% | 18779 | 94.54% | nanohtml2text.txt |
| ~~readability~~ | 6056 | 151532 | 44.10% | 53 | 99.98% | readability.txt |
| ~~readable-readability~~ | 6000 | 212956 | 61.97% | 53 | 99.98% | readable-readability.txt |
As in the previous test, I've crossed out the crates that completely missed the mark. This time, all of the failing implementations seemed to focus on the wrong HTML element(s). For example, readability and readable-readability produced only the string "You can’t perform that action at this time."
Rust Lang Blog
| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 893 | 52032 | 240.09% | 12601 | 41.86% | august.txt |
| boilerpipe | 934 | 101874 | 470.07% | 5660 | 73.88% | boilerpipe.txt |
| dom_smoothie | 2129 | 129626 | 598.13% | 6649 | 69.32% | dom_smoothie.txt |
| fast_html2md | 639 | 5108 | 23.57% | 13102 | 39.54% | fast_html2md.txt |
| htmd | 798 | 20549 | 94.82% | 11958 | 44.82% | htmd.txt |
| html2md | 801 | 65159 | 300.66% | 13498 | 37.72% | html2md.txt |
| ~~html2md-rs~~ | 311 | 35988 | 166.06% | 21 | 99.90% | html2md-rs.txt |
| html2text | 1177 | 38758 | 178.84% | 13574 | 37.37% | html2text.txt |
| llm_readability | 2733 | 55464 | 255.92% | 5870 | 72.91% | llm_readability.txt |
| mdka | 895 | 19169 | 88.45% | 13147 | 39.34% | mdka.txt |
| nanohtml2text | 234 | 5345 | 24.66% | 12866 | 40.63% | nanohtml2text.txt |
| readability | 2252 | 54610 | 251.98% | 5801 | 73.23% | readability.txt |
| readable-readability | 609 | 80610 | 371.95% | 6561 | 69.73% | readable-readability.txt |
This is a more straightforward blog page, and this time only one crate got it completely wrong (html2md-rs produced "<noscript></noscript>").
Conclusion
The first conclusion we should draw from these tests is that it is extremely important to check the output of your HTML cleaning library. Some of the libraries tested here are widely used and yet they completely failed to find the important content on the pages we looked at. If you're building an application with an LLM and get strange results, you should spot check some of the text you're feeding in.
If we remove the contenders that completely failed any of these tests, we're left with:
- august
- dom_smoothie
- fast_html2md
- htmd
- html2text
- nanohtml2text
You might have a fine experience building with any of these, but I would choose to narrow this list down further based on a combination of their performance and manually inspecting their outputs:
fast_html2md
This library does a reasonable job transforming the HTML into markdown while being among the fastest performers and maintaining extremely low memory usage.
In the tests above, it kept its memory footprint between 5 and 6 KB, independent of the input size. This is impressive but unsurprising, given that the underlying HTML library, lol_html, lets you tune the memory settings.
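As a rough sketch of what that tuning looks like (my own example with arbitrary buffer sizes, not fast_html2md's actual configuration), lol_html's streaming API accepts explicit memory settings:

```rust
use lol_html::{text, HtmlRewriter, MemorySettings, Settings};

// Stream HTML through lol_html with an explicit cap on how much memory the
// rewriter may use. The buffer sizes here are arbitrary illustrative values.
fn extract_with_memory_cap(chunks: &[&[u8]]) -> Result<String, Box<dyn std::error::Error>> {
    let mut extracted = String::new();
    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![text!("p", |chunk| {
                extracted.push_str(chunk.as_str());
                Ok(())
            })],
            memory_settings: MemorySettings {
                preallocated_parsing_buffer_size: 1024,
                // If the rewriter would exceed this, it errors out instead of growing.
                max_allowed_memory_usage: 256 * 1024,
            },
            ..Settings::default()
        },
        // We only care about the collected text, so discard the rewritten output.
        |_: &[u8]| {},
    );
    for &chunk in chunks {
        rewriter.write(chunk)?;
    }
    rewriter.end()?;
    Ok(extracted)
}
```

Because the document is processed as a stream and the rewriter refuses to grow past the configured cap, the memory footprint stays flat regardless of how large the input page is.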
The blog post A History of HTML Parsing at Cloudflare: Part 2 gives more detail on the history and architecture of lol_html. If you're doing any kind of HTML manipulation, I would recommend reading that post and trying out their library.
dom_smoothie
While this has much higher memory usage than fast_html2md, and far fewer downloads at the time of writing, it is the only Readability implementation that correctly found the main text in the very limited subset of websites I tested. If you want to make sure you only include the text and none of the headers or other content, this might be the crate for you.
Appendix: HTML-to-Markdown with a Language Model
Jina has a couple of small language models designed to convert HTML to markdown. They are available on Hugging Face under a Creative Commons Non-Commercial license and via their API for commercial uses.
Depending on your use case, you might also want to try them out. The API-based version is included in the comparison tool under an optional feature flag. However, I left them out of the main comparison because the memory usage is going to be considerably higher than any of the Rust crates and the models are not freely available.
Discuss on Hacker News, Lobsters, or r/rust.
Subscribe via RSS or on 🐿️ Scour.