Evan Schwartz

Comparing 13 Rust Crates for Extracting Text from HTML

Applications that run documents through LLMs or embedding models need to clean the text before feeding it into the model. I'm building a personalized content feed called Scour and was looking for a Rust crate to extract text from scraped HTML. I started off using a library that's used by a couple of LLM-related projects. However, while hunting a phantom memory leak, I built a little tool (emschwartz/html-to-text-comparison) to compare 13 Rust crates for extracting text from HTML and found that the results varied widely.

TL;DR: lol_html is a very impressive HTML rewriting crate from Cloudflare and fast_html2md is a newer HTML-to-Markdown crate that makes use of it. If you're doing web scraping or working with LLMs in Rust, you should take a look at both of those.

Approaches

At a high level, there are 3 categories of approaches we might use for cleaning HTML:

  1. HTML-to-text - as the name suggests, these crates convert whole HTML documents to plain text and were mostly developed for use cases like rendering HTML emails in terminals.
  2. HTML-to-markdown - these crates convert the HTML document to markdown and were built for a variety of uses, ranging from displaying web pages in terminals to general web scraping and LLM applications.
  3. Readability - the final set of crates are ports of the mozilla/readability library, which is used for the Firefox Reader View. These attempt to extract only the main content from the page by scoring DOM elements using a variety of heuristics.

Any of these should work for an LLM application, because we mostly care about stripping away HTML tags and extraneous content like scripts and CSS. I say "should" because some of these crates definitely do not work as well as you might expect.

Parsers

While there are a variety of different crates for extracting text from HTML, 10 out of the 13 I'm testing use the same underlying library for parsing the HTML: html5ever. This crate was developed as part of the Servo project and it is used by many different libraries and applications.

The catch when using html5ever, however, is that it does not ship with a DOM tree implementation. The Servo project does have a simple tree implementation using reference-counted pointers that is used for their tests. It comes with this warning though:

This crate is built for the express purpose of writing automated tests for the html5ever and xml5ever crates. It is not intended to be a production-quality DOM implementation, and has not been fuzzed or tested against arbitrary, malicious, or nontrivial inputs. No maintenance or support for any such issues will be provided. If you use this DOM implementation in a production, user-facing system, you do so at your own risk.

Despite the scary disclaimer, markup5ever_rcdom is used by plenty of libraries, including 7 out of the 10 crates I'm testing that use html5ever. The other 3 use DOM tree implementations from scraper, dom_query, and kuchiki (note that kuchiki is archived and unmaintained but Brave maintains a fork of it called kuchikiki).

Of the 3 remaining crates that do not use html5ever, two use custom HTML parsers and the third uses Cloudflare's lol_html streaming HTML rewriter. We'll talk more about lol_html below.

The Competitors

| Crate | Output | Parser | Tree | Notable Users | License |
|---|---|---|---|---|---|
| august | Text | html5ever | markup5ever_rcdom | | MIT |
| boilerpipe | Text | html5ever | scraper::Html | | MIT |
| dom_smoothie | Readability | html5ever | dom_query::Tree | | MIT |
| fast_html2md | Markdown | lol_html | N/A | Spider | MIT |
| htmd | Markdown | html5ever | markup5ever_rcdom | Swiftide | Apache-2.0 |
| html2md | Markdown | html5ever | markup5ever_rcdom | Atomic Data, ollama-rs, Lemmy | GPL-3.0+ |
| html2md-rs | Markdown | Custom | Custom | | MIT |
| html2text | Text | html5ever | markup5ever_rcdom | Lemmy, various terminal apps | MIT |
| llm_readability | Readability | html5ever | markup5ever_rcdom | Spider | MIT |
| mdka | Markdown | html5ever | markup5ever_rcdom | | Apache-2.0 |
| nanohtml2text | Text | Custom | Custom | | MIT |
| readability | Readability | html5ever | markup5ever_rcdom | langchain-rust, Kalosm, llm_utils | MIT |
| readable-readability | Readability | html5ever | kuchiki::Node | hackernews_tui | MIT |

Test Criteria

The main criteria to care about when selecting an HTML extraction library are speed, peak memory usage, how much the output shrinks relative to the input, and, above all, whether the output actually captures the page's main content.

Unlike when I was testing bitwise Hamming Distance implementations, I am not using Criterion for benchmarking this time. The output of these crates is not expected to be exactly equivalent, and speed is not the only criterion I wanted to compare.

Test Results

The test tool, emschwartz/html-to-text-comparison, is set up so that you can point it at any website and it will dump the output from each crate into a text file while printing various stats about each crate's run.

```shell
cargo install --locked --git https://github.com/emschwartz/html-to-text-comparison
html-to-text-comparison https://example.com
```

I would encourage you to try it yourself, but here are the results from a few different types of websites:

Hacker News Front Page

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 2015 | 70809 | 191.40% | 6411 | 82.67% | out/august.txt |
| ~~boilerpipe~~ | 1830 | 125587 | 339.46% | 66 | 99.82% 🤐 | out/boilerpipe.txt |
| dom_smoothie | 6458 | 200729 | 542.57% | 5950 | 83.92% | out/dom_smoothie.txt |
| fast_html2md | 1406 | 4806 | 12.99% | 11093 | 70.02% | out/fast_html2md.txt |
| htmd | 1789 | 38549 | 104.20% | 11097 | 70.00% | out/htmd.txt |
| ~~html2md~~ | 14312 | 918503 | 2482.71% | 3823657 | -10235.33% 🤯 | out/html2md.txt |
| html2md-rs | 1472 | 85923 | 232.25% | 16792 | 54.61% | out/html2md-rs.txt |
| html2text | 3028 | 100981 | 272.95% | 268567 | -625.94% | out/html2text.txt |
| ~~llm_readability~~ | 3852 | 72949 | 197.18% | 0 | 100.00% 🤐 | out/llm_readability.txt |
| ~~mdka~~ | 1291 | 35315 | 95.46% | 1 | 100.00% 🤐 | out/mdka.txt |
| nanohtml2text | 606 | 6975 | 18.85% | 10648 | 71.22% | out/nanohtml2text.txt |
| ~~readability~~ | 4129 | 67139 | 181.48% | 11 | 99.97% 🤐 | out/readability.txt |
| readable-readability | 1820 | 131031 | 354.18% | 3750 | 89.86% | out/readable-readability.txt |

Some of these are crossed out because the output is completely wrong. For example, llm_readability and mdka produced empty strings, readability produced only the string "Hacker News", and boilerpipe produced "195 points by recvonline 4 hours ago | hide | 181 comments\n15.". html2md exploded and output a file that was 100x larger than the original HTML, mostly filled with whitespace.

mozilla/readability Github Repo

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 6546 | 214932 | 62.55% | 12916 | 96.24% | august.txt |
| ~~boilerpipe~~ | 6574 | 340102 | 98.97% | 266 | 99.92% | boilerpipe.txt |
| dom_smoothie | 12428 | 498327 | 145.02% | 6446 | 98.12% | dom_smoothie.txt |
| fast_html2md | 3649 | 6317 | 1.84% | 14607 | 95.75% | fast_html2md.txt |
| htmd | 6388 | 160433 | 46.69% | 14071 | 95.91% | htmd.txt |
| html2md | 7368 | 200740 | 58.42% | 89019 | 74.09% | html2md.txt |
| html2md-rs | 4355 | 242241 | 70.50% | 17650 | 94.86% | html2md-rs.txt |
| html2text | 7548 | 244119 | 71.04% | 28699 | 91.65% | html2text.txt |
| ~~llm_readability~~ | 5039 | 144964 | 42.19% | 19 | 99.99% | llm_readability.txt |
| mdka | 6172 | 206179 | 60.00% | 6948 | 97.98% | mdka.txt |
| nanohtml2text | 2660 | 85684 | 24.94% | 18779 | 94.54% | nanohtml2text.txt |
| ~~readability~~ | 6056 | 151532 | 44.10% | 53 | 99.98% | readability.txt |
| ~~readable-readability~~ | 6000 | 212956 | 61.97% | 53 | 99.98% | readable-readability.txt |

As in the previous test, I've crossed out the crates that completely missed the mark. This time, all of the failing implementations seemed to focus on the wrong HTML element(s). For example, readability and readable-readability produced only the string "You can’t perform that action at this time."

Rust Lang Blog

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 893 | 52032 | 240.09% | 12601 | 41.86% | august.txt |
| boilerpipe | 934 | 101874 | 470.07% | 5660 | 73.88% | boilerpipe.txt |
| dom_smoothie | 2129 | 129626 | 598.13% | 6649 | 69.32% | dom_smoothie.txt |
| fast_html2md | 639 | 5108 | 23.57% | 13102 | 39.54% | fast_html2md.txt |
| htmd | 798 | 20549 | 94.82% | 11958 | 44.82% | htmd.txt |
| html2md | 801 | 65159 | 300.66% | 13498 | 37.72% | html2md.txt |
| ~~html2md-rs~~ | 311 | 35988 | 166.06% | 21 | 99.90% | html2md-rs.txt |
| html2text | 1177 | 38758 | 178.84% | 13574 | 37.37% | html2text.txt |
| llm_readability | 2733 | 55464 | 255.92% | 5870 | 72.91% | llm_readability.txt |
| mdka | 895 | 19169 | 88.45% | 13147 | 39.34% | mdka.txt |
| nanohtml2text | 234 | 5345 | 24.66% | 12866 | 40.63% | nanohtml2text.txt |
| readability | 2252 | 54610 | 251.98% | 5801 | 73.23% | readability.txt |
| readable-readability | 609 | 80610 | 371.95% | 6561 | 69.73% | readable-readability.txt |

This is a more straightforward blog page and this time only one crate got it completely wrong (html2md-rs produced "<noscript></noscript>").

Conclusion

The first conclusion we should draw from these tests is that it is extremely important to check the output of your HTML cleaning library. Some of the libraries tested here are widely used, and yet they completely failed to find the important content on the pages we looked at. If you're building an application with an LLM and getting strange results, you should spot-check some of the text you're feeding in.
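A cheap guardrail is to sanity-check the extraction ratio before feeding text into a model. This is a hypothetical helper with arbitrarily chosen thresholds, purely for illustration:

```rust
/// Hypothetical guardrail: flag extraction output whose size looks wrong
/// relative to the input HTML. Thresholds are arbitrary illustrations.
fn looks_suspicious(html_len: usize, text_len: usize) -> bool {
    if html_len == 0 {
        return true;
    }
    let ratio = text_len as f64 / html_len as f64;
    // Near-empty output usually means the extractor missed the content;
    // output bigger than the input usually means the extractor exploded.
    ratio < 0.005 || ratio > 1.0
}

fn main() {
    // e.g. the readability result on the Hacker News front page above:
    // 11 bytes of output from ~37 KB of HTML.
    println!("{}", looks_suspicious(37_000, 11)); // true: almost certainly broken
}
```

A check like this would have caught every one of the crossed-out results in the tables above, from the empty strings to html2md's whitespace explosion.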

If we remove the contenders that completely failed any of these tests, we're left with:

  - august
  - dom_smoothie
  - fast_html2md
  - htmd
  - html2text
  - nanohtml2text

You might have a fine experience building with any of these, but I would choose to narrow this list down further based on a combination of their performance and manually inspecting their outputs:

fast_html2md

This library does a reasonable job transforming the HTML into markdown while being among the fastest performers and maintaining extremely low memory usage.

In the tests above, it kept its peak memory footprint between roughly 5 and 6 KB, independent of the input size. This is impressive but unsurprising, given that the underlying HTML library, lol_html, lets you tune its memory settings.

The blog post A History of HTML Parsing at Cloudflare: Part 2 gives more detail on the history and architecture of lol_html. If you're doing any kind of HTML manipulation, I would recommend reading that post and trying out their library.

dom_smoothie

While this has much higher memory usage than fast_html2md, and far fewer downloads at the time of writing, it is the only Readability implementation that correctly found the main text in the very limited subset of websites I tested. If you want to make sure you only include the text and none of the headers or other content, this might be the crate for you.

Appendix: HTML-to-Markdown with a Language Model

Jina has a couple of small language models designed to convert HTML to markdown. They are available on Hugging Face under a Creative Commons Non-Commercial license and via their API for commercial uses.

Depending on your use case, you might also want to try them out. The API-based version is included in the comparison tool under an optional feature flag. However, I left them out of the main comparison because the memory usage is going to be considerably higher than any of the Rust crates and the models are not freely available.


Discuss on Hacker News, Lobsters, or r/rust.



#ai #embeddings #rust #scour