Evan Schwartz

Comparing 13 Rust Crates for Extracting Text from HTML

Applications that run documents through LLMs or embedding models need to clean the text before feeding it into the model. I'm building a personalized content feed called Scour and was looking for a Rust crate to extract text from scraped HTML. I started off using a library that's used by a couple of LLM-related projects. However, while hunting a phantom memory leak, I built a little tool (emschwartz/html-to-text-comparison) to compare 13 Rust crates for extracting text from HTML and found that the results varied widely.

TL;DR: lol_html is a very impressive HTML rewriting crate from Cloudflare and fast_html2md is a newer HTML-to-Markdown crate that makes use of it. If you're doing web scraping or working with LLMs in Rust, you should take a look at both of those.

Approaches

At a high level, there are 3 categories of approaches we might use for cleaning HTML:

  1. HTML-to-text - as the name suggests, these crates convert whole HTML documents to plain text and were mostly developed for use cases like rendering HTML emails in terminals.
  2. HTML-to-markdown - these crates convert the HTML document to markdown and were built for a variety of uses, ranging from displaying web pages in terminals to general web scraping and LLM applications.
  3. Readability - the final set of crates are ports of the mozilla/readability library, which is used for the Firefox Reader View. These attempt to extract only the main content from the page by scoring DOM elements using a variety of heuristics.

Any of these should work for an LLM application, because we mostly care about stripping away HTML tags and extraneous content like scripts and CSS. I say "should" because some of these crates definitely do not work as well as you might expect.

Parsers

While there are a variety of different crates for extracting text from HTML, 10 out of the 13 I'm testing use the same underlying library for parsing the HTML: html5ever. This crate was developed as part of the Servo project and it is used by many different libraries and applications.

The catch when using html5ever, however, is that it does not ship with a DOM tree implementation. The Servo project does have a simple tree implementation using reference-counted pointers that is used for their tests. It comes with this warning though:

This crate is built for the express purpose of writing automated tests for the html5ever and xml5ever crates. It is not intended to be a production-quality DOM implementation, and has not been fuzzed or tested against arbitrary, malicious, or nontrivial inputs. No maintenance or support for any such issues will be provided. If you use this DOM implementation in a production, user-facing system, you do so at your own risk.

Despite the scary disclaimer, markup5ever_rcdom is used by plenty of libraries, including 7 out of the 10 crates I'm testing that use html5ever. The other 3 use DOM tree implementations from scraper, dom_query, and kuchiki (note that kuchiki is archived and unmaintained but Brave maintains a fork of it called kuchikiki).

Of the 3 remaining crates that do not use html5ever, two use custom HTML parsers and the third uses Cloudflare's lol_html streaming HTML rewriter. We'll talk more about lol_html below.

The Competitors

| Crate | Output | Parser | Tree | Notable Users | License |
|---|---|---|---|---|---|
| august | Text | html5ever | markup5ever_rcdom | | MIT |
| boilerpipe | Text | html5ever | scraper::Html | | MIT |
| dom_smoothie | Readability | html5ever | dom_query::Tree | | MIT |
| fast_html2md | Markdown | lol_html | N/A | Spider | MIT |
| htmd | Markdown | html5ever | markup5ever_rcdom | Swiftide | Apache-2.0 |
| html2md | Markdown | html5ever | markup5ever_rcdom | Atomic Data, ollama-rs, Lemmy | GPL-3.0+ |
| html2md-rs | Markdown | Custom | Custom | | MIT |
| html2text | Text | html5ever | markup5ever_rcdom | Lemmy, various terminal apps | MIT |
| llm_readability | Readability | html5ever | markup5ever_rcdom | Spider | MIT |
| mdka | Markdown | html5ever | markup5ever_rcdom | | Apache-2.0 |
| nanohtml2text | Text | Custom | Custom | | MIT |
| readability | Readability | html5ever | markup5ever_rcdom | langchain-rust, Kalosm, llm_utils | MIT |
| readable-readability | Readability | html5ever | kuchiki::Node | hackernews_tui | MIT |

Test Criteria

The main criteria to care about when selecting an HTML extraction library are speed, peak memory usage, how much the output shrinks relative to the input, and, above all, whether the output actually captures the page's main content.

Unlike when I was testing bitwise Hamming Distance implementations, I am not using Criterion for benchmarking this time. The output of these crates is not expected to be exactly equivalent, and speed is not the only criterion I wanted to compare.

Test Results

The test tool, emschwartz/html-to-text-comparison, is set up so that you can point it at any website and it will dump the output from each crate into a text file while printing various stats about each crate's run.

```shell
cargo install --locked --git https://github.com/emschwartz/html-to-text-comparison
html-to-text-comparison https://example.com
```

I would encourage you to try it yourself, but here are the results from a few different types of websites:

Hacker News Front Page

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 2015 | 70809 | 191.40% | 6411 | 82.67% | out/august.txt |
| ~~boilerpipe~~ | 1830 | 125587 | 339.46% | 66 | 99.82% 🤐 | out/boilerpipe.txt |
| dom_smoothie | 6458 | 200729 | 542.57% | 5950 | 83.92% | out/dom_smoothie.txt |
| fast_html2md | 1406 | 4806 | 12.99% | 11093 | 70.02% | out/fast_html2md.txt |
| htmd | 1789 | 38549 | 104.20% | 11097 | 70.00% | out/htmd.txt |
| ~~html2md~~ | 14312 | 918503 | 2482.71% | 3823657 | -10235.33% 🤯 | out/html2md.txt |
| html2md-rs | 1472 | 85923 | 232.25% | 16792 | 54.61% | out/html2md-rs.txt |
| html2text | 3028 | 100981 | 272.95% | 268567 | -625.94% | out/html2text.txt |
| ~~llm_readability~~ | 3852 | 72949 | 197.18% | 0 | 100.00% 🤐 | out/llm_readability.txt |
| ~~mdka~~ | 1291 | 35315 | 95.46% | 1 | 100.00% 🤐 | out/mdka.txt |
| nanohtml2text | 606 | 6975 | 18.85% | 10648 | 71.22% | out/nanohtml2text.txt |
| ~~readability~~ | 4129 | 67139 | 181.48% | 11 | 99.97% 🤐 | out/readability.txt |
| readable-readability | 1820 | 131031 | 354.18% | 3750 | 89.86% | out/readable-readability.txt |

Some of these are crossed out because the output is completely wrong. For example, llm_readability and mdka produced empty strings, readability produced only the string "Hacker News", and boilerpipe produced "195 points by recvonline 4 hours ago | hide | 181 comments\n15.". html2md exploded and output a file that was 100x larger than the original HTML, mostly filled with whitespace.

mozilla/readability Github Repo

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 6546 | 214932 | 62.55% | 12916 | 96.24% | august.txt |
| ~~boilerpipe~~ | 6574 | 340102 | 98.97% | 266 | 99.92% | boilerpipe.txt |
| dom_smoothie | 12428 | 498327 | 145.02% | 6446 | 98.12% | dom_smoothie.txt |
| fast_html2md | 3649 | 6317 | 1.84% | 14607 | 95.75% | fast_html2md.txt |
| htmd | 6388 | 160433 | 46.69% | 14071 | 95.91% | htmd.txt |
| html2md | 7368 | 200740 | 58.42% | 89019 | 74.09% | html2md.txt |
| html2md-rs | 4355 | 242241 | 70.50% | 17650 | 94.86% | html2md-rs.txt |
| html2text | 7548 | 244119 | 71.04% | 28699 | 91.65% | html2text.txt |
| ~~llm_readability~~ | 5039 | 144964 | 42.19% | 19 | 99.99% | llm_readability.txt |
| mdka | 6172 | 206179 | 60.00% | 6948 | 97.98% | mdka.txt |
| nanohtml2text | 2660 | 85684 | 24.94% | 18779 | 94.54% | nanohtml2text.txt |
| ~~readability~~ | 6056 | 151532 | 44.10% | 53 | 99.98% | readability.txt |
| ~~readable-readability~~ | 6000 | 212956 | 61.97% | 53 | 99.98% | readable-readability.txt |

As in the previous test, I've crossed out the crates that completely missed the mark. This time, all of the failing implementations seemed to focus on the wrong HTML element(s). For example, readability and readable-readability produced only the string "You can’t perform that action at this time."

Rust Lang Blog

| Name | Time (microseconds) | Peak Memory (bytes) | Peak Memory as % of HTML Size | Output Size (bytes) | % Reduction | Output File |
|---|---|---|---|---|---|---|
| august | 893 | 52032 | 240.09% | 12601 | 41.86% | august.txt |
| boilerpipe | 934 | 101874 | 470.07% | 5660 | 73.88% | boilerpipe.txt |
| dom_smoothie | 2129 | 129626 | 598.13% | 6649 | 69.32% | dom_smoothie.txt |
| fast_html2md | 639 | 5108 | 23.57% | 13102 | 39.54% | fast_html2md.txt |
| htmd | 798 | 20549 | 94.82% | 11958 | 44.82% | htmd.txt |
| html2md | 801 | 65159 | 300.66% | 13498 | 37.72% | html2md.txt |
| ~~html2md-rs~~ | 311 | 35988 | 166.06% | 21 | 99.90% | html2md-rs.txt |
| html2text | 1177 | 38758 | 178.84% | 13574 | 37.37% | html2text.txt |
| llm_readability | 2733 | 55464 | 255.92% | 5870 | 72.91% | llm_readability.txt |
| mdka | 895 | 19169 | 88.45% | 13147 | 39.34% | mdka.txt |
| nanohtml2text | 234 | 5345 | 24.66% | 12866 | 40.63% | nanohtml2text.txt |
| readability | 2252 | 54610 | 251.98% | 5801 | 73.23% | readability.txt |
| readable-readability | 609 | 80610 | 371.95% | 6561 | 69.73% | readable-readability.txt |

This is a more straightforward blog page and this time only one crate got it completely wrong (html2md-rs produced "<noscript></noscript>").

Conclusion

The first conclusion we should draw from these tests is that it is extremely important to check the output of your HTML cleaning library. Some of the libraries tested here are widely used, and yet they completely failed to find the important content on the pages we looked at. If you're building an application with an LLM and getting strange results, you should spot-check some of the text you're feeding in.
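A cheap guardrail is to sanity-check the extraction ratio before feeding text into a model. This is a hypothetical helper with arbitrarily chosen thresholds, purely for illustration:

```rust
/// Hypothetical guardrail: flag extraction output whose size looks wrong
/// relative to the input HTML. Thresholds are arbitrary illustrations.
fn looks_suspicious(html_len: usize, text_len: usize) -> bool {
    if html_len == 0 {
        return true;
    }
    let ratio = text_len as f64 / html_len as f64;
    // Near-empty output usually means the extractor missed the content;
    // output bigger than the input usually means the extractor exploded.
    ratio < 0.005 || ratio > 1.0
}

fn main() {
    // e.g. the readability result on the Hacker News front page above:
    // 11 bytes of output from ~37 KB of HTML.
    println!("{}", looks_suspicious(37_000, 11)); // true: almost certainly broken
}
```

A check like this would have caught every one of the crossed-out results in the tables above, from the empty strings to html2md's whitespace explosion.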

If we remove the contenders that completely failed any of these tests, we're left with:

  - august
  - dom_smoothie
  - fast_html2md
  - htmd
  - html2text
  - nanohtml2text

You might have a fine experience building with any of these, but I would choose to narrow this list down further based on a combination of their performance and manually inspecting their outputs:

fast_html2md

This library does a reasonable job transforming the HTML into markdown while being among the fastest performers and maintaining extremely low memory usage.

In the tests above, it kept its peak memory footprint between roughly 5 and 6 KB, independent of the input size. This is impressive but unsurprising, given that the underlying HTML library, lol_html, lets you tune its memory settings.

The blog post A History of HTML Parsing at Cloudflare: Part 2 gives more detail on the history and architecture of lol_html. If you're doing any kind of HTML manipulation, I would recommend reading that post and trying out their library.

dom_smoothie

While this has much higher memory usage than fast_html2md, and far fewer downloads at the time of writing, it is the only Readability implementation that correctly found the main text in the very limited subset of websites I tested. If you want to make sure you only include the text and none of the headers or other content, this might be the crate for you.

Appendix: HTML-to-Markdown with a Language Model

Jina has a couple of small language models designed to convert HTML to markdown. They are available on Hugging Face under a Creative Commons Non-Commercial license and via their API for commercial uses.

Depending on your use case, you might also want to try them out. The API-based version is included in the comparison tool under an optional feature flag. However, I left them out of the main comparison because the memory usage is going to be considerably higher than any of the Rust crates and the models are not freely available.


Discuss on Hacker News, Lobsters, or r/rust.



#ai #embeddings #rust #scour