Evan Schwartz

Comparing full text search algorithms: BM25, TF-IDF, and Postgres

I wrote another post about Understanding the BM25 full text search algorithm and had initially included comparisons with two other algorithms. However, that post was already quite long so here are the brief comparisons between BM25, TF-IDF, and PostgreSQL's full text search.

BM25 vs TF-IDF

TF-IDF was the main model that was used prior to the development of BM25 (it was used in Lucene and Elasticsearch until 2016).

While TF-IDF also tries to use Term Frequency (TF) and the Inverse Document Frequency (IDF) to estimate the relevance of documents, BM25 introduced some key improvements:

Term frequency in tf/idf vs bm25

From Improved Text Scoring with BM25 (slides)

Postgres' built-in full text search is simpler and less powerful than BM25. Some of the specific differences include:

As far as I understand, Postgres' built-in search results may be somewhat worse than a system using BM25, though that may not be an issue for systems already using Postgres where simplicity is more important than search quality.

It is also worth mentioning that ParadeDB has a Postgres extension called pg_bm25 (formerly called pg_search), which adds support for BM25-based ranking.

Coincidentally, just yesterday I found an article on Scour (the personalized content feed I'm building) that benchmarks ParadeDB's full text search extension against PostgreSQL's native full text search. Spoiler alert: ParadeDB's implementation of BM25-based search is much faster.

#scour #search #understanding