Subtleties of SQLite Indexes
In the last 6 months, Scour has gone from ingesting 330,000 pieces of content per month to over 1.4 million this month. The massive increase in the number of items slowed down the ranking for users' feeds and sent me looking for ways to speed it up again.
After spending too many hours trying in vain to squeeze more performance out of my queries and indexes, I dug into how SQLite's query planner uses indexes, learned some of the subtleties that explained why my initial tweaks weren't working, and sped up one of my main queries by ~35%.
Scour's items table
Scour is a personalized content feed that finds articles, blog posts, etc related to users' interests. For better and for worse, Scour does its ranking on the fly whenever users load their feeds page. Initially, this took 100 milliseconds or less, thanks to binary vector embeddings and the fact that it uses SQLite, so there is no network latency to load data.
The most important table in Scour's database is the items table. It includes an ID, URL, title, language, publish date (stored as a Unix timestamp), and a text quality rating.
Scour's main ranking query filters items based on when they were published, whether they are in a language the user understands, and whether they are above a certain quality threshold.
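The post doesn't show the full schema, but based on the columns described above and the queries later on, the table can be sketched roughly like this (the exact column names, types, and constraints beyond those mentioned are my assumptions), here via Python's sqlite3 so it's runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    id                      INTEGER PRIMARY KEY,
    url                     TEXT NOT NULL,
    title                   TEXT,
    lang                    TEXT,
    published               INTEGER,  -- Unix timestamp
    low_quality_probability REAL      -- 0.0 to 1.0, higher = more likely low quality
);
CREATE TABLE user_languages (
    user_id INTEGER,
    lang    TEXT,
    PRIMARY KEY (user_id, lang)
);
""")

# The shape of the ranking filter: publish window, quality threshold, user's languages.
RANKING_FILTER = """
    SELECT id, low_quality_probability
    FROM items
    WHERE published BETWEEN ? AND ?
      AND low_quality_probability <= ?
      AND lang IN (SELECT lang FROM user_languages WHERE user_id = ?)
"""
rows = conn.execute(RANKING_FILTER, (0, 100, 0.9, 1)).fetchall()
print(rows)  # empty until data is inserted
```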
The question is: what indexes do we need to speed up this query?
Don't bother with multiple single-column indexes
When I first set up Scour's database, I put a bunch of indexes on the items table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless.
It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column.
In most cases, the query planner won't bother merging the results from two indexes on the same table. Instead, it will use one of the indexes and then scan all of the rows that match the filter for that index's column.
It's worth being careful to only add indexes that will be used by real queries. Extra single-column indexes won't hurt read performance, but each index takes up storage space, and more indexes slow down writes, because every index needs to be updated when new rows are inserted into the table.
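You can see the single-index behavior directly with EXPLAIN QUERY PLAN. In this sketch (a simplified table and made-up index names, not Scour's real schema), two single-column indexes exist but the planner searches with only one of them and checks the other condition row by row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, published INTEGER, lang TEXT);
CREATE INDEX idx_items_published ON items(published);
CREATE INDEX idx_items_lang ON items(lang);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM items
    WHERE published BETWEEN ? AND ? AND lang = ?
""", (0, 100, "en")).fetchall()

# The detail column names a single index; SQLite does not intersect
# two indexes on the same table for AND-ed conditions.
for row in plan:
    print(row[3])
```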
If we're going to have an index on multiple columns, which columns should we include and what order should we put them in?
Index column order matters
The order of conditions in a query doesn't matter, but the order of columns in an index very much does.
Columns that come earlier in the index should be more "selective": they should help the database narrow the results set as much as possible.
In Scour's case, the most selective column is the publish date, followed by the quality rating, followed by the language. I put an index on those columns in that order:
CREATE INDEX idx_items_published_quality_lang
ON items(published, low_quality_probability, lang);
...and found that SQLite was only using one of the columns. Running this query:
EXPLAIN QUERY PLAN
SELECT id, low_quality_probability
FROM items
WHERE published BETWEEN $1 AND $2
AND low_quality_probability <= $3
AND lang IN (SELECT lang FROM user_languages WHERE user_id = $4)
produced this query plan:
QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_published_quality_lang (published>? AND published<?)
`--CORRELATED LIST SUBQUERY 1
`--SCAN user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1
It was using the right index but only filtering by published (note the part of the plan that says (published>? AND published<?)). Puzzling.
Left to right, no skipping, stops at the first range
My aha moment came while watching Aaron Francis' High Performance SQLite course. He said the main rule for SQLite indexes is: "Left to right, no skipping, stops at the first range." (This is a much clearer statement of the implications of the Where Clause Analysis buried in the Query Optimizer Overview section of the official docs.)
This rule means that the query planner will:
- Consider columns from left to right. In my case, the first column in the index is published. SQLite will search for rows where the published field is in the correct range before considering the other columns.
- No skipping means that SQLite cannot use only the 1st and 3rd columns in an index. As soon as it reaches a column in the index that does not appear in the query, it must scan through all of the rows matching the earlier columns.
- Stops at the first range. That was the key I was missing. Filtering by the published timestamp first would indeed narrow down the results more than filtering first by one of the other columns. However, the fact that the query uses a range condition on the published column (WHERE published BETWEEN $1 AND $2) means that SQLite can only scan all of the rows in that published range, rather than fully utilizing the other columns in the index to hone in on the correct rows.
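The "no skipping" part of the rule is easy to demonstrate on a toy table (made-up names, not Scour's schema): with an index on (a, b, c) and a query that filters on a and c but not b, the plan shows the index constraint stopping at a:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
CREATE INDEX idx_abc ON t(a, b, c);
""")

# b is missing from the WHERE clause, so the index is only usable up to a;
# the c = ? condition is checked against each row matching a = ?.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = ? AND c = ?", (1, 2)
).fetchall()
print(plan[0][3])  # e.g. "SEARCH t USING COVERING INDEX idx_abc (a=?)"
```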
My query includes two ranges (published BETWEEN $1 AND $2 AND low_quality_probability <= $3), so the "stops at the first range" rule explains why I was only seeing the query planner use one of those columns. This rule does, however, suggest that I can create an index that will allow SQLite to filter by the one non-range condition (lang IN (SELECT lang FROM user_languages WHERE user_id = $4)) before using one of the ranges:
CREATE INDEX idx_items_lang_published_quality
ON items(lang, published, low_quality_probability);
The query plan shows that it can use both the lang and published columns (note the part that reads lang=? AND published>? AND published<?):
QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_lang_published_quality (lang=? AND published>? AND published<?)
`--LIST SUBQUERY 1
`--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
Now we're using two out of the three columns to quickly filter the rows. Can we use all three? (Remember, the query planner won't be able to use multiple range conditions on the same index, so we'll need something else.)
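The effect of the column order can be checked side by side. This sketch simplifies the lang IN (...) subquery to lang = ? and uses INDEXED BY to pin the planner to each index in turn (the table is pared down to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, published INTEGER,
                    low_quality_probability REAL, lang TEXT);
CREATE INDEX idx_published_quality_lang
    ON items(published, low_quality_probability, lang);
CREATE INDEX idx_lang_published_quality
    ON items(lang, published, low_quality_probability);
""")

def plan_for(index_name):
    # INDEXED BY forces one index so the two orderings can be compared directly
    rows = conn.execute(f"""
        EXPLAIN QUERY PLAN
        SELECT id FROM items INDEXED BY {index_name}
        WHERE lang = ? AND published BETWEEN ? AND ?
          AND low_quality_probability <= ?
    """, ("en", 0, 100, 0.9)).fetchall()
    return rows[0][3]

# Range column first: the index stops being usable after the published range.
print(plan_for("idx_published_quality_lang"))
# Equality column first: lang=? narrows first, then the published range applies.
print(plan_for("idx_lang_published_quality"))
```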
WHERE conditions for partial indexes must exactly match
SQLite has a nice feature called Partial Indexes that allows you to define an index that only applies to a subset of the rows matching some conditions.
In Scour's case, we only really want items where the low_quality_probability is less than or equal to 90%. The model I'm using to judge quality isn't that great, so I only trust it to filter out items if it's really sure they're low quality.
This means I can create an index like this:
CREATE INDEX idx_lang_published_quality_filtered
ON items(lang, published, low_quality_probability)
WHERE low_quality_probability <= .9;
And then update the query to use the same WHERE condition:
EXPLAIN QUERY PLAN
SELECT id, low_quality_probability
FROM items
WHERE low_quality_probability <= 0.9
AND published BETWEEN $1 AND $2
AND low_quality_probability <= $3
AND lang IN (SELECT lang FROM user_languages WHERE user_id = $4)
And it should use our new partial index... right? Wrong. This query is still using the previous index.
QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_lang_published_quality (lang=? AND published>? AND published<?)
`--LIST SUBQUERY 1
`--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)
There's a subtle mistake in the relationship between our index and our query. Can you spot it?
Our index contains the condition WHERE low_quality_probability <= .9 but our query says WHERE low_quality_probability <= 0.9. These are mathematically equivalent but they are not exactly the same.
SQLite's query planner requires the conditions to match exactly in order for it to use a partial index. Relatedly, a condition of <= 0.95 or even <= 0.5 + 0.4 in the query would also not utilize the partial index.
If we rewrite our query to use the exact same condition of <= .9, we get the query plan:
QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_lang_published_quality_filtered (ANY(lang) AND published>? AND published<?)
`--CORRELATED LIST SUBQUERY 1
`--SCAN user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1
Now, we're starting with the items whose low_quality_probability <= .9, then using the index to find the items in the desired language(s), and lastly narrowing down the results to the items that were published in the correct time range.
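The exact-match requirement can be verified in a few lines. This sketch uses a pared-down table and a made-up index name; the key part is that the query's condition is spelled character for character like the index's WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, published INTEGER,
                    low_quality_probability REAL, lang TEXT);
CREATE INDEX idx_quality_filtered
    ON items(lang, published, low_quality_probability)
    WHERE low_quality_probability <= .9;
""")

def plan_for(condition):
    rows = conn.execute(f"""
        EXPLAIN QUERY PLAN
        SELECT id FROM items
        WHERE {condition}
          AND lang = ? AND published BETWEEN ? AND ?
    """, ("en", 0, 100)).fetchall()
    return " ".join(r[3] for r in rows)

# Spelled exactly like the index's WHERE clause -> the partial index is usable.
print(plan_for("low_quality_probability <= .9"))
# Mathematically equivalent but spelled differently -> the planner may fall
# back to a full table scan (exact behavior can vary by SQLite version).
print(plan_for("low_quality_probability <= 0.9"))
```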
Better query plans find matching rows faster
As mentioned in the intro, these changes to the indexes and one of Scour's main ranking queries yielded a ~35% speedup.
Enabling the query planner to make better use of the indexes makes it so that SQLite doesn't need to scan as many rows to find the ones that match the query conditions.
Concretely, in Scour's case, filtering by language removes about 30% of items for most users and filtering out low quality content removes a further 50%. Together, these changes reduce the number of rows scanned by around 66%.
Sadly, however, a 66% reduction in the number of rows scanned does not directly translate to a 66% speedup in the query. If we're doing more than counting rows, the work to load the data out of the database and process it can be more resource intensive than scanning rows to match conditions. (The optimized queries and indexes still load the same number of rows as before; they just identify the desired rows faster.) Nevertheless, a 35% speedup is a noticeable improvement.
Conclusion
It's worth digging into how your database's query planner uses indexes to help get to the bottom of performance issues.
If you're working with SQLite, remember that:
- A small number of composite indexes is more useful than multiple single-column indexes. It's better to have an index over (lang, published, low_quality_probability) than separate indexes for each.
- The query planner uses the rule "Left to right, no skipping, stops at the first range". If a query has multiple range conditions, it may be worth putting the columns that use strict equality first in the index, like we did above with lang coming before published or low_quality_probability.
- Conditions used in WHERE clauses for partial indexes must exactly match the condition used in the corresponding query. <= 0.9 is not exactly the same as <= .9, even if they are mathematically equivalent.
Thanks to Aaron Francis for putting together the High Performance SQLite course! (I have no personal or financial relationship to him, but I appreciate his course unblocking me and helping me speed up Scour's ranking.) Thank you also to Adam Gluck and Alex Kesling for feedback on this post.