Evan Schwartz

Subtleties of SQLite Indexes

In the last 6 months, Scour has gone from ingesting 330,000 pieces of content per month to over 1.4 million this month. The massive increase in the number of items slowed down the ranking for users' feeds and sent me looking for ways to speed it up again.

After spending too many hours trying in vain to squeeze more performance out of my queries and indexes, I dug into how SQLite's query planner uses indexes, learned some of the subtleties that explained why my initial tweaks weren't working, and sped up one of my main queries by ~35%.

Scour's items table

Scour is a personalized content feed that finds articles, blog posts, etc. related to users' interests. For better and for worse, Scour does its ranking on the fly whenever users load their feeds page. Initially, this took 100 milliseconds or less, thanks to binary vector embeddings and the fact that it uses SQLite, so there is no network latency to load data.

The most important table in Scour's database is the items table. It includes an ID, URL, title, language, publish date (stored as a Unix timestamp), and a text quality rating.
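
The exact schema isn't shown here, but a minimal sketch of the relevant columns might look like this (the id, url, and title definitions are illustrative; lang, published, and low_quality_probability are the columns the queries below actually touch):

CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT,
    lang TEXT,                    -- e.g. 'en'
    published INTEGER,            -- Unix timestamp
    low_quality_probability REAL  -- 0.0 (likely good) to 1.0 (likely junk)
);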

Scour's main ranking query filters items based on when they were published, whether they are in a language the user understands, and whether they are above a certain quality threshold.
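
Concretely, the filtering part of the query is shaped roughly like this (simplified; the full query appears with its plan below):

SELECT id, low_quality_probability
FROM items
WHERE published BETWEEN $1 AND $2   -- publish date range
AND low_quality_probability <= $3   -- quality threshold
AND lang IN (SELECT lang FROM user_languages WHERE user_id = $4)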

The question is: what indexes do we need to speed up this query?

Don't bother with multiple single-column indexes

When I first set up Scour's database, I put a bunch of indexes on the items table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless.

It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column.

In most cases, the query planner won't bother merging the results from two indexes on the same table. Instead, it will use one of the indexes and then scan all of the rows that match the filter for that index's column.
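
For example, with separate single-column indexes, a query that filters on both published and lang will typically pick just one of them (a sketch; exact plan output depends on your data and SQLite version):

CREATE INDEX idx_items_published ON items(published);
CREATE INDEX idx_items_lang ON items(lang);

EXPLAIN QUERY PLAN
SELECT id FROM items
WHERE published BETWEEN $1 AND $2
AND lang = 'en';
-- Typically plans as: SEARCH items USING INDEX idx_items_published
-- (published>? AND published<?), then checks lang on each matching row.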

It's worth being careful to only add indexes that real queries will actually use. Extra indexes won't hurt read performance, but each index takes up storage space, and more indexes slow down writes, because all of the indexes need to be updated when new rows are inserted into the table.

If we're going to have an index on multiple columns, which columns should we include and what order should we put them in?

Index column order matters

The order of conditions in a query doesn't matter, but the order of columns in an index very much does.

Columns that come earlier in the index should be more "selective": they should help the database narrow the result set as much as possible.
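
One rough way to gauge selectivity is to measure what fraction of rows each condition keeps. In SQLite, comparisons evaluate to 0 or 1, so AVG gives that fraction directly (the values below are placeholders):

SELECT
  AVG(published BETWEEN 1700000000 AND 1702592000) AS kept_by_date,
  AVG(low_quality_probability <= 0.9)              AS kept_by_quality,
  AVG(lang = 'en')                                 AS kept_by_lang
FROM items;
-- The smaller the fraction, the more selective the condition.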

In Scour's case, the most selective column is the publish date, followed by the quality rating, followed by the language. I put an index on those columns in that order:

CREATE INDEX idx_items_published_quality_lang
ON items(published, low_quality_probability, lang);

...and found that SQLite was only using one of the columns. Running this query:

EXPLAIN QUERY PLAN
SELECT id, low_quality_probability
FROM items
WHERE published BETWEEN $1 AND $2
AND low_quality_probability <= $3
AND lang IN (SELECT lang FROM user_languages WHERE user_id = $4)

Produced this query plan:

QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_published_quality_lang (published>? AND published<?)
`--CORRELATED LIST SUBQUERY 1
   `--SCAN user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1

It was using the right index but only filtering by published (note the part of the plan that says (published>? AND published<?)). Puzzling.

Left to right, no skipping, stops at the first range

My aha moment came while watching Aaron Francis' High Performance SQLite course. He said the main rule for SQLite indexes is: "Left to right, no skipping, stops at the first range." (This is a much clearer statement of the implications of the Where Clause Analysis buried in the Query Optimizer Overview section of the official docs.)

This rule means that the query planner will:

  1. Consider columns from left to right. In my case, the first column in the index is published. SQLite will search for rows where the published field is in the correct range before considering the other columns.
  2. No skipping means that SQLite cannot use only the 1st and 3rd columns of an index. As soon as it reaches a column in the index that does not appear in the query, it must scan through all of the rows matching the columns it has used so far (see the sketch after this list).
  3. Stops at the first range. That was the key I was missing. Filtering by the published timestamp first would indeed narrow down the results more than filtering first by one of the other columns. However, the fact that the query uses a range condition on the published column (WHERE published BETWEEN $1 AND $2) means that SQLite can only scan all of the rows in that published range, rather than fully utilizing the other columns in the index to home in on the correct rows.
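
To make the "no skipping" rule concrete, here's a sketch using the original index (a hypothetical query; the comments describe typical planner behavior):

-- Index: items(published, low_quality_probability, lang)
EXPLAIN QUERY PLAN
SELECT id FROM items
WHERE published = $1
AND lang = 'en';
-- Because the query skips low_quality_probability (the index's 2nd
-- column), SQLite can only seek on published; it then scans every
-- index entry with that published value to check lang.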

My query includes two ranges (published BETWEEN $1 AND $2 AND low_quality_probability <= $3), so the "stops at the first range" rule explains why I was only seeing the query planner use one of those columns. This rule does, however, suggest that I can create an index that will allow SQLite to filter by the one non-range condition (lang IN (SELECT lang FROM user_languages WHERE user_id = $4)) before using one of the ranges:

CREATE INDEX idx_items_lang_published_quality
ON items(lang, published, low_quality_probability);

The query plan shows that it can use both the lang and published columns (note the part that reads lang=? AND published>? AND published<?):

QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_lang_published_quality (lang=? AND published>? AND published<?)
`--LIST SUBQUERY 1
   `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)

Now we're using two out of the three columns to quickly filter the rows. Can we use all three? (Remember, the query planner won't be able to use multiple range conditions on the same index, so we'll need something else.)

WHERE conditions for partial indexes must exactly match

SQLite has a nice feature called Partial Indexes that allows you to define an index that only includes the subset of rows matching some condition.

In Scour's case, we only really want items where the low_quality_probability is less than or equal to 90%. The model I'm using to judge quality isn't that great, so I only trust it to filter out items if it's really sure they're low quality.

This means I can create an index like this:

CREATE INDEX idx_lang_published_quality_filtered
ON items(lang, published, low_quality_probability)
WHERE low_quality_probability <= .9;

And then update the query to use the same WHERE condition:

EXPLAIN QUERY PLAN
SELECT id, low_quality_probability
FROM items
WHERE low_quality_probability <= 0.9
AND published BETWEEN $1 AND $2
AND low_quality_probability <= $3
AND lang IN (SELECT lang FROM user_languages WHERE user_id = $4)

And it should use our new partial index... right? Wrong. This query is still using the previous index.

QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_items_lang_published_quality (lang=? AND published>? AND published<?)
`--LIST SUBQUERY 1
   `--SEARCH user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1 (user_id=?)

There's a subtle mistake in the relationship between our index and our query. Can you spot it?

Our index contains the condition WHERE low_quality_probability <= .9 but our query says WHERE low_quality_probability <= 0.9. These are mathematically equivalent but they are not exactly the same.

SQLite's query planner requires the conditions to match exactly in order for it to use a partial index. Relatedly, a condition of <= 0.95 or even <= 0.5 + 0.4 in the query would also not utilize the partial index.
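
Here's a sketch of the distinction (eligibility only; whether the planner actually picks the index also depends on the rest of the query). As I understand it, the bound-parameter case is why the literal condition has to appear in the query alongside the $3 threshold parameter:

-- Not eligible for the partial index (the condition doesn't textually
-- match the index's WHERE clause):
SELECT id FROM items WHERE low_quality_probability <= 0.9;       -- '0.9' vs '.9'
SELECT id FROM items WHERE low_quality_probability <= 0.5 + 0.4; -- not evaluated
SELECT id FROM items WHERE low_quality_probability <= $1;        -- parameter, not literal

-- Eligible (textually identical to the index's WHERE clause):
SELECT id FROM items WHERE low_quality_probability <= .9;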

If we rewrite our query to use the exact same condition of <= .9, we get the query plan:

QUERY PLAN
|--SEARCH items USING COVERING INDEX idx_lang_published_quality_filtered (ANY(lang) AND published>? AND published<?)
`--CORRELATED LIST SUBQUERY 1
   `--SCAN user_languages USING COVERING INDEX sqlite_autoindex_user_languages_1

Now, we're starting with the items whose low_quality_probability <= .9, then using the index to find the items in the desired language(s), and lastly narrowing down the results to the items that were published in the correct time range.

Better query plans find matching rows faster

As mentioned in the intro, these changes to the indexes and one of Scour's main ranking queries yielded a ~35% speedup.

Enabling the query planner to make better use of the indexes makes it so that SQLite doesn't need to scan as many rows to find the ones that match the query conditions.

Concretely, in Scour's case, filtering by language removes about 30% of items for most users and filtering out low-quality content removes a further 50% of what remains, leaving roughly a third of the rows. Together, these changes reduce the number of rows scanned by around 66%.
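
Those percentages are easy to sanity-check with a quick counting query (a sketch; the language list is a placeholder):

SELECT
  COUNT(*)                                              AS total,
  SUM(lang IN ('en'))                                   AS kept_by_lang,
  SUM(lang IN ('en') AND low_quality_probability <= .9) AS kept_by_both
FROM items;
-- Keeping ~70% of rows, then ~50% of those, leaves ~35% to scan.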

Sadly, however, a 66% reduction in the number of rows scanned does not directly translate to a 66% speedup in the query. If we're doing more than counting rows, the work of loading the data out of the database and processing it can be more resource-intensive than scanning rows to match conditions. (The optimized queries and indexes still load the same number of rows as before; they just identify the desired rows faster.) Nevertheless, a 35% speedup is a noticeable improvement.

Conclusion

It's worth digging into how your database's query planner uses indexes to help get to the bottom of performance issues.

If you're working with SQLite, remember that:

  1. A small number of composite indexes is more useful than multiple single-column indexes. It's better to have an index over (lang, published, low_quality_probability) than separate indexes for each column.
  2. The query planner uses the rule "Left to right, no skipping, stops at the first range". If a query has multiple range conditions, it may be worth putting the columns that use strict equality first in the index, like we did above with lang coming before published or low_quality_probability.
  3. Conditions used in WHERE clauses for partial indexes must exactly match the condition used in the corresponding query. <= 0.9 is not exactly the same as <= .9, even if they are mathematically equivalent.

Thanks to Aaron Francis for putting together the High Performance SQLite course! (I have no personal or financial relationship to him, but I appreciate his course unblocking me and helping me speed up Scour's ranking.) Thank you also to Adam Gluck and Alex Kesling for feedback on this post.

#scour #understanding