Unnecessary Optimization in Rust: Hamming Distances, SIMD, and Auto-Vectorization
If you're developing an application and find yourself running a benchmark whose results are measured in nanoseconds... you should probably stop and get back to more important tasks. But here we are.
I'm using binary vector embeddings to build Scour, a service that scours noisy feeds for content related to your interests. Scour uses the Hamming Distance to calculate the similarity between users' interests and each piece of content. (As a refresher, the Hamming Distance between two bit vectors is simply the number of bits that are set differently between the two.) I got nerd sniped into wondering which Hamming Distance implementation in Rust is fastest, learned more about SIMD and auto-vectorization, and ended up publishing a new (and extremely simple) implementation: hamming-bitwise-fast.
Benchmark Setup
The contestants were:
- A naive implementation using a for loop
- A naive implementation using iterators
- hamming
- hamming_rs
- simsimd
- hamming-bitwise-fast (the implementation amenable to auto-vectorization that I published)
(Note that we are not comparing the distances, stringzilla, or triple_accel crates because those calculate the Hamming distance between strings rather than between bit vectors.)
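For reference, the two naive contestants look roughly like this (a sketch for illustration; the function names are mine and the exact benchmark code is linked below):

```rust
// Naive implementation using a for loop: XOR byte by byte and count the set bits.
pub fn hamming_naive_for_loop(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut distance = 0;
    for i in 0..x.len() {
        distance += (x[i] ^ y[i]).count_ones();
    }
    distance
}

// Naive implementation using iterators: the same work expressed with zip/map/sum.
pub fn hamming_naive_iter(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    x.iter()
        .zip(y.iter())
        .map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
        .sum()
}
```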
The benchmark uses criterion and tests distance calculations for vectors of 512, 768, 1024, and 2048 bits. You can find the code here.
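If you haven't used criterion before, the harness looks roughly like this (a simplified sketch of the linked benchmark; the benchmark names and setup here are mine):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use hamming_bitwise_fast::hamming_bitwise_fast;

fn bench_hamming(c: &mut Criterion) {
    // One benchmark per vector length; e.g. 1024 bits = 128 bytes.
    for bits in [512, 768, 1024, 2048] {
        let x = vec![0xAAu8; bits / 8];
        let y = vec![0x55u8; bits / 8];
        c.bench_function(&format!("hamming_bitwise_fast/{bits}"), |b| {
            // black_box keeps the compiler from optimizing the call away.
            b.iter(|| hamming_bitwise_fast(black_box(&x), black_box(&y)))
        });
    }
}

criterion_group!(benches, bench_hamming);
criterion_main!(benches);
```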
The test was run on 3 machines:
- 2023 MacBook Pro M2 Max (results)
- Linode dedicated CPU machine with 2 CPUs and 4 GB of RAM (results)
- Fly.io performance node with 2 CPUs and 4 GB of RAM (results)
Benchmark Results
TL;DR
All of these are so fast that you probably won't notice the performance difference in most use cases. But if you've gotten to this point... hamming-bitwise-fast is fastest across most of these benchmarks, with simsimd extremely close behind. The former is 23 SLOC of Rust, while the latter links a C library with platform-specific instructions.
Also, on my MacBook, the naive implementations scored extremely well, taking second place. Question for the audience: do you know why that might be? (You can compare the assembly in the compiler explorer.)
(If you don't care about the benchmark details, skip to the conclusion for discussion of these results.)
2023 MacBook Pro M2 Max
Here are the results running on my 2023 MacBook Pro M2 Max laptop. All times are listed in nanoseconds.
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 4.6 | 🥇 6.5 | 🥇 4.6 | 🥇 7.5 |
naive for loop | 🥉 4.9 | 🥉 6.6 | 🥈 7.0 | 11.9 |
naive iterator | 🥈 4.8 | 🥉 6.6 | 🥈 7.0 | 🥉 11.7 |
hamming | 8.1 | 10.0 | 12.1 | 19.3 |
simsimd | 5.6 | 🥈 6.5 | 7.5 | 🥈 11.3 |
hamming_rs * | n/a | n/a | n/a | n/a |

* hamming_rs only compiles on x86 and x86_64 machines
Linode
Here are the results from a Linode Dedicated CPU machine with 2 CPU cores and 4 GB of RAM (again, all times in nanoseconds):
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 9.3 | 🥇 12.3 | 🥇 16.0 | 🥈 28.8 |
naive for loop | 40.9 | 60.3 | 80.0 | 159.0 |
naive iterator | 40.9 | 60.7 | 84.7 | 162.6 |
hamming | 65.8 | 96.8 | 130.5 | 🥉 46.7 |
simsimd | 🥈 12.9 | 🥈 15.0 | 🥈 17.8 | 🥇 27.8 |
hamming_rs | 🥉 24.5 | 🥉 33.1 | 🥉 40.0 | 70.7 |
Fly.io
Finally, here are the results from a Fly.io Performance Node with 2 CPUs and 4 GB of RAM (again, all times in nanoseconds):
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 11.6 | 🥇 14.1 | 🥇 18.5 | 🥈 33.3 |
naive for loop | 51.0 | 73.4 | 97.3 | 202.7 |
naive iterator | 51.5 | 73.9 | 115.4 | 201.3 |
hamming | 78.2 | 123.1 | 154.5 | 🥉 54.6 |
simsimd | 🥈 13.2 | 🥈 16.8 | 🥈 20.7 | 🥇 32.7 |
hamming_rs | 🥉 28.5 | 🥉 35.5 | 🥉 42.7 | 71.1 |
Note that Fly does not guarantee support for SIMD and there is an unanswered question on the community forum about it. It would definitely be nice to be able to reason about what features the machines will support, though that isn't as necessary when we rely on auto-vectorization.
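If you do want to reason about it at runtime, the standard library can probe for specific x86 features (just an illustrative check, unrelated to the benchmark code itself):

```rust
fn main() {
    // Only meaningful on x86_64; std can detect CPU features at runtime.
    #[cfg(target_arch = "x86_64")]
    {
        println!("sse2:   {}", is_x86_feature_detected!("sse2"));
        println!("popcnt: {}", is_x86_feature_detected!("popcnt"));
        println!("avx2:   {}", is_x86_feature_detected!("avx2"));
    }
}
```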
Conclusion: Auto-Vectorization FTW
Auto-vectorization refers to the compiler's ability to analyze your code and figure out when it can be optimized using Single Instruction, Multiple Data (SIMD) without you explicitly writing SIMD instructions.
matklad said it better than I could in Can You Trust a Compiler to Optimize Your Code?:
Writing SIMD by hand is really hard β youβll need to re-do the work for every different CPU architecture. Also, you probably think that, for scalar code, a compiler writes better assembly than you. What makes you think that youβd beat the compiler at SIMD, where there are more funky instructions and constraints? Compilers are tools. They can reliably vectorize code if it is written in an amenable-to-vectorization form.
The approach we ended up with follows his conclusion for how to write code that is amenable to auto-vectorization:
We do not blindly rely on the compilerβs optimization. Rather, we are aware about specific optimizations we need in this case, and write the code in a way that triggers them.
Specifically, for SIMD:
- we express the algorithm in terms of processing chunks of elements,
- within each chunk, we make sure that thereβs no branching and all elements are processed in the same way.
This is exactly what hamming-bitwise-fast does (this is the full implementation):
#[inline]
pub fn hamming_bitwise_fast(x: &[u8], y: &[u8]) -> u32 {
assert_eq!(x.len(), y.len());
// Process 8 bytes at a time using u64
let mut distance = x
.chunks_exact(8)
.zip(y.chunks_exact(8))
.map(|(x_chunk, y_chunk)| {
// This is safe because we know the chunks are exactly 8 bytes.
// Also, we don't care whether the platform uses little-endian or big-endian
// byte order. Since we're only XORing values, we just care that the
// endianness is the same for both.
let x_val = u64::from_ne_bytes(x_chunk.try_into().unwrap());
let y_val = u64::from_ne_bytes(y_chunk.try_into().unwrap());
(x_val ^ y_val).count_ones()
})
.sum::<u32>();
if x.len() % 8 != 0 {
distance += x
.chunks_exact(8)
.remainder()
.iter()
.zip(y.chunks_exact(8).remainder())
.map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
.sum::<u32>();
}
distance
}
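And a quick sanity check of how you'd call it (my example, not from the crate's docs):

```rust
fn main() {
    // 1024-bit vectors = 128 bytes each.
    let x = [0b1010_1010u8; 128];
    let y = [0b0101_0101u8; 128];

    // Every bit differs, so the distance is all 1024 bits...
    assert_eq!(hamming_bitwise_fast(&x, &y), 1024);
    // ...and a vector is at distance 0 from itself.
    assert_eq!(hamming_bitwise_fast(&x, &x), 0);
}
```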
(I also tried using 16-byte chunks and u128s and it was 30% faster specifically on the 512- and 768-bit vectors on my MacBook but slower on all of the other benchmarks. Question for the audience: any idea why that might be?)
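For reference, that variant just widens the chunks, roughly like this (a sketch of what I tried, not code from the published crate):

```rust
// Same shape as hamming_bitwise_fast, but processing 16 bytes at a time as u128.
pub fn hamming_bitwise_u128(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut distance = x
        .chunks_exact(16)
        .zip(y.chunks_exact(16))
        .map(|(x_chunk, y_chunk)| {
            let x_val = u128::from_ne_bytes(x_chunk.try_into().unwrap());
            let y_val = u128::from_ne_bytes(y_chunk.try_into().unwrap());
            (x_val ^ y_val).count_ones()
        })
        .sum::<u32>();
    // Handle any trailing bytes that don't fill a 16-byte chunk
    // (the remainder iterator is simply empty when the length is a multiple of 16).
    distance += x
        .chunks_exact(16)
        .remainder()
        .iter()
        .zip(y.chunks_exact(16).remainder())
        .map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
        .sum::<u32>();
    distance
}
```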
Now, back to more important tasks!
Thanks to Alex Kesling for the original nerd snipe, feedback on drafts of this post, and the suggestion to try out auto-vectorization.
Discuss on Lobsters, Reddit, Hacker News, or Bluesky.