Unnecessary Optimization in Rust: Hamming Distances, SIMD, and Auto-Vectorization
If you're developing an application and find yourself running a benchmark whose results are measured in nanoseconds... you should probably stop and get back to more important tasks. But here we are.
I'm using binary vector embeddings to build Scour, a service that scours noisy feeds for content related to your interests. Scour uses the Hamming Distance to calculate the similarity between users' interests and each piece of content. (As a refresher, the Hamming Distance between two bit vectors is simply the number of bits that are set differently between the two.) I got nerd sniped into wondering which Hamming Distance implementation in Rust is fastest, learned more about SIMD and auto-vectorization, and ended up publishing a new (and extremely simple) implementation: hamming-bitwise-fast.
Benchmark Setup
The contestants were:
- A naive implementation using a for loop
- A naive implementation using iterators
- hamming
- hamming_rs
- simsimd
- hamming-bitwise-fast (the implementation amenable to auto-vectorization that I published)
(Note that we are not comparing the distances, stringzilla, or triple_accel crates because those calculate the Hamming distance between strings rather than between bit vectors.)
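For reference, the two naive contestants look roughly like this (a sketch for illustration; the function names are mine and the exact benchmark code is linked below):

```rust
// Naive implementation using a for loop: XOR byte by byte and count the set bits.
pub fn hamming_naive_for_loop(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut distance = 0;
    for i in 0..x.len() {
        distance += (x[i] ^ y[i]).count_ones();
    }
    distance
}

// Naive implementation using iterators: the same work expressed with zip/map/sum.
pub fn hamming_naive_iter(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    x.iter()
        .zip(y.iter())
        .map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
        .sum()
}
```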
The benchmark uses criterion and tests distance calculations for vectors of 512, 768, 1024, and 2048 bits. You can find the code here.
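If you haven't used criterion before, the harness looks roughly like this (a simplified sketch of the linked benchmark; the benchmark names and setup here are mine):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use hamming_bitwise_fast::hamming_bitwise_fast;

fn bench_hamming(c: &mut Criterion) {
    // One benchmark per vector length; e.g. 1024 bits = 128 bytes.
    for bits in [512, 768, 1024, 2048] {
        let x = vec![0xAAu8; bits / 8];
        let y = vec![0x55u8; bits / 8];
        c.bench_function(&format!("hamming_bitwise_fast/{bits}"), |b| {
            // black_box keeps the compiler from optimizing the call away.
            b.iter(|| hamming_bitwise_fast(black_box(&x), black_box(&y)))
        });
    }
}

criterion_group!(benches, bench_hamming);
criterion_main!(benches);
```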
The test was run on 3 machines:
- 2023 MacBook Pro M2 Max (results)
- Linode dedicated CPU machine with 2 CPUs and 4 GB of RAM (results)
- Fly.io performance node with 2 CPUs and 4 GB of RAM (results)
Benchmark Results
TL;DR
All of these are so fast that you probably won't notice the performance difference in most use cases. But if you've gotten to this point... hamming-bitwise-fast is fastest across most of these benchmarks, with simsimd extremely close behind. The former is 23 SLOC of Rust, while the latter links a C library with platform-specific instructions.
Also, on my MacBook, the naive implementations scored extremely well, taking second place. Question for the audience: do you know why that might be? (You can compare the assembly in the compiler explorer.)
(If you don't care about the benchmark details, skip to the conclusion for discussion of these results.)
2023 MacBook Pro M2 Max
Here are the results running on my 2023 MacBook Pro M2 Max laptop. All times are listed in nanoseconds.
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 4.6 | 🥇 6.5 | 🥇 4.6 | 🥇 7.5 |
naive for loop | 🥉 4.9 | 🥉 6.6 | 🥈 7.0 | 11.9 |
naive iterator | 🥈 4.8 | 🥉 6.6 | 🥈 7.0 | 🥉 11.7 |
hamming | 8.1 | 10.0 | 12.1 | 19.3 |
simsimd | 5.6 | 🥈 6.5 | 7.5 | 🥈 11.3 |
hamming_rs * | n/a | n/a | n/a | n/a |

* hamming_rs only compiles on x86 and x86_64 machines
Linode
Here are the results from a Linode Dedicated CPU machine with 2 CPU cores and 4 GB of RAM (again, all times in nanoseconds):
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 9.3 | 🥇 12.3 | 🥇 16.0 | 🥈 28.8 |
naive for loop | 40.9 | 60.3 | 80.0 | 159.0 |
naive iterator | 40.9 | 60.7 | 84.7 | 162.6 |
hamming | 65.8 | 96.8 | 130.5 | 🥉 46.7 |
simsimd | 🥈 12.9 | 🥈 15.0 | 🥈 17.8 | 🥇 27.8 |
hamming_rs | 🥉 24.5 | 🥉 33.1 | 🥉 40.0 | 70.7 |
Fly.io
Finally, here are the results from a Fly.io Performance Node with 2 CPUs and 4 GB of RAM (again, all times in nanoseconds):
Implementation | 512 bits | 768 bits | 1024 bits | 2048 bits |
---|---|---|---|---|
hamming-bitwise-fast | 🥇 11.6 | 🥇 14.1 | 🥇 18.5 | 🥈 33.3 |
naive for loop | 51.0 | 73.4 | 97.3 | 202.7 |
naive iterator | 51.5 | 73.9 | 115.4 | 201.3 |
hamming | 78.2 | 123.1 | 154.5 | 🥉 54.6 |
simsimd | 🥈 13.2 | 🥈 16.8 | 🥈 20.7 | 🥇 32.7 |
hamming_rs | 🥉 28.5 | 🥉 35.5 | 🥉 42.7 | 71.1 |
Note that Fly does not guarantee support for SIMD and there is an unanswered question on the community forum about it. It would definitely be nice to be able to reason about what features the machines will support, though that isn't as necessary when we rely on auto-vectorization.
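If you do want to reason about it at runtime, the standard library can probe for specific x86 features (just an illustrative check, unrelated to the benchmark code itself):

```rust
fn main() {
    // Only meaningful on x86_64; std can detect CPU features at runtime.
    #[cfg(target_arch = "x86_64")]
    {
        println!("sse2:   {}", is_x86_feature_detected!("sse2"));
        println!("popcnt: {}", is_x86_feature_detected!("popcnt"));
        println!("avx2:   {}", is_x86_feature_detected!("avx2"));
    }
}
```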
Conclusion: Auto-Vectorization FTW
Auto-vectorization refers to the compiler's ability to analyze your code and figure out when it can be optimized using Single Instruction, Multiple Data (SIMD) without you explicitly writing SIMD instructions.
matklad said it better than I could in Can You Trust a Compiler to Optimize Your Code?:
Writing SIMD by hand is really hard β youβll need to re-do the work for every different CPU architecture. Also, you probably think that, for scalar code, a compiler writes better assembly than you. What makes you think that youβd beat the compiler at SIMD, where there are more funky instructions and constraints? Compilers are tools. They can reliably vectorize code if it is written in an amenable-to-vectorization form.
The approach we ended up with follows his conclusion for how to write code that is amenable to auto-vectorization:
We do not blindly rely on the compilerβs optimization. Rather, we are aware about specific optimizations we need in this case, and write the code in a way that triggers them.
Specifically, for SIMD:
- we express the algorithm in terms of processing chunks of elements,
- within each chunk, we make sure that thereβs no branching and all elements are processed in the same way.
This is exactly what hamming-bitwise-fast does (this is the full implementation):
#[inline]
pub fn hamming_bitwise_fast(x: &[u8], y: &[u8]) -> u32 {
assert_eq!(x.len(), y.len());
// Process 8 bytes at a time using u64
let mut distance = x
.chunks_exact(8)
.zip(y.chunks_exact(8))
.map(|(x_chunk, y_chunk)| {
// This is safe because we know the chunks are exactly 8 bytes.
// Also, we don't care whether the platform uses little-endian or big-endian
// byte order. Since we're only XORing values, we just care that the
// endianness is the same for both.
let x_val = u64::from_ne_bytes(x_chunk.try_into().unwrap());
let y_val = u64::from_ne_bytes(y_chunk.try_into().unwrap());
(x_val ^ y_val).count_ones()
})
.sum::<u32>();
if x.len() % 8 != 0 {
distance += x
.chunks_exact(8)
.remainder()
.iter()
.zip(y.chunks_exact(8).remainder())
.map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
.sum::<u32>();
}
distance
}
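And a quick sanity check of how you'd call it (my example, not from the crate's docs):

```rust
fn main() {
    // 1024-bit vectors = 128 bytes each.
    let x = [0b1010_1010u8; 128];
    let y = [0b0101_0101u8; 128];

    // Every bit differs, so the distance is all 1024 bits...
    assert_eq!(hamming_bitwise_fast(&x, &y), 1024);
    // ...and a vector is at distance 0 from itself.
    assert_eq!(hamming_bitwise_fast(&x, &x), 0);
}
```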
(I also tried using 16-byte chunks and u128s and it was 30% faster specifically on the 512- and 768-bit vectors on my MacBook but slower on all of the other benchmarks. Question for the audience: any idea why that might be?)
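For reference, that variant just widens the chunks, roughly like this (a sketch of what I tried, not code from the published crate):

```rust
// Same shape as hamming_bitwise_fast, but processing 16 bytes at a time as u128.
pub fn hamming_bitwise_u128(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut distance = x
        .chunks_exact(16)
        .zip(y.chunks_exact(16))
        .map(|(x_chunk, y_chunk)| {
            let x_val = u128::from_ne_bytes(x_chunk.try_into().unwrap());
            let y_val = u128::from_ne_bytes(y_chunk.try_into().unwrap());
            (x_val ^ y_val).count_ones()
        })
        .sum::<u32>();
    // Handle any trailing bytes that don't fill a 16-byte chunk
    // (the remainder iterator is simply empty when the length is a multiple of 16).
    distance += x
        .chunks_exact(16)
        .remainder()
        .iter()
        .zip(y.chunks_exact(16).remainder())
        .map(|(x_byte, y_byte)| (x_byte ^ y_byte).count_ones())
        .sum::<u32>();
    distance
}
```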
Now, back to more important tasks!
Thanks to Alex Kesling for the original nerd snipe, feedback on drafts of this post, and the suggestion to try out auto-vectorization.
Discuss on Lobsters, Reddit, Hacker News, or Bluesky.