Evan Schwartz

Understanding Memory Ordering in Rust

I am reading Mara Bos' Rust Atomics and Locks. On the first pass, I didn't really grok memory ordering (what Mara describes as "the most complicated topic of the book"). So here's my attempt at understanding by explaining.

Why do we care about memory ordering

Atomic operations and memory ordering come into play when your program has multiple threads working on the same data, such as a shared counter or context object.

In single-threaded code, you can read a function's statements from top to bottom and understand what it's going to do. Earlier statements will happen before later statements.

In multi-threaded code, this is not so simple. Another thread could be reading or writing the same variables as a function you're looking at in ways that might break what your program is supposed to do.

Multi-threading is extra tricky to reason about because of two additional factors:

  1. Compilers reorder and optimize instructions in ways that preserve single-threaded behavior but don't take other threads into account.
  2. Processors also execute instructions out of order, and one core's writes may become visible to other cores later than you'd expect.

Memory ordering is how we tell the compiler and processor which operations can safely be reordered and which cannot. (Memory ordering is also used to implement higher-level multi-threading building blocks like channels, Mutexes, and RwLocks.)

Three categories of memory ordering

While the std::sync::atomic::Ordering enum has 5 variants, there are really 3 categories of memory orderings:

  1. Relaxed
  2. Release / Acquire (Acquire, Release, AcqRel)
  3. Sequential Consistency (SeqCst)

Let's dive into each of these categories, starting with the "strongest".

You probably don't want Sequential Consistency

Mara makes this point very clearly in the book:

While it might seem like the easiest memory ordering to reason about, SeqCst ordering is almost never necessary in practice...

Myth: Sequentially consistent memory ordering is a great default and is always correct.

Putting aside performance concerns, sequentially consistent memory ordering is often seen as the perfect type of memory ordering to default to, because of its strong guarantees. It’s true that if any other memory ordering is correct, SeqCst is also correct. This might make it sound like SeqCst is always correct. However, it’s entirely possible that a concurrent algorithm is simply incorrect, regardless of memory ordering. ... It is advisable to see SeqCst as a warning sign. Seeing it in the wild often means that either something complicated is going on, or simply that the author did not take the time to analyze their memory ordering related assumptions, both of which are reasons for extra scrutiny.

Ruh roh. I'm definitely guilty of having written code with SeqCst in the past, just because it seemed like the safest default.

So, what else do we have to work with?

Relax a little

Relaxed memory ordering is useful when you want to perform atomic operations on a single variable but you don't care about synchronizing different operations across threads. For example, if you are incrementing a single counter, you might not care about the order that the increment operations happen in as long as each is individually atomic.

Mara gives the example of this code, assuming that the functions a and b are run on separate threads:

static X: AtomicI32 = AtomicI32::new(0);

fn a() {
    X.fetch_add(5, Relaxed);
    X.fetch_add(10, Relaxed);
}

fn b() {
    let a = X.load(Relaxed);
    let b = X.load(Relaxed);
    let c = X.load(Relaxed);
    let d = X.load(Relaxed);
    println!("{a} {b} {c} {d}");
}

In this program, the only guarantee is that every thread observes the same total modification order of X: it goes from 0 to 5 to 15 and never holds any other value. The loads in b can observe those values at different points, but never out of order. It might print "0 0 0 0" or "0 0 5 15", but never "0 5 10 15" (10 is never stored) and never "5 0 15 15" (values can't go backwards).

Here is an illustration of 4 different ways that the operations could be ordered:

[Diagram: four possible interleavings of the two fetch_add operations and the four loads]

To repeat, Relaxed is useful when:

  1. You need individual operations on a single atomic variable to be atomic, and
  2. You don't need to synchronize any other reads or writes between threads.

Into the depths: Release and Acquire ordering

Now that we've seen that we probably don't want SeqCst and that Relaxed is for atomic operations without synchronization, let's turn our attention to Release and Acquire.

One thing to clarify up front: you should not think about Release and Acquire as two different "modes". Release is only relevant for write/store operations. Acquire is only relevant for read/load operations.

Release marks a moment in the program's overall timeline, declaring that all writes that happened with or before this moment will be visible after an Acquire on this same variable.

In simpler terms, let's say you have two threads doing some work and reading and writing data:

Producer Thread       Consumer Thread
πŸ› οΈ work
πŸ› οΈ work
✏️ write data
✏️ write more data
🚩 Release
                      🚩 Acquire
                      πŸ“– read data
                      πŸ“– read more data
                      πŸ› οΈ work

Release and Acquire synchronize the threads: every write the producer makes before (or as part of) its Release store is guaranteed to be visible to the consumer after its Acquire load on the same variable.

The use cases for this type of ordering include:

  1. Signaling that some data is ready to be consumed by another thread
  2. Implementing locks that protect access to shared data

Signaling example

Let's look at Mara's example of using an AtomicBool to signal when some data is ready:

use std::sync::atomic::{AtomicBool, AtomicU64};
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    std::thread::spawn(|| {
        DATA.store(123, Relaxed);
        READY.store(true, Release); // Everything from before this store ..
    });
    while !READY.load(Acquire) { // .. is visible after this loads `true`.
        std::thread::sleep(std::time::Duration::from_millis(100));
        println!("waiting...");
    }
    println!("{}", DATA.load(Relaxed));
}

The main thread is waiting for the DATA to be ready and sleeping for 100 milliseconds between each check.

Crucially, the line READY.store(true, Release); ensures that all writes that happened with or before this moment will be visible after an Acquire on this same variable. Note how DATA is written before the Release moment and it's only using the Relaxed ordering.

When the main thread finally observes READY.load(Acquire) to be true, we break out of the while loop and finally read the value. Even though DATA.load(Relaxed) uses Relaxed, it is guaranteed to see the value. The write happened before the Release moment on the READY variable, and this load is happening after the corresponding Acquire on READY.

This is the only possible sequence for this code:

[Diagram: the single possible ordering of the stores and loads in this program]

Locking example

Mara's next example for Release / Acquire ordering is a very rudimentary lock that uses an AtomicBool to protect access to a String. Here's a slightly modified version of that example:

use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};
use std::sync::atomic::AtomicBool;

static mut DATA: String = String::new();
static LOCKED: AtomicBool = AtomicBool::new(false);

fn f() {
    if LOCKED.compare_exchange(false, true, Acquire, Relaxed).is_ok() {
        // Safety: We hold the exclusive lock, so nothing else is accessing DATA.
        unsafe { DATA.push('!') };
        LOCKED.store(false, Release);
    }
}

fn main() {
    std::thread::scope(|s| {
        for _ in 0..100 {
            s.spawn(f);
        }
    });
    println!("{}", unsafe { &DATA });
}

The two critical lines are:

  1. if LOCKED.compare_exchange(false, true, Acquire, Relaxed).is_ok() {
  2. LOCKED.store(false, Release);

The first line atomically compares LOCKED with false and, if it matches, sets it to true. The Acquire ordering applies to this read-modify-write when the exchange succeeds; the Relaxed ordering is used for the load when the comparison fails, which is fine because a failed attempt never touches DATA.

The second line sets LOCKED back to false using Release. As we've seen before, Release means that all writes that happened with or before this moment will be visible after an Acquire on this same variable.

Combining these two lines guarantees that the unsafe write to DATA and the Release write to LOCKED will both be finished before another thread observes LOCKED as false using the Acquire ordering.

DIY synchronization primitives

The next couple chapters in Mara's book give excellent walkthroughs of building a spin lock, a channel, and an atomically reference-counted pointer (Arc) using the memory ordering principles we've discussed here. I won't go through those here, but I would recommend reading through them on your own.

If you're interested in diving in further, you can also read about atomic fences. These effectively decompose the store/Release and load/Acquire operations we've been discussing into separate fence and store or load operations, which can be useful for applying a fence to multiple variables at once.

Conclusion

Memory ordering can be tricky to grok, but here are a few takeaways to remember:

  1. Treat SeqCst as a warning sign rather than a safe default.
  2. Use Relaxed when you only need individual operations on a single atomic variable to be atomic.
  3. Use Release on stores and Acquire on loads when one thread's writes need to be visible to another, such as for signaling or locking.


Thanks to Mara Bos for writing a clear and approachable explanation of Rust Atomics and Locks and to Simon Rassmussen for feedback on a draft of this post.


Discuss on r/rust, Lobsters, or Hacker News.

#rust