[root@memzero]# ls

2023/09/01 - CAS, ABA and LL/SC

If this zoo of abbreviations does not mean anything to you then "fly you fools" as long as you still can :^)

Jokes aside, I recently had the opportunity to ramp up a new co-worker on the different semantics of atomic instructions (compare-and-swap (CAS) vs load-linked / store-conditional (LL/SC)), lock-free programming and the ABA problem.

In the realm of lock-free programming and the ABA problem, I shared a story from some years ago when I was debugging a random memory corruption in our multi-threaded simulator. Sharing that story actually gave me the idea to write this post as a reference for potential new co-workers or all the internet people <3.

After many hours of debugging, the random memory corruption, which obviously did not occur very often, turned out to be a bug in a custom memory allocator. To be more precise, it was in a primitive used by the allocator to manage a list of free memory blocks. The allocator was used from multiple threads concurrently and hence pushing and popping free memory blocks had to be thread-safe.

The operations to manage the free block list were implemented with a lock-free algorithm, which we will visit later when analyzing the bug. Before that, let us revisit the fundamental semantics of atomic compare-and-swap (CAS) operations, so that everyone has the same understanding. If you are already familiar with CAS, feel free to skip ahead to the exploration of the lock-free memory block primitive.

compare-and-swap (CAS)

The following pseudo code gives a good mental model for an atomic CAS operation. The described compare-and-swap is executed atomically without giving any other execution stream the chance to interfere in the middle of this operation.

// bool cas(T* val, T* expected, T desired);
//
// Does the following operation atomically.
//
bool is_expected = (*val == *expected);
if (is_expected) {
    *val = desired;
} else {
    *expected = *val;
}
return is_expected;

Below are some concrete examples using the C++ std::atomic type.

// bool std::atomic<T>::compare_exchange_strong(T& expected, T desired);
//
// Note: the _weak variant used later in the retry loops may additionally
// fail spuriously (return false even though *this == expected), hence the
// _strong variant is used for this assert based example.

std::atomic<int> val(42);
int expected = 43;
// val != expected -> expected = val
assert(val.compare_exchange_strong(expected, 1337) == false);
// val == expected -> val = desired
assert(val.compare_exchange_strong(expected, 1337) == true);
assert(val.load() == 1337);

When talking about atomics we also need to talk about memory ordering.

Memory ordering concerns the ordering of memory accesses, both atomic and non-atomic, in relation to atomic operations. It also determines how side-effects, such as unrelated writes, are observed across threads. Essentially, memory ordering sets the rules under which optimizers may reorder instructions, and it directs compilers to insert memory barrier instructions to uphold those guarantees in cases where the underlying hardware's memory model is relatively weak and subject to instruction reordering.

For the purpose of our discussion, we'll focus on the concept of sequentially consistent ordering, as it is the most intuitive and also the default in C++ when utilizing std::atomic. When dealing with atomic loads and stores, it provides the following guarantees:

  1. atomic store: On the current thread no read / write preceding the store can be reordered after the store. Other threads will observe all writes before they observe the atomic store.
  2. atomic load: On the current thread no read / write following the load can be reordered before the load. The current thread observes all writes done on other threads, which happen before the store to the atomic.

For the example below that means that on Thread 1 the M = X cannot be reordered after the A.store(1). When Thread 2 sees the atomic variable change, it is guaranteed that it observes the write to M on Thread 1, and hence print(M) would give X.

Thread 1        Thread 2

M = X           if (A.load() == 1)
A.store(1)        print(M)
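
A minimal C++ sketch of the same pattern (using std::thread, and 42 in place of X) could look like the following; with the default sequentially consistent ordering the assert in t2 can never fire:

#include <atomic>
#include <cassert>
#include <thread>

int M = 0;                // Plain, non-atomic data.
std::atomic<int> A(0);    // Atomic flag, seq_cst by default.

int main() {
  std::thread t1([]() {
    M = 42;      // Write the payload ..
    A.store(1);  // .. then publish it via the atomic store.
  });

  std::thread t2([]() {
    if (A.load() == 1) {  // If the atomic store is observed ..
      assert(M == 42);    // .. the write to M is visible as well.
    }
  });

  t1.join();
  t2.join();
  return 0;
}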

An atomic CAS operation performs an atomic load and store, which means that no read or write can be reordered across the CAS operation in either direction.

With that we have just scratched the surface of the rabbit hole, but it is sufficient for the remaining discussion. The 2017 CppCon talk from Fedor Pikus "C++ atomics, from basic to advanced. What do they really do?" goes into more depth and is highly recommended. I also added some more references at the end of the post.

lock-free free memory block primitive

The code below captures the essence of the primitive used to manage the free memory blocks, with the goal of being thread-safe. Essentially, it is a linked list of BlockStack::Block objects where the head pointer to the beginning of the list is swapped atomically.

Details such as how the list is initially filled up or how it grows dynamically at runtime are left out here, as is any sort of error handling, since none of this is relevant for the discussion.

struct BlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    do {
      blk->next = m_head;
    } while (!m_head.compare_exchange_weak(blk->next, blk));
  }

  Block* pop() {
    Block* blk;
    do {
      blk = m_head;
      // eg, if (blk == nullptr) expand with new free Blocks.
    } while (!m_head.compare_exchange_weak(blk, blk->next));
    return blk;
  }

private:
  std::atomic<Block*> m_head{nullptr};
};

The std::atomic<T>::compare_exchange_weak overload used here has an additional argument for the memory ordering, defaulting to sequentially consistent. This ensures in the push() method that when another thread observes the new head pointer, the write to blk->next before the CAS is visible as well.
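
For reference, this is what the CAS in push() looks like with the ordering argument spelled out explicitly (equivalent to the call above, as sequentially consistent is the default):

// bool std::atomic<T>::compare_exchange_weak(
//     T& expected, T desired,
//     std::memory_order order = std::memory_order_seq_cst);

m_head.compare_exchange_weak(blk->next, blk, std::memory_order_seq_cst);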

Taking a first look at the push() and pop() methods, the implementation looks quite sound.

push() inserts the blk object at the beginning of the list. That means the newly inserted block needs to point to the current list head and the list head pointer must be updated to point to the newly inserted block. The head pointer is swapped atomically, which only succeeds if no one else has updated the head pointer in the meantime and it still points to blk->next. In case the CAS was not successful, the insert operation is retried in a loop until the block is inserted.

pop() on the other hand removes the block at the beginning of the list and sets the second block as the new beginning of the list. This is done by updating the head pointer to point to the second block entry in the list. Similar to push, the current head is read once and the head pointer swapped atomically. This operation is again repeated in a loop until the CAS is successful.

However, the pop() method has a subtle difference compared to the push() method: it dereferences the current head block, which makes it vulnerable to the ABA problem. We will see in a moment what the ABA problem is and why it is called that, but first let us slightly rewrite the pop() method to make it easier to spot the bug.

1  Block* pop() {
2    Block* blk;
3    Block* next;
4    do {
5      blk  = m_head;
6      next = blk->next;  // Deref "our" head block.
7    } while (!m_head.compare_exchange_weak(blk, next));
8    return blk;
9  }

Now let us assume we have two threads Thread 1 and Thread 2. Both threads access the shared BlockStack instance, which currently contains three free blocks (A), (B) and (C).

Thread 1 calls pop() and gets interrupted between lines 6 and 7, right after it has read blk = (A) and next = (B).

In the meantime Thread 2 performs the following actions:

  1. pop() which returns block (A), the head now points to (B).
  2. pop() which returns block (B), the head now points to (C).
  3. push((A)) which re-inserts block (A) at the head of the list.

After this, Thread 2 owns block (B) and the BlockStack instance contains the free blocks (A) and (C).

Next, Thread 1 is resumed and continues with the atomic swap of m_head. The swap will succeed, as m_head (A) == blk (A) holds, however this operation will re-insert block (B) as a free block, since Thread 1 read next = (B) before being interrupted.

This leaves us with the following state after Thread 1 returns from the pop operation: the list head m_head points to block (B) again.

We can easily see that this state is catastrophic, as the next pop operation will return block (B) which is currently used and owned by Thread 2. Additionally, block (C) and the rest of the list is leaked.

The example above describes the ABA problem: the "inner" state of the block list is altered (block (B) is removed), while the observable "outer" state is unchanged (block (A) is removed and re-inserted again), hence ABA.

The following program cas-aba-issue.cc demonstrates the issue described above and can be used for further experiments. It can be compiled as follows.

clang++ -o cas-aba-issue cas-aba-issue.cc -latomic

So, how can the current implementation be fixed?

The obvious approach would be to rewrite the push() and pop() methods to use a lock.
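
A minimal lock-based sketch (the name LockedBlockStack and the use of std::mutex are just for illustration, not taken from the original allocator) could look like this:

#include <mutex>

struct LockedBlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    std::lock_guard<std::mutex> lock(m_lock);
    blk->next = m_head;
    m_head = blk;
  }

  Block* pop() {
    std::lock_guard<std::mutex> lock(m_lock);
    Block* blk = m_head;
    if (blk != nullptr) {
      m_head = blk->next;
    }
    return blk;
  }

private:
  std::mutex m_lock;
  Block* m_head{nullptr};
};

With the lock held, pop() can dereference the head block without any ABA concerns, at the cost of serializing all push() and pop() calls.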

However, the point of this post is to look at the lock-free approaches and learn about the different atomic instructions. Therefore, we will discuss two approaches which require specific instructions and hence are only applicable if the underlying hardware supports those instructions.

  1. Generational pointers using double-word CAS.
  2. Exclusive accesses using load-linked store-conditional.

Generational pointers

In this approach the raw head pointer is replaced with a head object that holds the raw pointer as well as a counter (the generation). The idea is that whenever the head object is swapped, the generation is increased. In practice, this makes it resilient against the ambiguous raw pointer of the ABA problem: in the example discussed above, the head objects would no longer compare equal, due to the different generation.

Theoretically, a case can be constructed where the ABA problem still occurs if the counter overflows and wraps around to the exact same value, but that is not really relevant in practice (a 64-bit counter incremented once per nanosecond takes roughly 585 years to wrap around).

For the following discussion we assume that we are in an x86_64 environment.

The GenerationPtr will be our new head object, holding the raw pointer as well as the generation counter.

struct alignas(16) GenerationPtr {  // 16 byte alignment required for cmpxchg16b.
  Block*   ptr{nullptr};
  uint64_t gen{0};
};

To be able to atomically compare and swap such an object, the host must support a double-word CAS, in this specific case a 16 byte CAS. Most x86_64 hosts support the cmpxchg16b instruction, which, combined with the lock prefix, atomically compares and swaps 16 bytes.

We create a helper function for the 16 byte CAS based on the legacy __sync_bool_compare_and_swap builtin, as both gcc and clang directly emit cmpxchg16b instructions instead of library calls when the -mcx16 target option is enabled. The choice of builtin is irrelevant for this discussion, it just makes looking at the disassembly easier.

// bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval);

template <typename T>
bool cmpxchg16b(T* dest, T oldval, T newval) {
  static_assert(sizeof(T)  == 16, "Required for cmpxchg16b");
  static_assert(alignof(T) == 16, "Required for cmpxchg16b");

  const auto as_u128 = [](T* ptr) {
    return reinterpret_cast<unsigned __int128*>(ptr);
  };
  return __sync_bool_compare_and_swap(as_u128(dest), *as_u128(&oldval),
                                      *as_u128(&newval));
}

The codegen with clang 15.0.7 for this function gives the following result on my machine.

; generated by clang version 15.0.7 with -mcx16

00000000000016e0 <bool cmpxchg16b<..>(BlockStack::GenerationPtr* dest,
                                      BlockStack::GenerationPtr oldval,
                                      BlockStack::GenerationPtr newval)>:
  ; Per SystemV ABI
  ;   rdi = dest ptr
  ;   rsi = oldval.ptr
  ;   rdx = oldval.gen
  ;   rcx = newval.ptr
  ;   r8  = newval.gen
  16e0:       push   rbx
  16e1:       mov    rbx,rcx    ; shuffle args for cmpxchg16b
  16e4:       mov    rax,rsi    ; shuffle args for cmpxchg16b
  16e7:       mov    rcx,r8     ; shuffle args for cmpxchg16b
  ; CMPXCHG16B
  ;   compare rdx:rax (oldval) with [rdi] (dest*)
  ;   if equal then load rcx:rbx (newval) into [rdi] (dest*)
  16ea:       lock cmpxchg16b OWORD PTR [rdi]
  16ef:       sete   al
  16f2:       pop    rbx
  16f3:       ret

With that we can update the type of BlockStack::m_head to GenerationPtr and adapt the push() and pop() methods.

 1  // BlockStack ..
 2
 3  void push(Block* blk) {
 4    GenerationPtr old_head, new_head;
 5    do {
 6      old_head = m_head;
 7      blk->next = old_head.ptr;
 8      new_head.ptr = blk;
 9      new_head.gen = old_head.gen + 1;
10    } while (!cmpxchg16b(&m_head, old_head, new_head));
11  }
12
13  Block* pop() {
14    GenerationPtr old_head, new_head;
15    do {
16      old_head = m_head;
17      new_head.ptr = old_head.ptr->next;
18      new_head.gen = old_head.gen + 1;
19    } while (!cmpxchg16b(&m_head, old_head, new_head));
20    return old_head.ptr;
21  }

For the purpose of this discussion we assume that the __sync_bool_compare_and_swap builtin acts as a full barrier. In practice the __sync builtins should not be used anymore; the newer __atomic builtins, which allow specifying the memory ordering, should be used instead.
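
As a sketch, a 16 byte CAS based on the __atomic builtins could look like the following (not used in the remaining examples; whether this compiles to an inline cmpxchg16b or a libatomic call depends on the compiler version and the -mcx16 option):

template <typename T>
bool cmpxchg16b_v2(T* dest, T oldval, T newval) {
  static_assert(sizeof(T)  == 16, "Required for cmpxchg16b");
  static_assert(alignof(T) == 16, "Required for cmpxchg16b");

  const auto as_u128 = [](T* ptr) {
    return reinterpret_cast<unsigned __int128*>(ptr);
  };
  // Strong CAS with sequentially consistent ordering on success and failure.
  return __atomic_compare_exchange_n(as_u128(dest), as_u128(&oldval),
                                     *as_u128(&newval), /*weak=*/false,
                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}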

A note about the loads of m_head in lines 6 and 16: these loads are not atomic, so the value of m_head could be updated by another thread in the middle of the load. However, if we read an inconsistent m_head, the CAS will fail and we simply retry the operation.

The following program cas-ptr-generation.cc provides the full example. When run, it demonstrates how the generation pointer prevents the ABA problem. It can be compiled as follows.

clang++ -o cas-ptr-generation cas-ptr-generation.cc -latomic -mcx16

Load linked store conditional

So far we looked at atomic CAS instructions, which have read-modify-write semantics and are typically found in CISC ISAs. RISC ISAs on the other hand usually provide load-linked (LL) / store-conditional (SC) instruction pairs to implement atomic accesses. In contrast to the single compare-and-swap instruction, LL/SC are two distinct instructions. The LL instruction loads the current value and marks the memory region for exclusive access, and the SC instruction only succeeds and updates the memory location if the location is still marked as exclusive at the time of the store-conditional. If there was an update to the memory location between the LL/SC pair, the exclusive marker is removed.

The pseudo code below gives an example of how atomic operations are typically implemented with LL/SC instructions. The example shows an atomic increment of a location in memory.

/// Atomically increment *MEM and return OLD value.
uint64_t atomic_fetch_inc(uint64_t* mem) {
    uint64_t old;
    bool ok;

    do {
        old = ll(mem);           // Load current value and mark for ex access.
        ok  = sc(mem, old + 1);  // Try to store incremented value.
    } while (!ok);

    return old;
}

For the remaining discussion we will focus on the ARM architecture, and use the A64 instruction set as I have some ARM silicon at home to run the examples on.

We can take a look into the ARM ARM (Architecture Reference Manual) for the A-profile to find the description of the LL/SC instructions in chapter B2.9 Synchronization and semaphores, where Load-Exclusive corresponds to load-linked and Store-Exclusive to store-conditional.

The model for the use of a Load-Exclusive/Store-Exclusive instruction pair accessing a non-aborting memory address x is:

  • The Load-Exclusive instruction reads a value from memory address x.
  • The corresponding Store-Exclusive instruction succeeds in writing back to memory address x only if no other observer, process, or thread has performed a more recent store to address x. The Store-Exclusive instruction returns a status bit that indicates whether the Write Memory succeeded.

The A64 instruction set supports multiple LL/SC instruction pairs for different operand sizes and with different memory ordering semantics. We will make use of the LDXR and STXR instructions, the exclusive load register and exclusive store register instructions without any memory ordering semantics, and combine them with explicit memory barriers rather than using the instruction pairs with release-acquire memory ordering semantics, as this post does not aim to discuss the different memory orderings, but to give an overview of the different atomic instructions.

Table B2-3 Synchronization primitives and associated instruction, A64 instruction set in the ARM ARM has a full overview of all the available LL/SC instruction pairs in the A64 instruction set.

The following code gives a starting point to play with the LL/SC instructions on ARM, providing inline assembly wrapper functions to emit the LDXR and STXR instructions.

#include <cstdint>
#include <cstdio>
#include <thread>

/// Load exclusive register.
inline uint64_t ldxr(uint64_t* addr) {
  uint64_t ret;
  asm volatile("ldxr %0, %1" : "=r"(ret) : "Q"(*addr) : "memory");
  return ret;
}

/// Store exclusive register.
inline bool stxr(uint64_t* addr, uint64_t val) {
  uint32_t ret;
  asm volatile("stxr %w0, %2, %1" : "=&r"(ret), "=Q"(*addr) : "r"(val) : "memory");
  return ret == 0;
}

int main() {
  uint64_t mem = 42;

  auto T1 = std::thread([&mem]() {
    // Write to exclusive location (does clear exclusive monitor).
    mem = 2222;
    // Full memory barrier.
    __sync_synchronize();
  });

  uint64_t old = ldxr(&mem);

  // Some artificial delay w/o an explicit context switch (eg syscall) as that
  // would clear the exclusive monitor, though it can still be interrupted by
  // the scheduler.
  // Delay is "tuned" for my ARM device.
  for (int i = 0; i < (1 << 14); ++i) {
    asm volatile("nop");
  }

  // Full memory barrier.
  __sync_synchronize();

  bool ok = stxr(&mem, 1111);
  printf("old: %lu -> mem: %lu | ok: %d\n", old, mem, ok);

  T1.join();
  return ok ? 0 : 1;
}

The full source code with additional comments is available here a64-basic-llsc.cc. It can be compiled and run natively on an ARM system as shown below. We run the program in a loop until we hit the case where T1 wins the race and does the memory store in between the LDXR/STXR pair on the main thread.

> g++ -o a64-basic-llsc a64-basic-llsc.cc -lpthread
> while ./a64-basic-llsc; do : ; done
...
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 1111 | ok: 1
old: 42 -> mem: 2222 | ok: 0

With those basics covered we can also implement the LL/SC version of the BlockStack. The push() and pop() methods now leverage the exclusive load and store instructions to atomically update the m_head pointer.

struct BlockStack {
  struct Block {
    Block* next{nullptr};
    // ..
  };

  void push(Block* blk) {
    do {
      blk->next = load_head_ex();
    } while (!store_head_ex(blk));
  }

  Block* pop() {
    Block *blk, *next;
    do {
      blk  = load_head_ex();
      next = blk->next;
    } while (!store_head_ex(next));
    return blk;
  }

private:
  // Assume 64bit ptr size for ldxr/stxr instructions.
  static_assert(sizeof(void*) == sizeof(uint64_t), "Ptr size mismatch!");

  Block* load_head_ex() const {
    Block* blk = (Block*)ldxr((uint64_t*)&m_head);
    mem_barrier();  // Full memory barrier (eg __sync_synchronize()).
    return blk;
  }

  bool store_head_ex(Block* blk) {
    const bool success = stxr((uint64_t*)&m_head, (uint64_t)blk);
    mem_barrier();
    return success;
  }

  Block* m_head{nullptr};
};

The full version of this implementation with additional comments as explanation is available here a64-llsc.cc. It can be compiled and run as follows on some ARM system. Depending on the timing, the example may have to be run multiple times until it passes the assertions, as all the artificial delays hacked into the implementation are tuned to my ARM device.

g++ -o a64-llsc a64-llsc.cc -lpthread
./a64-llsc

In fact, passing the assertions in the example above does not guarantee that we drove the two threads and the BlockStack object into exactly the state that triggered the ABA problem in the CAS example. In the CAS example we could use a sleep() to control the exact order of events, but here we can use neither sleep()s nor atomics to synchronize the threads' execution, as any of those would most certainly clear the exclusive monitor set by the ldxr instruction that reads the head pointer. All we can do is add some nop delays, hope that the exclusive monitor is not cleared by any of these artificial hacks (or by an interruption from the scheduler), pray that the order of events is exactly as we want it to be, and feel happy when the assertions pass.

The sources of the example are still a good vehicle to explain how the stronger guarantees of the LL/SC instructions prevent the ABA problem. In practice, however, it is really hard to prove that the program ran exactly as we wanted it to :^)

The END

Personally, I would say that in 99% of the cases, nobody will have to write code as we did in this post. Of course, there are always exceptions.

Always prefer to write portable and generic algorithms, and start with a lock rather than a lock-free algorithm. Most importantly, measure and profile your code, do not optimize prematurely, and do not use "cool" stuff just for the sake of using it. Use the right tool for the job and seek the simplest solution.

However, we specifically wanted to write code like this to introduce CAS vs LL/SC atomics and the ABA problem in multi-threaded lock-free algorithms.

Any corrections, improvements or comments are happily welcome; you can find my mail on memzero.de

Appendix: QEMU ARM user emulation

Unfortunately, we cannot use the qemu-aarch64 userspace emulator for the experiments here, because for the ARM guest target QEMU implements the load / store exclusive instructions not with the full architectural semantics, but with a simplification which seems sufficient for typical guest images.

When QEMU emulates a load exclusive instruction, it saves the guest address as well as the data loaded from the guest memory in the cpu state CPUARMState::{exclusive_addr, exclusive_val}. On a store exclusive, the guest address of the store is compared with the saved exclusive_addr, and the current data at that address with the saved exclusive_val. Iff both compare equal, the store exclusive is carried out successfully.

The following pseudo code, taken from a comment in QEMU:translate-a64.c, shows how QEMU emulates store exclusive instructions, where env refers to the cpu state.

///  gen_store_exclusive(..)
if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
    [addr] = {Rt};
    {Rd} = 0;
} else {
    {Rd} = 1;
}
env->exclusive_addr = -1;

The exclusive_val and exclusive_addr fields act as the exclusive monitor. However, they are only maintained by the load / store exclusive instructions and not by normal store instructions, and this is where QEMU weakens the semantics of the exclusive instructions. A sequence of normal writes can change the value at the exclusive address and later restore it, without the "exclusive monitor" in the QEMU cpu state ever noticing.
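
Sketched in terms of the ll() / sc() pseudo code from above, the weakened semantics admit the following sequence, which real hardware would reject:

// [addr] currently holds value A.
old = ll(addr);    // QEMU: exclusive_addr = addr, exclusive_val = A.

// Another thread, using normal stores (the emulated exclusive state is
// not touched by these):
[addr] = B;
[addr] = A;        // Original value restored.

ok = sc(addr, C);  // Succeeds: exclusive_addr == addr and
                   // exclusive_val == [addr] (both hold A).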

With everything learned about the ABA problem in this post, one can now understand why the QEMU user emulator is a bad fit for our experiments: it will still run into the ABA problem even with the a64-llsc.cc example using load / store exclusive instructions.

Abbreviations

Source files

References