Master Advanced C++ Multithreading: Atomics, Memory Ordering, and Lock-Free Design
Master C++20 memory ordering, atomic operations, and lock-free patterns to write correct, high-performance concurrent code.
By the end of this page, you will understand the C++ memory model and why it matters, use std::atomic with explicit memory ordering to build correct concurrent code, recognize when acquire/release semantics are sufficient and when sequential consistency is required, implement a lock-free data structure, and use C++20's std::latch, std::barrier, and std::jthread for cleaner thread coordination.
What and Why
Most C++ engineers reach for a std::mutex and call it done. Mutexes are correct, composable, and easy to reason about, but they serialize execution: a contended lock bounces its cache line between cores and can put waiting threads to sleep through a kernel transition. When you need maximum throughput in a hot path, you need to understand what actually happens at the memory level.
The C++ memory model defines the rules by which operations in one thread become visible to other threads. Without explicit synchronization, the compiler and CPU are both free to reorder instructions in ways that are locally valid but globally surprising. This is not a bug in your toolchain — it is a deliberate contract that enables performance on modern out-of-order and multi-core processors.
std::atomic<T> is the primary tool for crossing that boundary safely. It guarantees that reads and writes to the atomic variable are indivisible, and — critically — it lets you choose how much ordering you need, paying only for what you use.
C++20 is required for std::jthread, std::latch, and std::barrier. Atomic memory ordering has been available since C++11.
Step by Step
The problem with naive shared state
#include <thread>
#include <iostream>
int counter = 0; // plain int, not atomic
int main() {
    std::thread t1([] { for (int i = 0; i < 100'000; ++i) ++counter; });
    std::thread t2([] { for (int i = 0; i < 100'000; ++i) ++counter; });
    t1.join();
    t2.join();
    std::cout << counter << '\n'; // undefined behavior; result varies
}
The increment ++counter compiles to a read-modify-write sequence. Two threads can interleave those steps, losing updates silently.
Fixing it with std::atomic — default ordering
#include <atomic>
#include <thread>
#include <iostream>
std::atomic<int> counter{0};
int main() {
    std::thread t1([] { for (int i = 0; i < 100'000; ++i) counter.fetch_add(1); });
    std::thread t2([] { for (int i = 0; i < 100'000; ++i) counter.fetch_add(1); });
    t1.join();
    t2.join();
    std::cout << counter.load() << '\n'; // always 200000
}
The default memory order is std::memory_order_seq_cst, the strongest and most expensive guarantee. All sequentially consistent operations take part in a single total order that every thread agrees on.
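The single total order is easiest to see in the classic store-buffering test. The sketch below is illustrative only; the helper names both_saw_zero and count_forbidden_outcomes are made up for this example. Each thread stores to one atomic and then loads the other. With seq_cst on all four operations, at least one thread must observe the other's store, so the outcome r1 == 0 && r2 == 0 cannot occur.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs one store-buffering round and reports whether
// both threads read 0. With seq_cst on all four operations this cannot happen.
bool both_saw_zero() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;
    std::thread a([&] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();  // join makes r1 and r2 safe to read here
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the round and counts how often the forbidden outcome showed up.
int count_forbidden_outcomes(int rounds) {
    assert(rounds > 0);
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_saw_zero()) ++forbidden;
    }
    return forbidden;
}
```

If the stores were weakened to release and the loads to acquire, both threads reading 0 would become a permitted outcome, which is why Dekker-style mutual exclusion needs seq_cst.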
Relaxing the ordering: acquire/release
For a producer–consumer flag, you do not need a global total order. You only need the producer's writes to be visible to the consumer once the flag is set. This is the acquire/release contract:
#include <atomic>
#include <thread>
#include <string>
#include <cassert>
std::atomic<bool> ready{false};
std::string data; // non-atomic payload
void producer() {
    data = "hello from producer"; // (1) write payload
    // release: everything written before this store
    // is visible to any thread that later acquires this flag
    ready.store(true, std::memory_order_release); // (2)
}
void consumer() {
    // acquire: reads and writes after this load cannot be reordered before it
    while (!ready.load(std::memory_order_acquire)) {} // (3) spin
    assert(data == "hello from producer"); // (4) safe: (1) happened-before (4)
}
int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}
The key insight: release and acquire on the same atomic variable create a happens-before edge. Everything written before the release store is guaranteed visible once the acquire load reads the value that store wrote. No global ordering is established, only a directed edge between these two threads, which is cheaper.
std::memory_order_relaxed provides no ordering at all — only atomicity. Use it for statistics counters where a slightly stale read is acceptable.
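As a minimal sketch of such a statistics counter (the function name count_events and its parameters are invented for illustration), several threads bump a shared counter with relaxed increments. No increment is ever lost, because atomicity is still guaranteed; only ordering relative to other memory is given up. The final read is exact because join() synchronizes with the end of each thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each thread records per_thread events on a shared counter.
long count_events(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<long> events{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&events, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // relaxed: atomic increment, no ordering with other memory
                events.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();  // join orders all increments before the read
    return events.load(std::memory_order_relaxed);
}
```

A reader that samples the counter while the workers are still running may see any intermediate value, which is acceptable for monitoring but not for publishing data.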
Lock-free stack
With acquire/release in hand, we can build a minimal lock-free Treiber stack:
#include <atomic>
#include <optional>
#include <utility>
template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};
public:
    void push(T val) {
        Node* node = new Node{std::move(val), nullptr};
        node->next = head_.load(std::memory_order_relaxed);
        // compare_exchange_weak may fail spuriously, so it belongs in a retry loop;
        // on failure it reloads the current head into node->next
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }
    std::optional<T> pop() {
        Node* old = head_.load(std::memory_order_acquire);
        while (old && !head_.compare_exchange_weak(old, old->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_acquire)) {}
        if (!old) return std::nullopt;
        T val = std::move(old->value);
        delete old; // safe only if no other thread still holds 'old'; see the note below
        return val;
    }
};
Note: This stack has the ABA problem, and its deletion is unsafe when several threads pop concurrently: one thread can still be reading old->next after another thread has deleted that node. Production lock-free code requires hazard pointers or epoch-based reclamation. This example demonstrates the pattern; always add memory reclamation before shipping.
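One crude way to sidestep both problems, at the cost of unbounded memory growth, is to never free a node while the stack is alive. The sketch below is an assumption-laden illustration, not a substitute for hazard pointers or epochs: the class DeferredFreeStack, its retired list, and the driver push_then_pop_sum are all hypothetical names. Because node addresses are never reused during the stack's lifetime, a stale pointer can neither dangle nor trigger ABA.

```cpp
#include <atomic>
#include <cassert>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical variant: popped nodes are parked on a retired list and
// freed only in the destructor.
template <typename T>
class DeferredFreeStack {
    struct Node {
        T value;
        Node* next;          // link inside the live stack
        Node* retired_next;  // link inside the retired list
    };
    std::atomic<Node*> head_{nullptr};
    std::atomic<Node*> retired_{nullptr};

    void retire(Node* n) {  // push-only list, so it has no ABA exposure
        n->retired_next = retired_.load(std::memory_order_relaxed);
        while (!retired_.compare_exchange_weak(n->retired_next, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed)) {}
    }

public:
    DeferredFreeStack() = default;
    DeferredFreeStack(const DeferredFreeStack&) = delete;
    DeferredFreeStack& operator=(const DeferredFreeStack&) = delete;

    // Assumes every thread that used the stack has already been joined.
    ~DeferredFreeStack() {
        for (Node* n = head_.load(); n != nullptr;) {
            Node* next = n->next;
            delete n;
            n = next;
        }
        for (Node* n = retired_.load(); n != nullptr;) {
            Node* next = n->retired_next;
            delete n;
            n = next;
        }
    }

    void push(T val) {
        Node* node = new Node{std::move(val), nullptr, nullptr};
        node->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }

    std::optional<T> pop() {
        Node* old = head_.load(std::memory_order_acquire);
        while (old && !head_.compare_exchange_weak(old, old->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_acquire)) {}
        if (!old) return std::nullopt;
        std::optional<T> out{std::move(old->value)};
        retire(old);  // parked, not deleted: stale readers of old->next stay valid
        return out;
    }
};

// Hypothetical driver: concurrent pushes, then concurrent pops, summing the values.
long push_then_pop_sum(int threads, int per_thread) {
    DeferredFreeStack<int> stack;
    std::atomic<long> sum{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&stack, per_thread] {
            for (int i = 1; i <= per_thread; ++i) stack.push(i);
        });
    }
    for (auto& th : pool) th.join();
    pool.clear();
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&stack, &sum] {
            while (auto v = stack.pop()) sum.fetch_add(*v, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    assert(!stack.pop().has_value());  // every pushed value was popped exactly once
    return sum.load();
}
```

This trade is only reasonable for short-lived or bounded workloads; a long-running service still needs real reclamation.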
C++20 synchronization primitives
std::latch is a single-use counter for fan-in: threads decrement it, and every thread blocked in wait() unblocks when it reaches zero.
#include <chrono>
#include <latch>
#include <thread>
#include <vector>
#include <iostream>
int main() {
    constexpr int N = 4;
    std::latch all_ready{N};
    std::vector<std::jthread> workers;
    for (int i = 0; i < N; ++i) {
        workers.emplace_back([&, i] {
            // simulate setup work
            std::this_thread::sleep_for(std::chrono::milliseconds(i * 10));
            std::cout << "worker " << i << " ready\n";
            all_ready.count_down(); // decrement; the last one unblocks all waiters
        });
    }
    all_ready.wait(); // blocks until count reaches zero
    std::cout << "all workers ready, starting main task\n";
} // jthread destructor joins automatically
std::barrier is reusable: after all threads arrive, it resets and can be used again. It also accepts a completion function that exactly one of the participating threads runs at each sync point, which is ideal for phase-based pipelines.
Common Patterns
Double-checked locking with atomics (C++11+)
#include <atomic>
#include <mutex>
#include <memory>
class Config {
    static std::atomic<Config*> instance_;
    static std::mutex init_mutex_;
public:
    static Config* get() {
        Config* p = instance_.load(std::memory_order_acquire);
        if (!p) {
            std::lock_guard lock(init_mutex_);
            p = instance_.load(std::memory_order_relaxed);
            if (!p) {
                p = new Config();
                instance_.store(p, std::memory_order_release);
            }
        }
        return p;
    }
};
std::atomic<Config*> Config::instance_{nullptr};
std::mutex Config::init_mutex_;
Cooperative cancellation with std::stop_token (C++20)
#include <chrono>
#include <thread>
#include <stop_token>
#include <iostream>
int main() {
    std::jthread worker([](std::stop_token token) {
        while (!token.stop_requested()) {
            // do incremental work
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
        std::cout << "worker cancelled cleanly\n";
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    worker.request_stop(); // signals the stop_token; jthread joins on destruction
}
What Can Go Wrong
Using memory_order_relaxed for a flag. If the consumer spins on a relaxed load of your ready flag, the loop will typically still terminate, but the load no longer synchronizes with the producer's release store. Reading the payload afterwards is a data race, and the compiler or CPU may reorder the payload read ahead of the flag check. Always use at least memory_order_acquire on the consumer side of a flag.
Mixing std::atomic access with plain access. If one thread accesses an object atomically while another thread reads or writes the same object through a plain, non-atomic path, the result is a data race and therefore undefined behavior. Every concurrent access to a shared variable must go through the same atomic object.
Assuming compare_exchange_strong is always better than weak. In a retry loop, compare_exchange_weak is preferred because on architectures with LL/SC instructions (ARM, RISC-V), the strong variant emits its own internal retry loop, so you end up with a loop nested inside your loop. Use strong when a spurious failure would be semantically incorrect, which is typically the case for a single attempt outside a loop.
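A single-attempt claim is the typical case where strong is the right choice. In this sketch (count_winners is a hypothetical name), several threads race to flip a flag exactly once. With compare_exchange_weak, a spurious failure could leave the flag unclaimed even though no other thread took it; with strong, a failure always means another thread won.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: every thread makes one attempt to claim the flag;
// returns how many attempts succeeded.
int count_winners(int threads) {
    assert(threads > 0);
    std::atomic<bool> claimed{false};
    std::atomic<int> winners{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&claimed, &winners] {
            bool expected = false;
            // One attempt, no loop: failure must mean "someone else claimed it".
            if (claimed.compare_exchange_strong(expected, true,
                                                std::memory_order_acq_rel,
                                                std::memory_order_acquire)) {
                winners.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return winners.load();
}
```

Exactly one thread wins regardless of how many participate.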
Over-synchronizing with seq_cst. On x86, a sequentially consistent store compiles to XCHG (which is implicitly locked) or to MOV followed by MFENCE, and either form makes the core wait for its store buffer to drain; seq_cst loads compile to plain MOVs. Profile before reaching for the default; acquire/release is almost always sufficient for producer–consumer patterns.
Quick Reference
| Need | Memory Order |
|---|---|
| Statistic counter, no ordering needed | relaxed |
| Publish data to a consumer | producer: release, consumer: acquire |
| Read-modify-write in a data structure | acq_rel |
| Global total order required | seq_cst (default) |
| One-shot fan-in | std::latch |
| Reusable phase barrier | std::barrier |
| Cooperative cancellation | std::jthread + std::stop_token |
| Lock-free CAS loop | compare_exchange_weak |
- Happens-before is established by a release store followed by an acquire load on the same atomic that reads the stored value.
- std::atomic<T> requires T to be trivially copyable.
- On x86, acquire/release loads and stores are free (the hardware is TSO); they only restrict compiler reordering. On ARM they emit barrier or load-acquire/store-release instructions.
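The trivially-copyable requirement can be checked at compile time. The sketch below uses a made-up Point struct and helper names (replace_point, demo_sum) to show a small aggregate that qualifies, alongside std::string, which does not.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <type_traits>

struct Point {  // hypothetical payload type
    int x;
    int y;
};

static_assert(std::is_trivially_copyable_v<Point>,
              "Point can be wrapped in std::atomic");
static_assert(!std::is_trivially_copyable_v<std::string>,
              "std::string cannot be wrapped in std::atomic");

// Installs a new point and returns the previous one in a single atomic step.
Point replace_point(std::atomic<Point>& slot, Point next) {
    return slot.exchange(next, std::memory_order_acq_rel);
}

// Small demonstration: swap one point for another and add up the new fields.
int demo_sum() {
    std::atomic<Point> slot{Point{1, 2}};
    Point old = replace_point(slot, Point{3, 4});
    assert(old.x == 1 && old.y == 2);
    Point now = slot.load();
    return now.x + now.y;
}
```

Whether std::atomic<Point> is actually lock-free depends on the platform; slot.is_lock_free() reports it at run time.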
What's Next
- Coroutines and async I/O — suspend-and-resume as an alternative to thread-per-task models
- Memory model deep dive — how value semantics and move interact with concurrent ownership
- Lambda captures in concurrent code — lifetime hazards when lambdas outlive their enclosing scope
- Lock-free memory reclamation — hazard pointers and epoch-based reclamation to make the lock-free stack production-ready