Master Advanced C++ Multithreading: Atomics, Memory Ordering, and Lock-Free Design

By the end of this page, you will understand the C++ memory model and why it matters, use std::atomic with explicit memory ordering to build correct concurrent code, recognize when acquire/release semantics are sufficient versus sequentially consistent, implement a lock-free data structure, and leverage C++20's std::latch, std::barrier, and std::jthread for cleaner thread coordination.

What and Why

Most C++ engineers reach for a std::mutex and call it done. Mutexes are correct, composable, and easy to reason about, but under contention they serialize execution: the lock's cache line bounces between cores, and blocked threads may enter the kernel to sleep and be woken. When you need maximum throughput in a hot path, you need to understand what actually happens at the memory level.

The C++ memory model defines the rules by which operations in one thread become visible to other threads. Without explicit synchronization, the compiler and CPU are both free to reorder instructions in ways that are locally valid but globally surprising. This is not a bug in your toolchain — it is a deliberate contract that enables performance on modern out-of-order and multi-core processors.

std::atomic<T> is the primary tool for crossing that boundary safely. It guarantees that reads and writes to the atomic variable are indivisible, and — critically — it lets you choose how much ordering you need, paying only for what you use.

C++20 is required for std::jthread, std::latch, and std::barrier. Atomic memory ordering has been available since C++11.

Step by Step

The problem with naive shared state

cpp

#include <thread>
#include <iostream>

int counter = 0; // plain int, not atomic

int main() {
    std::thread t1([] { for (int i = 0; i < 100'000; ++i) ++counter; });
    std::thread t2([] { for (int i = 0; i < 100'000; ++i) ++counter; });
    t1.join();
    t2.join();
    std::cout << counter << '\n'; // undefined behavior; result varies
}

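The increment ++counter is a read-modify-write sequence: load, add, store. Two threads can interleave those steps and silently lose updates. Formally, the unsynchronized concurrent writes are a data race, so the program has undefined behavior and the compiler may assume the race never happens.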

Fixing it with `std::atomic` — default ordering

cpp

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> counter{0};

int main() {
    std::thread t1([] { for (int i = 0; i < 100'000; ++i) counter.fetch_add(1); });
    std::thread t2([] { for (int i = 0; i < 100'000; ++i) counter.fetch_add(1); });
    t1.join();
    t2.join();
    std::cout << counter.load() << '\n'; // always 200000
}

The default memory order is std::memory_order_seq_cst, the strongest and typically the most expensive guarantee. All sequentially consistent operations take part in a single total order that every thread agrees on.
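
To make the "single total order" concrete, here is a minimal sketch of the classic store-buffering test. The helper names `both_saw_zero` and `count_forbidden_outcomes` are illustrative, not part of any library. With the default seq_cst ordering, at least one thread must observe the other's store, so the outcome where both threads read 0 cannot occur.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs one round of the store-buffering test and reports whether both
// threads read 0, an outcome that seq_cst forbids.
bool both_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1);     // seq_cst by default
        r1 = y.load();  // seq_cst by default
    });
    std::thread b([&] {
        y.store(1);
        r2 = x.load();
    });
    a.join();
    b.join();  // join() makes r1 and r2 safe to read here

    assert(r1 != -1 && r2 != -1);  // both threads ran their loads
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: repeat the round and count forbidden outcomes.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_saw_zero()) ++forbidden;
    }
    return forbidden;
}
```

If the stores were weakened to release and the loads to acquire, the (0, 0) outcome would be permitted, and x86 hardware can actually produce it because each store may still sit in a store buffer when the other thread's load executes. That is the kind of algorithm where seq_cst earns its cost.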

Relaxing the ordering: acquire/release

For a producer–consumer flag, you do not need a global total order. You only need the producer's writes to be visible to the consumer once the flag is set. This is the acquire/release contract:

cpp

#include <atomic>
#include <thread>
#include <string>
#include <cassert>

std::atomic<bool> ready{false};
std::string data; // non-atomic payload

void producer() {
    data = "hello from producer"; // (1) write payload
    // release: everything written before this store
    // is visible to any thread that later acquires this flag
    ready.store(true, std::memory_order_release); // (2)
}

void consumer() {
    // acquire: reads and writes after this load cannot be reordered before it
    while (!ready.load(std::memory_order_acquire)) {} // (3) spin
    assert(data == "hello from producer"); // (4) safe: (1) happened-before (4)
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join();
    t2.join();
}

The key insight: a release store and an acquire load on the same atomic variable create a happens-before edge once the load reads the value the store wrote. Everything written before the release store is then guaranteed visible after the acquire load. No global ordering is established, just a directed edge between these two threads, which is cheaper.

std::memory_order_relaxed guarantees atomicity and a consistent modification order for that one variable, but imposes no ordering on surrounding reads and writes. Use it for statistics counters where a slightly stale read is acceptable.
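
A minimal sketch of such a counter follows. The name `count_events` and its parameters are illustrative. Increments use relaxed ordering because no other data is published through the counter; the final read is exact only because `join()` synchronizes with each finished worker.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers bumps a shared counter
// `per_thread` times using relaxed increments.
long count_events(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    std::atomic<long> events{0};

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&events, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic, but imposes no ordering on other memory.
                events.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join() synchronizes with each worker

    // Exact here because every worker has finished.
    return events.load(std::memory_order_relaxed);
}
```

A relaxed read taken while the workers are still running would return some valid intermediate count, which is usually all a metrics endpoint needs.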

Lock-free stack

Armed with acquire/release, here is a minimal lock-free Treiber stack:

cpp

#include <atomic>
#include <optional>
#include <utility> // std::move

template<typename T>
class LockFreeStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T val) {
        Node* node = new Node{std::move(val), nullptr};
        node->next = head_.load(std::memory_order_relaxed);
        // compare_exchange_weak may fail spuriously; the loop simply retries.
        // On failure it reloads the current head into node->next.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }

    std::optional<T> pop() {
        Node* old = head_.load(std::memory_order_acquire);
        while (old && !head_.compare_exchange_weak(old, old->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_acquire)) {}
        if (!old) return std::nullopt;
        T val = std::move(old->value);
        delete old; // safe only if no other thread holds 'old' — see ABA note below
        return val;
    }
};

Note: This stack has the ABA problem and unsafe deletion once pop can run concurrently on several threads: one thread can read old->next after another thread has already deleted old. Production lock-free code requires hazard pointers or epoch-based reclamation. This example demonstrates the pattern; always add memory reclamation before shipping.
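
When consumers can take every element at once instead of one at a time, there is a simpler way to sidestep the reclamation hazard. The sketch below is a different technique from hazard pointers or epochs: it swaps per-node pop for a whole-list `exchange`, so no thread ever dereferences a node that another thread might delete. The names `DrainableStack`, `drain`, and `push_then_drain` are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class DrainableStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    ~DrainableStack() { drain(); }  // free anything still linked

    void push(T val) {
        Node* node = new Node{std::move(val),
                              head_.load(std::memory_order_relaxed)};
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {}
    }

    // Detach the whole list in one atomic step, then walk it privately.
    std::vector<T> drain() {
        Node* node = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<T> out;
        while (node) {
            out.push_back(std::move(node->value));
            Node* next = node->next;
            delete node;  // safe: this thread owns the detached list
            node = next;
        }
        return out;
    }
};

// Hypothetical driver: several producers push, then one call drains.
std::size_t push_then_drain(int producers, int per_producer) {
    assert(producers >= 0 && per_producer >= 0);
    DrainableStack<int> stack;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&stack, per_producer] {
            for (int i = 0; i < per_producer; ++i) stack.push(i);
        });
    }
    for (auto& t : threads) t.join();
    return stack.drain().size();
}
```

The trade-off is that elements come out in batches, in LIFO order within each batch. If callers need true one-at-a-time pop under contention, this does not replace a proper reclamation scheme.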

C++20 synchronization primitives

std::latch is a single-use countdown for fan-in: threads decrement it, and every thread blocked in wait() unblocks when it reaches zero.

cpp

#include <chrono>
#include <latch>
#include <thread>
#include <vector>
#include <iostream>

int main() {
    constexpr int N = 4;
    std::latch all_ready{N};

    std::vector<std::jthread> workers;
    for (int i = 0; i < N; ++i) {
        workers.emplace_back([&, i] {
            // simulate setup work
            std::this_thread::sleep_for(std::chrono::milliseconds(i * 10));
            std::cout << "worker " << i << " ready\n";
            all_ready.count_down(); // decrement; last one unblocks the waiter
        });
    }

    all_ready.wait(); // blocks until count reaches zero
    std::cout << "all workers ready, starting main task\n";
} // jthread destructor joins automatically

std::barrier is reusable — after all threads arrive, it resets and can be used again. It also accepts a completion callback run by exactly one thread at each sync point, which is ideal for phase-based pipelines.

Common Patterns

Double-checked locking with atomics (C++11+)

cpp

#include <atomic>
#include <mutex>
#include <memory>

class Config {
    static std::atomic<Config*> instance_;
    static std::mutex init_mutex_;
public:
    static Config* get() {
        Config* p = instance_.load(std::memory_order_acquire);
        if (!p) {
            std::lock_guard lock(init_mutex_);
            p = instance_.load(std::memory_order_relaxed);
            if (!p) {
                p = new Config();
                instance_.store(p, std::memory_order_release);
            }
        }
        return p;
    }
};
std::atomic<Config*> Config::instance_{nullptr};
std::mutex Config::init_mutex_;
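
When the object needs no constructor arguments and can live until program exit, a function-local static is often a simpler alternative: since C++11 the language guarantees that its initialization runs exactly once even if several threads reach it concurrently. The sketch below uses a hypothetical `Settings` class, not the `Config` class above.

```cpp
#include <cassert>
#include <thread>

class Settings {  // hypothetical class for illustration
    Settings() = default;

public:
    static Settings& get() {
        static Settings instance;  // initialized once, thread-safely
        return instance;
    }
    int verbosity = 0;
};

// Hypothetical check: two threads racing to first use get the same object.
bool same_instance_across_threads() {
    Settings* a = nullptr;
    Settings* b = nullptr;
    std::thread t1([&a] { a = &Settings::get(); });
    std::thread t2([&b] { b = &Settings::get(); });
    t1.join();
    t2.join();
    assert(a != nullptr && b != nullptr);
    return a == b;
}
```

Hand-written double-checked locking remains useful when initialization must be retried after failure or driven by runtime arguments.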

Cooperative cancellation with std::stop_token (C++20)

cpp

#include <chrono>
#include <thread>
#include <stop_token>
#include <iostream>

int main() {
    std::jthread worker([](std::stop_token token) {
        while (!token.stop_requested()) {
            // do incremental work
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
        std::cout << "worker cancelled cleanly\n";
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    worker.request_stop(); // signals the stop_token; jthread joins on destruction
}

What Can Go Wrong

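Using memory_order_relaxed for a flag. A relaxed load of your ready flag still observes the store eventually in practice, but it creates no happens-before edge with the producer's release store, so the consumer's subsequent read of the payload is a data race and may see stale or partially written data. (Hoisting the load out of the loop is the failure mode of a plain, non-atomic flag.) Always use at least memory_order_acquire on the consumer side of a flag that publishes data.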

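Mixing std::atomic access with plain access. If one thread accesses an object through an atomic operation while another accesses the same object through a plain read or write (for example via a cast, or alongside a C++20 std::atomic_ref), and at least one of the accesses is a write, that is a data race and undefined behavior. Every concurrent access to a shared variable must go through the same atomic interface.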

Assuming compare_exchange_strong is always better than weak. In a retry loop, compare_exchange_weak is preferred because on architectures with LL/SC instructions (such as ARM and RISC-V) the strong variant wraps an internal retry loop, so you end up with a loop nested inside your own. Use strong when a spurious failure would be semantically incorrect, typically a one-shot attempt outside a loop.
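
The two cases can be sketched side by side. All names here (`atomic_store_max`, `try_claim`, `RaceResult`, `run_race`) are illustrative. The maximum update retries anyway, so a spurious failure costs one extra iteration; the claim is a single attempt, where a spurious failure would wrongly report that someone else won.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Retry loop: weak is fine because a spurious failure just loops again.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_relaxed)) {
        // On failure, `current` now holds the latest value; re-check it.
    }
}

// One-shot attempt: strong, so `false` always means another thread claimed it.
bool try_claim(std::atomic<bool>& claimed) {
    bool expected = false;
    return claimed.compare_exchange_strong(expected, true,
                                           std::memory_order_acq_rel);
}

struct RaceResult {
    int max_seen;
    int winners;
};

// Hypothetical driver: thread `id` proposes id * 10 as the maximum and
// tries to claim a shared token.
RaceResult run_race(int threads) {
    assert(threads > 0);
    std::atomic<int> max_value{0};
    std::atomic<bool> claimed{false};
    std::atomic<int> winners{0};

    std::vector<std::thread> pool;
    for (int id = 1; id <= threads; ++id) {
        pool.emplace_back([&, id] {
            atomic_store_max(max_value, id * 10);
            if (try_claim(claimed)) {
                winners.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) t.join();
    return {max_value.load(), winners.load()};
}
```

With 8 threads, `run_race(8)` leaves the maximum at 80 and reports exactly one winner, regardless of scheduling.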

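Over-synchronizing with seq_cst. On x86, a sequentially consistent store compiles to XCHG (implicitly locked) or MOV plus MFENCE, which drains the store buffer before later loads can proceed; seq_cst loads are plain MOVs, and read-modify-write operations carry a LOCK prefix regardless of the ordering you request. Profile before reaching for the default; acquire/release is almost always sufficient for producer–consumer patterns.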

Quick Reference

Need	Tool or memory order
Statistic counter, no ordering needed	`relaxed`
Publish data to a consumer	producer: `release`, consumer: `acquire`
Read-modify-write in a data structure	`acq_rel`
Global total order required	`seq_cst` (default)
One-shot fan-in	`std::latch`
Reusable phase barrier	`std::barrier`
Cooperative cancellation	`std::jthread` + `std::stop_token`
Lock-free CAS loop	`compare_exchange_weak`

Happens-before is established when an acquire load reads the value written by a release store on the same atomic.
std::atomic<T> requires T to be trivially copyable.
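On x86, acquire loads and release stores compile to plain MOVs (the hardware is TSO); the ordering only restricts compiler reordering. On ARM they need dedicated instructions (LDAR/STLR on AArch64) or explicit barriers (DMB on ARMv7).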
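
The trivially-copyable requirement can be checked at compile time. The sketch below uses a hypothetical `Point` struct and `move_right` helper to show a small aggregate updated atomically as one unit with a CAS loop. Whether such a type is lock-free depends on its size and the target, which `is_always_lock_free` reports.

```cpp
#include <atomic>
#include <cassert>
#include <type_traits>

struct Point {  // hypothetical payload: two ints, no padding
    int x;
    int y;
};
static_assert(std::is_trivially_copyable_v<Point>,
              "std::atomic<T> requires a trivially copyable T");

// Hypothetical helper: atomically shifts x by dx, leaving y untouched, and
// returns the value that was stored.
Point move_right(std::atomic<Point>& p, int dx) {
    Point old = p.load(std::memory_order_relaxed);
    Point next{};
    do {
        next = Point{old.x + dx, old.y};
        // On failure, `old` is refreshed with the current value.
    } while (!p.compare_exchange_weak(old, next,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed));
    return next;
}

// True on targets where this atomic never falls back to a lock.
constexpr bool point_is_lock_free = std::atomic<Point>::is_always_lock_free;
```

Compare-exchange on a struct compares object representations, so types with padding bytes need extra care; keeping the payload padding-free, as here, avoids that question.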

What's Next

Coroutines and async I/O — suspend-and-resume as an alternative to thread-per-task models
Memory model deep dive — how value semantics and move interact with concurrent ownership
Lambda captures in concurrent code — lifetime hazards when lambdas outlive their enclosing scope
Lock-free memory reclamation — hazard pointers and epoch-based reclamation to make the lock-free stack production-ready