thread_local Storage Duration
C++ thread_local — per-thread variables, initialization semantics, destruction order, async pitfalls, and performance characteristics.
thread_local (since C++11) is a storage-class specifier that gives each thread its own independent instance of a variable, initialized per thread before first use and destroyed when that thread exits.
Overview
C++11 defines four storage durations: automatic, static, dynamic, and thread. thread_local variables have thread storage duration, which behaves like static storage duration except that the lifetime is tied to a thread instead of the program. Every thread gets its own copy, so reads and writes by name never race with other threads and the variable itself needs no synchronization. That guarantee ends once a pointer or reference to the instance is handed to another thread.
At namespace scope the keyword can be combined with static or extern, which then control linkage only and do not change the storage duration. A thread_local data member must also be declared static, and a block-scope thread_local is implicitly static. A function-local thread_local variable is initialized the first time control passes through its declaration in each thread, not just the first time any thread reaches it.
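The per-thread initialization rule can be illustrated with a small sketch. The helper names bump_call_count and count_on_new_thread are made up for this example and are not part of any library.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: counts calls made on the current thread only.
int bump_call_count() {
    thread_local int call_count = 0;  // initialized once per thread, on first pass
    return ++call_count;
}

// Runs `calls` invocations on a fresh thread and returns the last count seen there.
int count_on_new_thread(int calls) {
    int last = 0;
    std::thread t([&] {
        for (int i = 0; i < calls; ++i) last = bump_call_count();
    });
    t.join();  // join synchronizes, so reading `last` afterwards is safe
    return last;
}
```

Calling bump_call_count() twice on one thread and then count_on_new_thread(3) shows that the new thread starts again from zero, while the caller's count is left untouched.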
thread_local objects have their destructors called when the owning thread exits, in reverse order of construction — mirroring the rules for static objects at program exit.
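A minimal sketch of the destruction order, assuming a worker thread that constructs two thread_local objects in sequence. Tracer, g_log, and the touch_* functions are illustrative names only.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <utility>
#include <vector>

std::vector<std::string> g_log;  // hypothetical log; only one thread writes to it at a time here

struct Tracer {
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { g_log.push_back("~" + name); }  // records when this object is destroyed
    std::string name;
};

void touch_first()  { thread_local Tracer a{"first"};  (void)a; }
void touch_second() { thread_local Tracer b{"second"}; (void)b; }

// Constructs two thread_local objects on a worker thread and returns the
// destruction order observed once that thread has exited.
std::vector<std::string> destruction_order() {
    g_log.clear();
    std::thread t([] { touch_first(); touch_second(); });
    t.join();  // the worker's thread_local destructors have run by the time join returns
    return g_log;
}
```

The object constructed last on the worker thread is destroyed first, and both destructors have completed before the joining thread continues.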
Syntax
#include <string>
#include <utility>
#include <vector>
// Namespace scope: external linkage by default
thread_local int g_counter = 0;
// Internal linkage (C++11)
static thread_local std::string t_prefix;
// Function-local: initialized once per thread on first entry
void worker() {
    thread_local int call_count = 0;  // implicitly static; not re-initialized
    ++call_count;
}
// Static data member: declared in the class, defined at namespace scope
struct Ctx {
    thread_local static std::vector<int> scratch;  // C++11
};
thread_local std::vector<int> Ctx::scratch;
// C++20: thread_local is allowed on structured bindings (P1091R3)
// thread_local auto [x, y] = std::make_pair(1, 2);  // C++20 and later
Examples
Thread-Local RNG
The canonical motivation: std::mt19937 is not thread-safe, but wrapping it in a mutex serializes every random draw. A thread-local engine eliminates both the race and the contention.
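// C++11: each thread seeds its own engine from std::random_device
#include <algorithm>
#include <execution>
#include <random>
#include <vector>
thread_local std::mt19937 tls_rng{std::random_device{}()};
int random_int(int lo, int hi) {
    return std::uniform_int_distribution{lo, hi}(tls_rng);  // C++17 CTAD; no mutex
}
// Works with the C++17 parallel algorithms (v is a std::vector<int> defined elsewhere).
// Use par, not par_unseq: unsequenced execution may interleave calls on one thread,
// and those calls would then mutate the same tls_rng without sequencing.
std::for_each(std::execution::par, v.begin(), v.end(),  // C++17
              [](int& x) { x = random_int(0, 99); });
Per-Thread Cache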
// Thread-local read-through cache: zero contention on hits.
// Item and global_store are assumed to be defined elsewhere.
struct ItemCache {
    thread_local static std::unordered_map<int, Item> store;  // C++11
    static const Item& get(int id) {
        auto it = store.find(id);
        if (it == store.end())  // miss: fetch first, so a throwing fetch caches nothing
            it = store.emplace(id, global_store.fetch(id)).first;  // global_store must be thread-safe
        return it->second;
    }
    static void invalidate(int id) { store.erase(id); }
};
thread_local std::unordered_map<int, Item> ItemCache::store;
Reusable Scratch Buffer
// Avoids per-call heap allocation in hot formatting paths (Record is assumed to be defined elsewhere)
std::string_view format_record(const Record& r, std::span<char> /*unused*/) {
    thread_local std::string buf;
    buf.clear();  // keeps the capacity from earlier calls
    std::format_to(std::back_inserter(buf),  // C++20
                   "[{}] {}: {}", r.timestamp, r.level, r.message);
    return buf;  // valid only until the next call on this thread
}
errno-style Per-Thread Error State
// errno is per-thread under POSIX threads and C11; replicate the pattern in C++
thread_local int tls_error_code = 0;
thread_local char tls_error_msg[256] = {};
void set_error(int code, std::string_view msg) noexcept {
    tls_error_code = code;
    // copy() writes at most size - 1 characters, leaving room for the terminator
    const std::size_t n = msg.copy(tls_error_msg, sizeof(tls_error_msg) - 1);
    tls_error_msg[n] = '\0';
}
int get_error() noexcept { return tls_error_code; }
void clear_error() noexcept { tls_error_code = 0; tls_error_msg[0] = '\0'; }
Best Practices
Prefer function-local thread_local over namespace-scope thread_local. A function-local thread_local is initialized lazily the first time the function runs on a given thread, which makes the initialization point deterministic and keeps the variable out of the global namespace.
Bound cache size explicitly. Thread-local caches grow without bound because no shared eviction policy can reach them. Cap size or add an LRU limit; otherwise long-lived thread pools accumulate unbounded memory.
Seed RNGs from a non-deterministic source. std::random_device{}() is adequate in most environments (the standard permits a deterministic fallback, so check your platform), but avoid seeding every engine from std::chrono::steady_clock: threads created in rapid succession can receive identical or nearly identical seeds and produce correlated sequences.
Check for fiber compatibility. TLS is implemented per OS thread. If your application uses fibers or coroutines that migrate between OS threads (Windows fibers, some green-thread libraries), code that resumes on a different carrier thread sees that thread's instance of a thread_local variable, not the one it was using before. MSVC's /GT flag adds fiber-safety support for static TLS; slots obtained with TlsAlloc remain per thread, not per fiber, so fiber-local data needs FlsAlloc instead.
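The advice on bounding cache size can be sketched as follows. This is a deliberately crude policy (flush everything when the cap is reached), not an LRU; load_item, local_cache, cached_item, and kMaxEntries are assumptions made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an expensive, thread-safe lookup.
std::string load_item(int id) { return "item-" + std::to_string(id); }

constexpr std::size_t kMaxEntries = 4;  // assumed cap; tune per workload and thread count

// Accessor for the calling thread's cache instance.
std::unordered_map<int, std::string>& local_cache() {
    thread_local std::unordered_map<int, std::string> cache;
    return cache;
}

// Returns a copy, because a later call on this thread may flush the cache.
std::string cached_item(int id) {
    auto& cache = local_cache();
    if (auto it = cache.find(id); it != cache.end()) return it->second;  // hit
    if (cache.size() >= kMaxEntries) cache.clear();  // crude eviction: flush when full
    return cache.emplace(id, load_item(id)).first->second;
}
```

With this shape, each thread's cache holds at most kMaxEntries entries, so the total footprint is bounded by the cap times the number of threads.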
Common Pitfalls
std::async and the Default Launch Policy
// current_thread_id() stands for any helper that returns an id for the calling thread
thread_local int t_id = current_thread_id();
auto fut = std::async([]() {  // C++11, default policy: async | deferred
    return t_id;  // unspecified which thread's t_id this reads
});
// With std::launch::deferred, the lambda runs on the thread that calls .get() or .wait().
// With std::launch::async, it runs on a new thread.
// The default policy lets the implementation choose either.
If you access thread_local variables inside a task launched with std::async, always specify std::launch::async explicitly; otherwise it is unspecified which thread's storage the task accesses.
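The difference between the two policies can be made visible with a short sketch. It assumes an implementation that runs std::launch::async tasks on a thread whose t_marker has never been set, which is what the standard's "as if in a new thread" wording describes; t_marker and marker_seen_by_task are illustrative names.

```cpp
#include <cassert>
#include <future>

thread_local int t_marker = 0;  // hypothetical per-thread marker; 0 means "never set on this thread"

// Sets the caller's marker, then reads the marker from inside a task
// launched with the given policy.
int marker_seen_by_task(std::launch policy) {
    t_marker = 42;  // changes the calling thread's instance only
    auto fut = std::async(policy, [] { return t_marker; });
    return fut.get();  // a deferred task runs right here, on the calling thread
}
```

With std::launch::deferred the task reads the caller's instance (42); with std::launch::async it reads the fresh instance of the task's own thread (0).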
Initialization of Non-POD Variables Can Throw
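thread_local variables with dynamic initialization (constructors, non-constant initializers) run that initialization on the thread that owns the instance, so a failure surfaces there and not on the thread that created it. For a function-local variable, an exception thrown by the initializer propagates out of the declaration to the caller; the variable stays uninitialized, and initialization is attempted again the next time control reaches the declaration on that thread. If nothing on that thread catches the exception, std::terminate is called. For a namespace-scope or static-member thread_local, an exception escaping the initializer calls std::terminate directly, so keep such initializers non-throwing; work that can throw std::bad_alloc belongs behind a function-local variable or an explicit initialization step.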
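For a function-local thread_local, an exception thrown during initialization behaves like any other exception leaving the declaration, and the language retries the initialization on the next pass. A minimal single-threaded sketch, with Resource, g_attempts, use_resource, and call_throws invented for illustration:

```cpp
#include <cassert>
#include <stdexcept>

int g_attempts = 0;  // hypothetical counter; only one thread touches it in this sketch

struct Resource {
    Resource() {
        // Fail on the first construction attempt only.
        if (++g_attempts == 1) throw std::runtime_error("first attempt fails");
    }
    int value = 7;
};

int use_resource() {
    thread_local Resource r;  // if the constructor throws, r stays uninitialized and is retried later
    return r.value;
}

// Returns true if use_resource() threw.
bool call_throws() {
    try {
        use_resource();
        return false;
    } catch (const std::runtime_error&) {
        return true;
    }
}
```

The first call throws and can be caught by the caller; the second call runs the constructor again and succeeds.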
DLL Unloading on Windows
On Windows, thread_local destructors for variables inside a DLL run during DLL_THREAD_DETACH and DLL_PROCESS_DETACH. If the DLL is unloaded while threads that hold those thread-local objects are still alive, the destructors will be called with a partially unloaded module — accessing vtables, static data, or calling into the DLL's code at that point is undefined behavior. Use explicit cleanup APIs (TlsAlloc/TlsFree) for shared library code that must tolerate hot-unload.
Destruction After std::atexit Handlers
Static-duration objects are destroyed after main() returns, and the main thread's thread_local objects are destroyed before them: the standard requires all of a thread's thread-local destructors to complete before the first static destructor starts. A thread_local destructor on the main thread can therefore still use static objects, but the destructor of a static object (or any code it calls) must not touch the main thread's thread_local objects, because they have already been destroyed. The risk is the reverse of the usual static-destruction-order problem.
Heavy Thread-Local Objects in Idle Pool Threads
Thread pool threads that initialize heavy thread_local objects (e.g., a 1 MB arena) hold that memory until the thread itself exits, even while the thread sits idle. With a pool of 32 threads each holding a large TLS object, the resident memory footprint is 32× that object's size for the pool's lifetime. Size thread-local caches appropriately for the thread count, not just the workload.
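One way to limit the footprint is to give idle workers an explicit release step. The sketch below assumes a pool that can call a hook when a worker goes idle; scratch_arena, run_task, and release_scratch are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accessor for the calling thread's scratch arena.
std::vector<char>& scratch_arena() {
    thread_local std::vector<char> arena;
    return arena;
}

// Simulates a task that needs `bytes` of scratch space.
void run_task(std::size_t bytes) {
    scratch_arena().resize(bytes);
}

// Intended to be called when a worker goes idle: swapping with an empty
// temporary hands the heap block to the temporary, which frees it.
void release_scratch() {
    std::vector<char>().swap(scratch_arena());
}
```

The arena is rebuilt lazily by the next run_task() call, so the cost is one reallocation per idle period in exchange for not pinning the memory.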
Comparison with Alternatives
| Mechanism | Per-thread | Synchronization | Typical overhead | Best for |
|---|---|---|---|---|
| thread_local | Yes | None | TLS segment lookup (1–5 ns) | RNG, caches, scratch buffers |
| std::mutex + shared | No | Required | Lock/unlock (10–100 ns) | Shared mutable state |
| std::atomic | No | Built-in | CAS (1–10 ns) | Counters, flags, lock-free queues |
| boost::thread_specific_ptr | Yes | None | Similar to TLS | Pre-C++11 or pointer semantics |
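On x86-64 Linux (ELF TLS), code in the main executable typically reaches a thread_local variable with a single %fs-relative instruction, which is effectively free. Position-independent code in a shared object usually defaults to the general-dynamic TLS model, and access to TLS owned by another .so takes the same route: a call to __tls_get_addr, which is measurably slower. Variables with dynamic initialization can add a per-access initialization check on top. Minimize cross-DSO thread-local access in hot paths.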
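Where a TLS lookup does show up in a profile, one common mitigation is to resolve the variable's address once and work through a local reference inside the loop. Compilers can often hoist the lookup themselves, so treat this as a sketch to measure, not a guaranteed win; t_sum and accumulate_local are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

thread_local std::uint64_t t_sum = 0;  // hypothetical per-thread accumulator

// Adds every value to this thread's accumulator and returns the new total.
std::uint64_t accumulate_local(const std::vector<int>& values) {
    std::uint64_t& sum = t_sum;  // resolve the thread-local address once, outside the loop
    for (int v : values) sum += static_cast<std::uint64_t>(v);
    return sum;
}
```

The reference must not outlive the current call or be handed to another thread, since it points at the calling thread's instance.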
See Also
std::mutex: synchronization for shared mutable state
std::atomic: lock-free operations on single values
std::jthread: C++20 joinable thread with stop token
std::async: launch policies and their interaction with TLS