CPU Affinity and Thread Isolation in HFT C++
Pinning C++ threads to CPU cores, isolating cores from the OS scheduler, setting real-time priority, and measuring scheduling jitter for HFT systems.
Why thread pinning matters in HFT
On a modern NUMA machine, the OS scheduler is free to migrate your thread to any core and to interrupt it wherever it runs. Typical costs:
- L1/L2 cache cold start: 50–200 ns
- NUMA cross-socket access: 50–100 ns extra memory latency per access
- Scheduler latency jitter: 1–50 µs interruptions
For a strategy that needs to act in < 1 µs, these are catastrophic. Solution: pin critical threads to dedicated, isolated cores.
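To observe migration directly, a minimal sketch can poll the current core from an unpinned thread and count how often it changes. This assumes Linux with glibc's `sched_getcpu`; the helper name `CountCoreMigrations` is illustrative and not part of any library.

```cpp
#include <sched.h>   // sched_getcpu (glibc, Linux-specific)
#include <cstdint>

// Poll the current core `iterations` times and count how often the calling
// thread is seen on a different core than at the previous poll.
// Illustrative helper: it only detects migrations that happen between polls.
inline uint64_t CountCoreMigrations(uint64_t iterations) {
    uint64_t migrations = 0;
    int last_cpu = sched_getcpu();
    for (uint64_t i = 0; i < iterations; ++i) {
        const int cpu = sched_getcpu();
        if (cpu != last_cpu) {
            ++migrations;
            last_cpu = cpu;
        }
    }
    return migrations;
}
```

On an idle machine the count is often zero over short runs and grows under load. On a thread whose affinity is restricted to a single core it should stay at zero.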
Pinning a thread to a specific core
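#include <pthread.h>
#include <sched.h>
#include <system_error>  // std::system_error, std::generic_category
#include <thread>
#include <utility>       // std::forward
void SetRealtimePriority(int priority);  // defined under "Real-time scheduling priority"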
void PinThreadToCore(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    int ret = pthread_setaffinity_np(
        pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (ret != 0)
        throw std::system_error(ret, std::generic_category(), "setaffinity");
}
// Pin at thread creation (the auto&& parameter requires C++20)
std::thread CreatePinnedThread(int core_id, auto&& fn) {
    std::thread t([core_id, fn = std::forward<decltype(fn)>(fn)]() mutable {
        PinThreadToCore(core_id);
        SetRealtimePriority(99);
        fn();
    });
    return t;
}
// Usage (shutdown is a std::atomic<bool> stop flag and feed_handler a poller, both defined elsewhere)
auto feed_thread = CreatePinnedThread(4, [&] {
    while (!shutdown) {
        feed_handler.Poll();
    }
});
NUMA topology awareness
#include <numa.h>    // libnuma; link with -lnuma
#include <numaif.h>
#include <cstddef>
#include <cstdio>
#include <new>       // std::bad_alloc
// Allocate memory on the same NUMA node as the CPU core.
// Release it with numa_free(p, size), not free() or delete.
void* AllocOnNuma(size_t size, int numa_node) {
    void* p = numa_alloc_onnode(size, numa_node);
    if (!p) throw std::bad_alloc{};
    return p;
}
// Check which NUMA node a core belongs to
int GetNumaNode(int core_id) {
    return numa_node_of_cpu(core_id);
}
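// For NIC-fed data: where Intel DDIO is available, the NIC's DMA writes land in
// the last-level cache of the socket the NIC is attached to, so pin the feed
// thread to that socket rather than copying buffers across sockets in software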
// Print NUMA topology
void PrintNumaLayout() {
    int nodes = numa_max_node() + 1;
    for (int n = 0; n < nodes; ++n) {
        printf("NUMA node %d: ", n);
        struct bitmask* mask = numa_allocate_cpumask();
        numa_node_to_cpus(n, mask);
        for (int c = 0; c < numa_num_possible_cpus(); ++c)
            if (numa_bitmask_isbitset(mask, c)) printf("%d ", c);
        printf("\n");
        numa_free_cpumask(mask);
    }
}
Isolating cores from the OS (isolcpus)
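The most important step: keep the kernel from scheduling other tasks on your HFT cores. Isolation removes the cores from general load balancing; per-CPU kernel threads and any interrupts still routed to those cores can run there, so IRQ affinity needs separate attention.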
In /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=4,5,6,7 nohz_full=4,5,6,7 rcu_nocbs=4,5,6,7"
Then run update-grub and reboot.
- isolcpus: removes cores from the general scheduler pool
- nohz_full: disables the periodic timer tick on those cores while a single task is runnable there (eliminates 1–4 µs timer interrupts)
- rcu_nocbs: offloads RCU callbacks to other cores
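Boot parameters are easy to lose in a configuration change, so a process can also check at startup that its target core is really isolated. The following is a minimal sketch, assuming the kernel exposes the isolated set as a CPU list (for example `4-7`) in `/sys/devices/system/cpu/isolated`. The names `ParseCpuList` and `IsCoreIsolated` are illustrative.

```cpp
#include <fstream>
#include <set>
#include <sstream>
#include <string>

// Parse a kernel CPU list such as "4-7" or "0,2,4-7" into a set of core ids.
// Tokens without any digit (for example an empty line) are skipped.
inline std::set<int> ParseCpuList(const std::string& list) {
    std::set<int> cpus;
    std::stringstream ss(list);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (token.find_first_of("0123456789") == std::string::npos) continue;
        const std::size_t dash = token.find('-');
        const int first = std::stoi(token.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(token.substr(dash + 1));
        for (int c = first; c <= last; ++c) cpus.insert(c);
    }
    return cpus;
}

// True if core_id is listed in the isolated-CPU file. A missing or empty
// file is treated as "no cores isolated". The path parameter exists only
// so the function can be pointed at a test file.
inline bool IsCoreIsolated(
        int core_id,
        const std::string& path = "/sys/devices/system/cpu/isolated") {
    std::ifstream file(path);
    std::string line;
    std::getline(file, line);  // leaves line empty if the file cannot be read
    return ParseCpuList(line).count(core_id) > 0;
}
```

A caller might log a warning or refuse to start when `IsCoreIsolated(core_id)` returns false for a core it is about to pin a hot thread to.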
After boot, verify:
cat /sys/devices/system/cpu/isolated # should show 4-7
awk 'NR > 1 { print $1, $6 }' /proc/interrupts # CPU4 is the 6th field; its counts should barely change between runs
Real-time scheduling priority
#include <sched.h>
#include <sys/mman.h>    // mlockall
#include <cerrno>
#include <cstddef>
#include <system_error>
void SetRealtimePriority(int priority) {
    // SCHED_FIFO: real-time, never preempted by normal tasks
    // priority 1–99 (99 = highest); needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO
    struct sched_param param { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
        throw std::system_error(errno, std::generic_category(), "setscheduler");
}
// Lock all memory into RAM (prevent page faults during execution)
void LockMemory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        throw std::system_error(errno, std::generic_category(), "mlockall");
}
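// Pre-fault the stack. The touched size must stay well below the thread's
// stack limit (8 MiB by default on Linux), or the function itself overflows it.
void PrefaultStack() {
    constexpr size_t kPrefaultBytes = 1024 * 1024;  // 1 MiB; adjust to your stack limit
    volatile char touch[kPrefaultBytes];
    for (size_t i = 0; i < kPrefaultBytes; i += 4096)  // one write per 4 KiB page
        touch[i] = 0;
}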
// Call all four before entering the hot loop
void InitRealtimeThread(int core_id) {
    PinThreadToCore(core_id);
    SetRealtimePriority(99);
    LockMemory();
    PrefaultStack();
}
Measuring scheduler jitter
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>
inline uint64_t Rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
inline uint64_t RdtscNs() {
    static const double NS_PER_TICK = 1e9 / 3.8e9; // calibrate for your CPU
    return static_cast<uint64_t>(Rdtsc() * NS_PER_TICK);
}
struct JitterStats {
    uint64_t min_ns, max_ns, p99_ns, p999_ns;
    double mean_ns;
};
JitterStats MeasureSchedulerJitter(int iterations = 100000) {
    std::vector<uint64_t> samples(iterations);
    uint64_t prev = RdtscNs();
    for (int i = 0; i < iterations; ++i) {
        // Tight spin: the gap between consecutive reads is normally a few ns;
        // a much larger gap means the thread was interrupted or descheduled
        uint64_t now = RdtscNs();
        samples[i] = now - prev;
        prev = now;
    }
    std::sort(samples.begin(), samples.end());
    return {
        .min_ns = samples.front(),
        .max_ns = samples.back(),
        .p99_ns = samples[iterations * 99 / 100],
        .p999_ns = samples[iterations * 999 / 1000],
        .mean_ns = static_cast<double>(
            std::accumulate(samples.begin(), samples.end(), 0ULL)) / iterations,
    };
}
Typical results by setup (the first row is an untuned baseline):
| Setup | P99 jitter | Max jitter |
|---|---|---|
| Default Linux | 10–100 µs | 1–10 ms |
| isolcpus + nohz_full | 1–5 µs | 10–100 µs |
| isolcpus + SCHED_FIFO | 0.5–2 µs | 5–20 µs |
| Full RT kernel (PREEMPT_RT) | 0.2–1 µs | 2–10 µs |
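The 3.8 GHz constant in RdtscNs is only a placeholder, and a wrong value skews every figure in a table like this one. The following is a minimal calibration sketch, assuming an x86 CPU with an invariant TSC and the `__rdtsc` intrinsic from GCC or Clang. `CalibrateTicksPerNs` and `TicksToNs` are illustrative names.

```cpp
#include <x86intrin.h>  // __rdtsc (GCC/Clang, x86 only)
#include <chrono>
#include <cstdint>

// Estimate TSC ticks per nanosecond by spinning for `window` and comparing
// the TSC delta with the elapsed time reported by steady_clock.
// Assumes an invariant TSC, i.e. a tick rate independent of frequency scaling.
inline double CalibrateTicksPerNs(
        std::chrono::milliseconds window = std::chrono::milliseconds(50)) {
    using Clock = std::chrono::steady_clock;
    const auto t0 = Clock::now();
    const uint64_t c0 = __rdtsc();
    while (Clock::now() - t0 < window) {
        // busy-wait so the thread stays on-CPU during the measurement
    }
    const uint64_t c1 = __rdtsc();
    const auto t1 = Clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
    return static_cast<double>(c1 - c0) / ns;
}

// Convert a tick delta to nanoseconds. Converting deltas rather than
// absolute TSC readings avoids precision loss on large counter values.
inline double TicksToNs(uint64_t ticks, double ticks_per_ns) {
    return static_cast<double>(ticks) / ticks_per_ns;
}
```

Calibrating once at startup and storing the ratio keeps the conversion out of the hot path. A longer window gives a more stable estimate at the cost of startup time.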
Hyperthreading considerations
On HT (SMT) CPUs, two logical cores share one physical core's execution units. For latency-critical work: disable hyperthreading or pin only to physical cores.
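# Disable HT at runtime (Linux, run as root)
for f in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do
    [ -r "$f" ] || continue # sibling already taken offline
    core=$(tr ',-' '  ' < "$f" | awk '{ print $2 }') # second sibling = HT peer; the list may read "0,32" or "0-1"
    [ -n "$core" ] && echo 0 > /sys/devices/system/cpu/cpu${core}/online
done
Or set noht / nosmt on the kernel command line.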
When HT helps: throughput-oriented workloads (many independent tasks). When HT hurts: latency-critical single-thread performance — the sibling thread competes for L1 bandwidth, branch predictor state, and execution ports.
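A latency-critical process may want to confirm at startup whether SMT is still enabled. The following is a minimal sketch, assuming a Linux kernel recent enough to expose `/sys/devices/system/cpu/smt/active` (which reads `1` or `0`). `SmtState` and `ReadSmtState` are illustrative names.

```cpp
#include <fstream>
#include <string>

enum class SmtState { kUnknown, kOff, kOn };

// Read the kernel's SMT status file. Returns kUnknown when the file is
// missing or unreadable (older kernel, non-Linux system). The path
// parameter exists only so the function can be pointed at a test file.
inline SmtState ReadSmtState(
        const std::string& path = "/sys/devices/system/cpu/smt/active") {
    std::ifstream file(path);
    int value = 0;
    if (!(file >> value)) return SmtState::kUnknown;
    return value != 0 ? SmtState::kOn : SmtState::kOff;
}
```

If the result is `SmtState::kOn`, the sibling of each hot core should at least be kept idle, since it shares that core's execution resources.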