CPU Affinity and Thread Isolation in HFT C++

Why thread pinning matters in HFT

On a modern NUMA machine, the OS scheduler is free to migrate an unpinned thread to any core, on either socket. An unpinned thread pays for this in three ways:

L1/L2 cache cold start: 50–200 ns
NUMA cross-socket access: 50–100 ns extra memory latency per access
Scheduler latency jitter: 1–50 µs interruptions

For a strategy that needs to act in < 1 µs, these are catastrophic. Solution: pin critical threads to dedicated, isolated cores.

Pinning a thread to a specific core

cpp

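#include <pthread.h>
#include <sched.h>
#include <system_error>  // std::system_error, std::generic_category
#include <thread>
#include <utility>       // std::forward

void SetRealtimePriority(int priority);  // defined under "Real-time scheduling priority"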

void PinThreadToCore(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    int ret = pthread_setaffinity_np(
        pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (ret != 0)
        throw std::system_error(ret, std::generic_category(), "setaffinity");
}

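// Pin at thread creation (the `auto&&` parameter needs C++20).
// An exception thrown by PinThreadToCore or SetRealtimePriority escapes the
// thread function and ends the process via std::terminate.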
std::thread CreatePinnedThread(int core_id, auto&& fn) {
    std::thread t([core_id, fn = std::forward<decltype(fn)>(fn)]() mutable {
        PinThreadToCore(core_id);
        SetRealtimePriority(99);
        fn();
    });
    return t;
}

// Usage (assumes a std::atomic<bool> shutdown and a feed_handler defined elsewhere)
auto feed_thread = CreatePinnedThread(4, [&] {
    while (!shutdown) {
        feed_handler.Poll();
    }
});
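
Pinning can fail quietly in practice, for example when the requested core is outside the container's cpuset. The following is a minimal sketch, assuming Linux with glibc, of how the result could be checked. PinAndVerify is a hypothetical helper name, not part of the code above; it relies only on pthread_setaffinity_np, pthread_getaffinity_np and sched_getcpu.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: pin the calling thread to core_id, then confirm that
// the kernel reports a single-core mask and that we are running on that core.
// Returns false if the core cannot be used (e.g. outside the allowed cpuset).
bool PinAndVerify(int core_id) {
    cpu_set_t want;
    CPU_ZERO(&want);
    CPU_SET(core_id, &want);
    if (pthread_setaffinity_np(pthread_self(), sizeof(want), &want) != 0)
        return false;

    // Read the mask back instead of trusting the return code alone.
    cpu_set_t got;
    CPU_ZERO(&got);
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) != 0)
        return false;

    return CPU_COUNT(&got) == 1 && CPU_ISSET(core_id, &got) &&
           sched_getcpu() == core_id;
}
```

Returning a bool rather than throwing lets the caller decide whether an unpinned thread is fatal or only worth a log line.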

NUMA topology awareness

cpp

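#include <numa.h>   // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>   // printf
#include <new>      // std::bad_alloc

// Allocate memory on the same NUMA node as the CPU core.
// Call numa_available() once at startup and bail out if it returns -1.
// Release the block with numa_free(p, size), not free().
void* AllocOnNuma(size_t size, int numa_node) {
    void* p = numa_alloc_onnode(size, numa_node);
    if (!p) throw std::bad_alloc{};
    return p;
}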

// Check which NUMA node a core belongs to
int GetNumaNode(int core_id) {
    return numa_node_of_cpu(core_id);
}

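// For memory shared between threads on different sockets, allocate it on the
// node of the thread that reads it on the hot path. On Intel platforms with
// DDIO, NIC DMA writes land in the last-level cache of the socket the NIC is
// attached to, so keep the feed-handler thread on that socket.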

// Print NUMA topology
void PrintNumaLayout() {
    int nodes = numa_max_node() + 1;
    for (int n = 0; n < nodes; ++n) {
        printf("NUMA node %d: ", n);
        struct bitmask* mask = numa_allocate_cpumask();
        numa_node_to_cpus(n, mask);
        for (int c = 0; c < numa_num_possible_cpus(); ++c)
            if (numa_bitmask_isbitset(mask, c)) printf("%d ", c);
        printf("\n");
        numa_free_cpumask(mask);
    }
}
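
When libnuma is not available, a similar placement can often be had from first touch. This is a different technique from numa_alloc_onnode: under Linux's default (local) memory policy, a page is backed by the node of the CPU that first writes to it. The sketch below assumes that default policy is in effect (no numactl or cgroup override) and that it runs on the already-pinned thread. AllocFirstTouch and FreeFirstTouch are hypothetical names.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <new>

// Hypothetical helper: map anonymous memory and write to every page from the
// calling thread. With the default local-allocation policy each page is then
// placed on the NUMA node of the core this thread is running on.
void* AllocFirstTouch(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) throw std::bad_alloc{};

    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile char* bytes = static_cast<volatile char*>(p);
    for (std::size_t off = 0; off < size; off += page)
        bytes[off] = 0;  // the write faults the page in on the local node
    return p;
}

// Release a block obtained from AllocFirstTouch.
void FreeFirstTouch(void* p, std::size_t size) {
    munmap(p, size);
}
```

The explicit libnuma call is the safer choice when the allocating thread and the consuming thread differ; first touch only helps if the consumer itself does the touching.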

Isolating cores from the OS (isolcpus)

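The most important step: keep the kernel from scheduling ordinary tasks on your HFT cores. Per-CPU kernel threads and any interrupts routed to those cores still run there, so treat isolation as a large reduction in noise, not a guarantee.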

In /etc/default/grub:

bash

GRUB_CMDLINE_LINUX="isolcpus=4,5,6,7 nohz_full=4,5,6,7 rcu_nocbs=4,5,6,7"

Then run update-grub (Debian/Ubuntu; on RHEL-family systems regenerate the config with grub2-mkconfig) and reboot.

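isolcpus: removes cores from the general scheduler pool; only threads explicitly pinned to them run there
nohz_full: stops the periodic timer tick on those cores while a single task is runnable (eliminates 1–4 µs timer interrupts); needs a kernel built with CONFIG_NO_HZ_FULL
rcu_nocbs: offloads RCU callbacks to other cores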

After boot, verify:

bash

cat /sys/devices/system/cpu/isolated           # should show 4-7 (kernel range format)
awk 'NR > 1 {print $1, $6}' /proc/interrupts   # per-IRQ counts for CPU4 (field 6); should barely change between runs
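
The same check can be done at process startup, so a strategy refuses to run on a core that was not actually isolated. The sketch below parses the kernel's cpulist format (for example "4-7" or "2,4-7,12") as found in /sys/devices/system/cpu/isolated. ParseCpuList and IsCoreIsolated are hypothetical helper names, and malformed tokens are skipped as a simplification.

```cpp
#include <fstream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Parse the kernel's cpulist format, e.g. "4-7" or "2,4-7,12", into a set.
// Empty or non-numeric tokens are ignored rather than reported.
std::set<int> ParseCpuList(const std::string& text) {
    std::set<int> cpus;
    std::stringstream ss(text);
    std::string token;
    while (std::getline(ss, token, ',')) {
        try {
            const auto dash = token.find('-');
            const int first = std::stoi(token.substr(0, dash));
            const int last = (dash == std::string::npos)
                                 ? first
                                 : std::stoi(token.substr(dash + 1));
            for (int c = first; c <= last; ++c) cpus.insert(c);
        } catch (const std::exception&) {
            // e.g. an empty file when isolcpus is not set
        }
    }
    return cpus;
}

// Hypothetical startup check: true if core_id appears in the isolated list.
bool IsCoreIsolated(int core_id) {
    std::ifstream f("/sys/devices/system/cpu/isolated");
    std::string line;
    std::getline(f, line);
    return ParseCpuList(line).count(core_id) != 0;
}
```

A caller might log a warning or abort when IsCoreIsolated(core_id) is false before pinning the hot thread there.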

Real-time scheduling priority

cpp

#include <sched.h>
#include <sys/mman.h>    // mlockall
#include <cerrno>
#include <cstddef>
#include <system_error>

void SetRealtimePriority(int priority) {
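    // SCHED_FIFO: real-time, never preempted by normal tasks.
    // Priority 1–99 (99 = highest). Requires CAP_SYS_NICE (e.g. root) or a
    // sufficient RLIMIT_RTPRIO; otherwise the call fails with EPERM.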
    struct sched_param param { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
        throw std::system_error(errno, std::generic_category(), "setscheduler");
}

// Lock all memory into RAM (prevent page faults during execution)
void LockMemory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        throw std::system_error(errno, std::generic_category(), "mlockall");
}

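// Pre-fault the stack: touch one byte per page so the hot loop takes no stack
// page faults. Keep the size well below the thread's stack limit (commonly
// 8 MiB on Linux); touching the full limit would overflow the stack.
void PrefaultStack() {
    constexpr size_t kPrefaultBytes = 1024 * 1024;  // assumed budget; tune to your stack depth
    constexpr size_t kPageSize = 4096;              // assumed page size
    volatile char touch[kPrefaultBytes];
    for (size_t i = 0; i < kPrefaultBytes; i += kPageSize)
        touch[i] = 0;
}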

// Call all four before entering the hot loop
void InitRealtimeThread(int core_id) {
    PinThreadToCore(core_id);
    SetRealtimePriority(99);
    LockMemory();
    PrefaultStack();
}
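
One caveat with a SCHED_FIFO thread that spins without ever blocking: on kernels that apply real-time throttling, the defaults in /proc/sys/kernel/sched_rt_runtime_us (950000) and sched_rt_period_us (1000000) let RT tasks run for only 950 ms of every second, and a value of -1 disables the limit. The sketch below shows one way to detect this at startup. ReadLongFromFile, RtThrottlingEnabled and SpinningRtThreadMayBeThrottled are hypothetical helper names.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Read a single integer from a procfs/sysfs file; nullopt if unreadable.
std::optional<long> ReadLongFromFile(const std::string& path) {
    std::ifstream f(path);
    long value = 0;
    if (f >> value) return value;
    return std::nullopt;
}

// True when the RT runtime budget is finite and shorter than the period,
// i.e. a continuously spinning SCHED_FIFO thread could be paused.
bool RtThrottlingEnabled(long runtime_us, long period_us) {
    return runtime_us >= 0 && runtime_us < period_us;
}

// Hypothetical startup check combining the two sysctl files.
// Returns false when either file cannot be read.
bool SpinningRtThreadMayBeThrottled() {
    const auto runtime = ReadLongFromFile("/proc/sys/kernel/sched_rt_runtime_us");
    const auto period = ReadLongFromFile("/proc/sys/kernel/sched_rt_period_us");
    return runtime && period && RtThrottlingEnabled(*runtime, *period);
}
```

Whether to disable throttling is a deployment decision: it removes a pause in the hot thread but also removes the safety net that keeps a runaway RT thread from starving the rest of that core.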

Measuring scheduler jitter

cpp

#include <cstdint>
#include <algorithm>
#include <numeric>
#include <vector>

inline uint64_t Rdtsc() {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

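// Convert a TSC tick delta to nanoseconds. Convert deltas, not absolute
// readings: a raw TSC value can exceed 2^53 and lose precision as a double.
inline uint64_t TicksToNs(uint64_t ticks) {
    static const double NS_PER_TICK = 1e9 / 3.8e9; // calibrate for your CPU
    return static_cast<uint64_t>(ticks * NS_PER_TICK);
}

struct JitterStats {
    uint64_t min_ns, max_ns, p99_ns, p999_ns;
    double mean_ns;
};

JitterStats MeasureSchedulerJitter(int iterations = 100000) {
    std::vector<uint64_t> samples(iterations);
    uint64_t prev = Rdtsc();

    for (int i = 0; i < iterations; ++i) {
        // Tight spin: each sample is the gap between consecutive TSC reads.
        // The minimum is loop overhead; larger gaps mean the core was
        // interrupted or the thread was descheduled.
        uint64_t now = Rdtsc();
        samples[i] = TicksToNs(now - prev);
        prev = now;
    }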

    std::sort(samples.begin(), samples.end());

    return {
        .min_ns  = samples.front(),
        .max_ns  = samples.back(),
        .p99_ns  = samples[iterations * 99 / 100],
        .p999_ns = samples[iterations * 999 / 1000],
        .mean_ns = static_cast<double>(
            std::accumulate(samples.begin(), samples.end(), 0ULL)) / iterations,
    };
}
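
The hard-coded 3.8 GHz constant above has to be replaced with a measured value on real hardware. One way to calibrate, sketched here under the assumption of an x86-64 CPU with an invariant TSC and a GCC or Clang toolchain, is to compare the TSC against CLOCK_MONOTONIC_RAW over a short busy-wait window. MonotonicNs and CalibrateNsPerTick are hypothetical helper names.

```cpp
#include <time.h>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc; x86/x86-64 with GCC or Clang assumed

// Nanoseconds from CLOCK_MONOTONIC_RAW (not adjusted by NTP slewing).
inline uint64_t MonotonicNs() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return static_cast<uint64_t>(ts.tv_sec) * 1000000000ULL +
           static_cast<uint64_t>(ts.tv_nsec);
}

// Hypothetical helper: estimate nanoseconds per TSC tick by reading both
// clocks at the start and end of a busy-wait window of window_ns.
double CalibrateNsPerTick(uint64_t window_ns = 20'000'000) {
    const uint64_t t0 = MonotonicNs();
    const uint64_t c0 = __rdtsc();
    uint64_t t1 = t0;
    while (t1 - t0 < window_ns) t1 = MonotonicNs();
    const uint64_t c1 = __rdtsc();
    return static_cast<double>(t1 - t0) / static_cast<double>(c1 - c0);
}
```

A longer window gives a tighter estimate; running the calibration once at startup on the pinned core and caching the result keeps it off the hot path.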

Typical results by tuning level (indicative only; measure on your own hardware):

Setup	P99 jitter	Max jitter
Default Linux	10–100 µs	1–10 ms
isolcpus + nohz_full	1–5 µs	10–100 µs
isolcpus + SCHED_FIFO	0.5–2 µs	5–20 µs
Full RT kernel (PREEMPT_RT)	0.2–1 µs	2–10 µs

Hyperthreading considerations

On HT (SMT) CPUs, two logical cores share one physical core's execution units. For latency-critical work: disable hyperthreading, or pin to only one logical core per physical core and leave its sibling idle.

bash

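# Disable HT at runtime (Linux 4.19+, as root): one switch for all sibling threads
echo off > /sys/devices/system/cpu/smt/control

# Older kernels: offline the second sibling of each physical core.
# thread_siblings_list reads "0,24" or "0-1"; entries with no sibling are skipped.
for f in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do
    [ -r "$f" ] || continue    # CPU already offlined by an earlier iteration
    peer=$(tr ',-' '  ' < "$f" | awk 'NF > 1 {print $2}')
    [ -n "$peer" ] && echo 0 > /sys/devices/system/cpu/cpu${peer}/online
done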

Or set nosmt on the kernel command line. The older noht spelling is a legacy x86 option and may be ignored by current kernels.

When HT helps: throughput-oriented workloads (many independent tasks). When HT hurts: latency-critical single-thread performance — the sibling thread competes for L1 bandwidth, branch predictor state, and execution ports.
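
If hyperthreading stays enabled, it helps to confirm at startup that the core chosen for the hot thread has no online sibling. The sketch below reads the same thread_siblings_list file used in the shell loop above, on the assumption that the list contains only online CPUs, so a lone core reads as a single number such as "4". HasOnlineSibling and CoreSharesPhysicalCore are hypothetical helper names.

```cpp
#include <fstream>
#include <string>

// A siblings list with more than one CPU contains a ',' ("4,28") or a
// '-' ("4-5"); a core with no online sibling is just "4".
bool HasOnlineSibling(const std::string& siblings_list) {
    return siblings_list.find_first_of(",-") != std::string::npos;
}

// Hypothetical startup check: true if core_id currently shares its physical
// core with another online logical CPU. Returns false if the file is missing.
bool CoreSharesPhysicalCore(int core_id) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(core_id) +
                    "/topology/thread_siblings_list");
    std::string line;
    std::getline(f, line);
    return HasOnlineSibling(line);
}
```

When the check returns true, the options are to offline the sibling, or to make sure nothing is ever scheduled on it by including it in the isolated set and leaving it idle.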