C++ Performance Optimization
Cache efficiency, branch prediction, SIMD, move semantics, PMR, profiling, and compiler flags — the full C++ performance toolkit.
Performance Optimizationsince C++11Systematic reduction of execution time and memory usage through algorithmic selection, data layout, and low-level tuning — validated by profilers, not intuition.
Overview
Performance comes in layers: algorithm complexity first, data layout second, micro-level tuning last. The order matters because a cache-friendly O(n²) is still O(n²). Profilers expose where time is actually spent; everything else is speculation.
The Memory Hierarchy
L1 cache ~4 cycles 32–64 KB per core
L2 cache ~12 cycles 256 KB–1 MB per core
L3 cache ~40 cycles 4–64 MB shared
DRAM ~100+ cycles GBs main memoryA single L3 miss to DRAM wastes ~100 instructions of compute time on a 4-IPC machine. Sequential access triggers hardware prefetch automatically; random pointer chasing does not. Most performance failures trace back to this single fact.
Examples
Cache-Friendly Data Layout
Blocked matrix traversal
constexpr int N = 1024;
float A[N][N], B[N][N];
// Naive: B[j][i] strides N*sizeof(float) per write — one cache miss per element
void transpose_naive() {
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
B[j][i] = A[i][j];
}
// Blocked: each tile fits in L1; sequential access within the tile — 10–20× faster
void transpose_blocked(int block = 32) {
for (int i = 0; i < N; i += block)
for (int j = 0; j < N; j += block)
for (int ii = i, ie = std::min(i + block, N); ii < ie; ++ii)
for (int jj = j, je = std::min(j + block, N); jj < je; ++jj)
B[jj][ii] = A[ii][jj];
}Struct of Arrays vs Array of Structs
// AoS — default layout; loads 32 bytes per particle, uses 24 (flags rarely touched)
struct Particle { float px, py, pz, vx, vy, vz, mass; uint32_t flags; };
Particle particles[100'000]; // digit separators: C++14
// SoA — 100% cache utilisation for position/velocity loops; trivially auto-vectorised
struct ParticlePool {
float px[100'000], py[100'000], pz[100'000];
float vx[100'000], vy[100'000], vz[100'000];
float mass[100'000];
uint32_t flags[100'000]; // cold — in its own cache lines
};
// Compiles to AVX2 on -O2 -march=native; processes 8 floats per instruction
void update_positions(ParticlePool& pool, float dt) {
for (int i = 0; i < 100'000; ++i) {
pool.px[i] += pool.vx[i] * dt;
pool.py[i] += pool.vy[i] * dt;
pool.pz[i] += pool.vz[i] * dt;
}
}Branch Prediction
Mispredicts drain the pipeline for ~15–20 cycles. Unsorted or randomly-valued data causes ~50% mispredict on a threshold branch.
// Branchless: arithmetic replaces the conditional; no pipeline flush
void process_branchless(int* data, int n) {
for (int i = 0; i < n; ++i)
data[i] *= 1 + (data[i] > 128); // adds 0 or 1 to multiplier
}
// C++20: annotate cold paths so the compiler places hot code inline
int parse(const char* input) {
if (!input) [[unlikely]] throw std::invalid_argument{"null"};
return std::atoi(input);
}[[likely]] and [[unlikely]] (C++20) affect code layout, not runtime speculation. They are most valuable on paths with known, extreme probabilities — error handlers, rarely-taken forks.
Move Semantics and Allocation Avoidance
C++11 move semantics eliminate copies on temporaries. NRVO constructs the return value directly in the caller's stack frame — zero copies, zero moves. Since C++17, copy elision for prvalues is mandatory.
// NRVO — v is constructed in-place at the call site (C++11+; mandatory elision C++17)
std::vector<int> make_range(int n) {
std::vector<int> v(n);
std::iota(v.begin(), v.end(), 0); // C++11
return v;
}
// Sink idiom: caller pays one copy (lvalue) or one move (rvalue) — never two
class Index {
std::vector<int> keys_;
public:
explicit Index(std::vector<int> keys) // C++11
: keys_(std::move(keys)) {}
};Avoid heap allocation in hot paths by reusing buffers:
class Processor {
std::vector<int> scratch_; // member; capacity is never released
void run(std::span<const int> input) { // std::span: C++20
scratch_.clear(); // O(1) — capacity preserved
scratch_.reserve(input.size());
for (int x : input)
if (x > 0) scratch_.push_back(x);
}
};C++17 std::pmr::monotonic_buffer_resource provides arena allocation — allocate from stack memory, free everything at once with zero per-object cost:
#include <memory_resource> // C++17
void batch_process(std::span<std::string_view> inputs) { // std::span: C++20
std::array<std::byte, 64 * 1024> buf;
std::pmr::monotonic_buffer_resource arena{buf.data(), buf.size()};
std::pmr::vector<std::pmr::string> tmp{&arena}; // C++17
for (auto sv : inputs)
tmp.emplace_back(sv, &arena); // no operator new for small strings
} // entire arena freed here in one shotSIMD and Auto-Vectorisation
SoA loops like update_positions above auto-vectorise with -O2 -march=native. Verify with:
g++ -O2 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed main.cppFor explicit SIMD control when the compiler fails, use intrinsics (x86-only; requires -mavx2):
#include <immintrin.h> // AVX2 intrinsics — not portable across ISAs
void saxpy_avx2(float* __restrict__ y, const float* __restrict__ x,
float a, int n) {
__m256 va = _mm256_set1_ps(a);
int i = 0;
for (; i + 8 <= n; i += 8)
_mm256_storeu_ps(y + i,
_mm256_fmadd_ps(va, _mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i)));
for (; i < n; ++i) y[i] += a * x[i]; // scalar tail
}Prefer auto-vectorisation. Hand-written intrinsics tie you to an ISA, generate fragile code, and are rarely faster than what a modern compiler produces with proper data layout and alignment hints.
Profiling Workflow
# Compile with frame pointers — required for accurate call stacks
g++ -O2 -g -fno-omit-frame-pointer -o myapp main.cpp
# Sample-based profiling (Linux)
perf record -g -- ./myapp
perf report --call-graph dwarf
# Cache miss breakdown
perf stat -e cycles,instructions,cache-misses,cache-references ./myapp
# Flame graph visualisation
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svgFor micro-benchmarks, Google Benchmark prevents dead-code elimination and accounts for statistical noise:
#include <benchmark/benchmark.h>
static void BM_transpose_blocked(benchmark::State& state) {
for (auto _ : state)
transpose_blocked();
}
BENCHMARK(BM_transpose_blocked)->Unit(benchmark::kMicrosecond);
BENCHMARK_MAIN();Compiler Flags
-O2 # safe; production default
-O3 # aggressive loop transforms; benchmark before committing
-march=native # all features of the build host — not portable
-march=x86-64-v3 # SSE4.2 + AVX2; safe for post-2015 x86 deployments
-flto=thin # cross-TU inlining (LLVM ThinLTO); use -flto for GCC
-fprofile-generate # pass 1: instrument binary; run representative workload
-fprofile-use # pass 2: recompile with PGO data — 10–30% gain on real code
-ffast-math # allows FP reassociation; breaks strict IEEE 754 semantics
-fno-omit-frame-pointer # profiler-friendly; ~1% overheadProfile-Guided Optimization is the highest-leverage single change available for long-running services. The compiler inlines and lays out hot paths based on actual execution frequencies, not static heuristics.
Best Practices
- Profile before touching anything — hotspots are rarely where intuition points; profilers are the only ground truth
- Algorithm first — O(n log n) beats a perfectly cache-tuned O(n²) past a few thousand elements
- Reserve early —
vector::reserveat construction eliminates all reallocation; check withcapacity()after a run - Separate hot and cold fields — rarely-accessed members (flags, debug state) should be at the end of a struct or in a separate allocation
- SoA for data-parallel loops — array-per-field enables auto-vectorisation and eliminates wasted cache bandwidth
- Return by value — NRVO and mandatory copy elision (C++17) make
return local_varzero-cost; don't use output parameters to avoid copies - LTO in production builds —
-flto=thinis low-risk and consistently improves cross-module call sites with no source changes - PGO for high-traffic binaries — two-pass build pays for itself in reduced CPU budget within weeks
Common Pitfalls
False sharing in multithreaded code. Cache lines are 64 bytes. Two threads writing to adjacent variables in the same cache line cause constant invalidation across cores, serialising what should be parallel work:
// BAD: counters[0] and counters[1] share a 64-byte cache line
std::atomic<int> counters[2]; // C++11
// Fix: pad each counter to its own cache line
struct alignas(64) PaddedCounter { // alignas: C++11
std::atomic<int> value;
};
PaddedCounter counters[2];-ffast-math in numerics code. The flag allows the compiler to reorder floating-point operations and eliminate NaN/Inf checks. This breaks Kahan summation, makes std::isnan unreliable, and can silently corrupt numerically sensitive algorithms. Enable only after validating output against a reference.
Over-inlining. Marking every small function [[clang::always_inline]] or [[gnu::always_inline]] inflates the instruction cache footprint. Inlining saves nanoseconds per call but wastes kilobytes of I-cache. Use only after the profiler identifies the call site as a measured hotspot.
Premature branchless conversion. Branchless arithmetic eliminates pipeline flushes but adds latency on predictable branches and prevents short-circuit evaluation. On sorted or otherwise predictable data, the original branchy code is often faster. Profile both.
Container constant-factor blindness. std::unordered_map is O(1) amortised but has high constant factors: hash computation, modulo, and pointer chasing. For maps under ~20 entries, a sorted std::array plus std::lower_bound is often faster due to cache locality. Always benchmark before choosing a container for a hot path.
See Also
- Move Semantics — rvalue references, forwarding references, and the Rule of Five
- RAII — deterministic resource management that eliminates allocation overhead in hot paths
- Concepts — C++20 constraints that improve inlining opportunities by reducing template instantiation depth
- Coroutines — C++20 cooperative multitasking without the thread-switch overhead of OS threads