std::simd — Portable SIMD (C++26)

std::simd<T, Abi>: type-safe portable SIMD in C++26 (P1928) — write vectorized code once, compile to SSE/AVX/NEON/SVE automatically without intrinsics or platform ifdefs.

std::simd<T, Abi> (since C++26)

std::simd<T, Abi> is a portable, type-safe abstraction over platform SIMD registers in <simd>. Write arithmetic on width-agnostic vector types and the compiler maps it to the best available instruction set (SSE2, AVX2, AVX-512, NEON, SVE) — no intrinsics, no #ifdef.

Why std::simd?

Before C++26, portable vectorization required one of:

Approach | Problem
Raw intrinsics (_mm256_add_ps) | Platform-specific, unreadable, error-prone
Auto-vectorization | Unpredictable, breaks silently with aliasing
Highway / xsimd / VCL | Good but non-standard, divergent APIs
OpenMP SIMD pragmas | Portable but limited expressiveness

std::simd standardizes the "write once, vectorize everywhere" model: the same source compiles to SSE2 on x86-baseline, AVX2 with -mavx2, NEON on ARM, and SVE on AArch64.
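
As a rough illustration of the write-once model, the sketch below updates an array without naming a register width anywhere. It assumes the Parallelism TS v2 implementation in <experimental/simd> (the form libstdc++ ships today); saxpy is a hypothetical helper name used only for this example.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical helper: y[i] += alpha * x[i] for i in [0, n).
// The lane count W is picked by the compiler for the target ISA.
void saxpy(float alpha, const float* x, float* y, std::size_t n) {
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size();

    const V va(alpha);                             // broadcast alpha to every lane
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        V vx(x + i, stdx::element_aligned);
        V vy(y + i, stdx::element_aligned);
        vy += va * vx;                             // lane-wise multiply and add
        vy.copy_to(y + i, stdx::element_aligned);
    }
    for (; i < n; ++i) y[i] += alpha * x[i];       // scalar tail
}
```

The same function body should build unchanged for SSE2, AVX2, or NEON targets; only the value of W differs between builds.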

Basic Types

cpp
#include <experimental/simd>
namespace stdx = std::experimental;  // Parallelism TS v2 names; the final C++26 spelling may differ

// Fixed-width: always 4 floats regardless of hardware
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v4;

// Native width: optimal register width for the target
stdx::native_simd<float> vn;   // 4 floats on SSE2, 8 on AVX, 16 on AVX-512

// Common type aliases
using float4  = stdx::simd<float,  stdx::simd_abi::fixed_size<4>>;
using double2 = stdx::simd<double, stdx::simd_abi::fixed_size<2>>;
using intv    = stdx::simd<int,    stdx::simd_abi::native<int>>;  // lane count depends on the target

Construction and Load

cpp
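#include <experimental/simd>
namespace stdx = std::experimental;

void example() {
    using V = stdx::native_simd<float>;

    // Broadcast scalar
    V v1(3.14f);         // all lanes = 3.14

    // From generator
    stdx::native_simd<int> v2([](int i) { return i * 2; });  // 0, 2, 4, 6, ...

    // Load from contiguous memory: the buffer must hold at least V::size() elements
    alignas(stdx::memory_alignment_v<V>) float buf[V::size()] = {};
    V v3(buf, stdx::vector_aligned);   // aligned load: buf must meet memory_alignment_v<V>
    V v4(buf, stdx::element_aligned);  // load that only requires alignof(float)
}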
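
Loads have a matching store, copy_to. The sketch below combines the broadcast and generator constructors with a store back to ordinary memory. It assumes the <experimental/simd> spelling; make_ramp is a hypothetical helper introduced for illustration.

```cpp
#include <experimental/simd>
#include <array>

namespace stdx = std::experimental;

// Hypothetical helper: returns {start, start+step, start+2*step, start+3*step}.
std::array<int, 4> make_ramp(int start, int step) {
    using V = stdx::fixed_size_simd<int, 4>;

    V lane([](int i) { return i; });      // generator: 0, 1, 2, 3
    V ramp = V(start) + lane * V(step);   // broadcast constructors + lane-wise math

    std::array<int, 4> out{};
    ramp.copy_to(out.data(), stdx::element_aligned);  // store all four lanes
    return out;
}
```

A fixed lane count is used here so the result fits a std::array of known size; with a native-width type the output buffer would need V::size() elements.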

Arithmetic

std::simd overloads all arithmetic operators — they operate lane-wise:

cpp
#include <experimental/simd>
namespace stdx = std::experimental;

void vector_ops() {
    stdx::native_simd<float> a(1.f), b(2.f), c(3.f);

    auto sum  = a + b;          // lane-wise add
    auto prod = a * b + c;      // may be contracted to an FMA, depending on target and flags
    auto neg  = -a;             // negate
    auto abs_ = stdx::abs(a);   // <cmath>-style functions are overloaded for simd

    // Horizontal reduction
    float total = stdx::reduce(a + b);   // sum all lanes

    // Comparison → mask
    auto mask = a < b;   // stdx::native_simd_mask<float>
}
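
To make the lane-wise behavior concrete, here is a small sketch that evaluates a polynomial on four inputs at once and then reduces the results. It assumes the <experimental/simd> spelling; eval_poly and sum_poly are hypothetical helper names, and the polynomial is arbitrary.

```cpp
#include <experimental/simd>
#include <array>

namespace stdx = std::experimental;

// Hypothetical helper: evaluates p(x) = 2x^2 + 3x + 1 on four inputs at once.
std::array<float, 4> eval_poly(const std::array<float, 4>& xs) {
    using V = stdx::fixed_size_simd<float, 4>;

    V x(xs.data(), stdx::element_aligned);
    V p = (V(2.f) * x + V(3.f)) * x + V(1.f);   // Horner form, all lanes in parallel

    std::array<float, 4> out{};
    p.copy_to(out.data(), stdx::element_aligned);
    return out;
}

// Hypothetical helper: sums the four results with a horizontal reduction.
float sum_poly(const std::array<float, 4>& xs) {
    using V = stdx::fixed_size_simd<float, 4>;
    const auto ys = eval_poly(xs);
    return stdx::reduce(V(ys.data(), stdx::element_aligned));
}
```

For the inputs {0, 1, 2, 3} the lanes hold {1, 6, 15, 28} and the reduction yields 50.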

Masked Operations

Masking is how SIMD handles conditional logic without branches:

cpp
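#include <experimental/simd>
namespace stdx = std::experimental;

void masked_example() {
    using V = stdx::fixed_size_simd<float, 4>;   // fixed width so the lanes can be listed

    // simd has no initializer-list constructor; load the values from memory
    const float in[4] = {1.f, -2.f, 3.f, -4.f};
    V a(in, stdx::element_aligned);
    V zeros(0.f);

    // Mask: which lanes satisfy a > 0?
    auto positive = a > zeros;   // simd_mask: true, false, true, false

    // Conditional select: where() takes (mask, value), so start from the
    // "false" value and assign the "true" value through the mask
    V clamped = zeros;
    stdx::where(positive, clamped) = a;
    // result: {1, 0, 3, 0}

    // Masked store: only lanes where the mask is true are written
    float out[4] = {};
    stdx::where(positive, a).copy_to(out, stdx::element_aligned);
    // out[0]=1, out[1]=unchanged, out[2]=3, out[3]=unchanged
}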
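
The same masking idea scales to whole arrays. The following sketch zeroes negative elements in place without a per-lane branch in the vector loop. It assumes the <experimental/simd> spelling; relu_in_place is a hypothetical helper name.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical helper: replaces every negative element of data[0..n) with 0.
void relu_in_place(float* data, std::size_t n) {
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size();

    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        V v(data + i, stdx::element_aligned);
        stdx::where(v < V(0.f), v) = V(0.f);        // overwrite only the negative lanes
        v.copy_to(data + i, stdx::element_aligned);
    }
    for (; i < n; ++i)                              // scalar tail
        if (data[i] < 0.f) data[i] = 0.f;
}
```

The tail still uses an ordinary if because it runs at most W - 1 times; a masked load and store of a partial vector is another option.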

Real-World: Vectorized Dot Product

cpp
#include <experimental/simd>
#include <cstddef>
#include <span>
namespace stdx = std::experimental;

// Precondition: a.size() == b.size()
float dot_product(std::span<const float> a, std::span<const float> b) {
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size();   // lane count (hardware-optimal)

    V acc(0.f);
    std::size_t i = 0;

    // Main SIMD loop: W products accumulated per iteration
    for (; i + W <= a.size(); i += W) {
        V va(a.data() + i, stdx::element_aligned);
        V vb(b.data() + i, stdx::element_aligned);
        acc += va * vb;
    }

    // Horizontal reduce: sum the W partial sums
    float result = stdx::reduce(acc);

    // Scalar tail: the remaining a.size() % W elements
    for (; i < a.size(); ++i) result += a[i] * b[i];

    return result;
}

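What this compiles to depends on the target and the compiler flags. On x86 the loop body is typically a packed multiply and add (mulps/addps, or a fused vfmadd when FMA is enabled), followed by a shuffle-and-add sequence for the final reduction. On AArch64 the loop typically uses fmla, with a pairwise add such as faddp for the reduction. Inspect the generated assembly to confirm what your build produces.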

simd_mask

cpp
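#include <experimental/simd>
namespace stdx = std::experimental;

void mask_ops() {
    using V = stdx::fixed_size_simd<int, 4>;   // fixed width so the lanes can be listed

    const int in[4] = {1, -2, 3, -4};
    V v(in, stdx::element_aligned);
    auto neg_mask = v < V(0);  // {false, true, false, true}

    // Logical ops on masks
    auto pos_mask = !neg_mask;

    // Count matching lanes (the simd overload, not std::popcount from <bit>)
    int n = stdx::popcount(neg_mask);   // 2

    // any/all/none (the simd overloads, not the <algorithm> functions)
    bool any_neg = stdx::any_of(neg_mask);   // true
    bool all_pos = stdx::all_of(pos_mask);   // false
}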
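
Mask reductions are most useful inside a loop over real data. The sketch below counts negative values in an array by accumulating popcount per vector. It assumes the <experimental/simd> spelling; count_negative is a hypothetical helper name.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Hypothetical helper: number of elements in data[0..n) that are below zero.
std::size_t count_negative(const int* data, std::size_t n) {
    using V = stdx::native_simd<int>;
    constexpr std::size_t W = V::size();

    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        V v(data + i, stdx::element_aligned);
        // popcount returns how many lanes of the mask are true
        count += static_cast<std::size_t>(stdx::popcount(v < V(0)));
    }
    for (; i < n; ++i)                  // scalar tail
        if (data[i] < 0) ++count;
    return count;
}
```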

Gather and Scatter

cpp
#include <experimental/simd>
namespace stdx = std::experimental;

// Precondition: indices points to 4 valid offsets into base
void gather_scatter(float* base, int* indices) {
    using V = stdx::fixed_size_simd<float, 4>;
    using I = stdx::fixed_size_simd<int, 4>;

    I idx(indices, stdx::element_aligned);

    // Gather: load non-contiguous elements, one lane at a time.
    // The generator constructor is explicit, so use direct initialization.
    V gathered([&](int i) { return base[idx[i]]; });

    // Scatter: store non-contiguous
    for (int i = 0; i < 4; ++i) base[idx[i]] = gathered[i];
}

ABI Tags and Portability

cpp
#include <experimental/simd>
namespace stdx = std::experimental;
namespace abi = stdx::simd_abi;

// Fixed-size: exact lane count, portable across platforms
stdx::simd<float, abi::fixed_size<4>> v4;   // always 4 floats

// Native: best width for target ISA
stdx::simd<float, abi::native<float>> vn;  // 4 on SSE2, 8 on AVX/AVX2

// Scalar fallback (always width 1)
stdx::simd<float, abi::scalar> vs;

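Use fixed_size<N> when an algorithm is designed around a specific lane count. Use the native ABI for maximum throughput; in std::experimental the default ABI of simd<T> is simd_abi::compatible<T>, so name native_simd<T> or simd_abi::native<T> explicitly.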
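
Because the lane count of each ABI is a compile-time constant, the differences between the tags can be checked statically. The sketch below assumes the <experimental/simd> spelling; full_chunks is a hypothetical helper name.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;
namespace abi  = stdx::simd_abi;

// Lane counts are compile-time constants, so they can be verified statically.
static_assert(stdx::simd<float, abi::fixed_size<4>>::size() == 4);
static_assert(stdx::simd<float, abi::scalar>::size() == 1);
static_assert(stdx::native_simd<float>::size() >= 1);   // target-dependent

// Hypothetical helper: number of whole native-width vectors in n floats.
constexpr std::size_t full_chunks(std::size_t n) {
    return n / stdx::native_simd<float>::size();
}
```

Only the fixed_size and scalar assertions pin an exact number; the native lane count is deliberately left open because it changes with the compile flags.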

Compiler Support

Compiler | Status (2026)
GCC 13+ | std::experimental::simd (libstdc++)
Clang 17+ | Partial via std::experimental
MSVC | In progress

Until your toolchain ships the <simd> header, include <experimental/simd> and alias namespace stdx = std::experimental.

Key Rules

  • std::simd<T>::size() is a compile-time constant but depends on the target ISA; don't hard-code it
  • Prefer aligned loads (vector_aligned) when the buffer is known to meet the vector alignment; use element_aligned otherwise, since vector_aligned on a misaligned address is undefined behavior
  • Masked, branch-free code is often faster than per-element branching when lanes diverge; prefer masks to per-lane if statements inside SIMD loops, and measure
  • The where() helper replaces blend/select intrinsics; on x86 it typically lowers to a blend (e.g. VBLENDVPS) or, with AVX-512, a mask-register move
  • simd objects are ordinary values; whether one stays in a vector register across a function call depends on inlining and on the platform calling convention
  • fixed_size<N> gives the same lane count on every target; native<T> lets the compiler choose the width for peak throughput
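
The alignment rule above can be sketched as follows: the buffer is declared with the alignment that vector_aligned expects, so both loads are valid. This assumes the <experimental/simd> spelling; AlignedBlock and sum_block are hypothetical names introduced for the example.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

using V = stdx::native_simd<float>;

// Hypothetical buffer type: storage aligned for vector_aligned loads,
// sized to hold exactly two native vectors.
struct AlignedBlock {
    alignas(stdx::memory_alignment_v<V>) float data[2 * V::size()];
};

// The second load starts one vector further in; this guard checks that the
// offset keeps the required alignment.
static_assert((V::size() * sizeof(float)) % stdx::memory_alignment_v<V> == 0);

// Sums every element of the block using two aligned loads.
float sum_block(const AlignedBlock& b) {
    V lo(b.data, stdx::vector_aligned);
    V hi(b.data + V::size(), stdx::vector_aligned);
    return stdx::reduce(lo + hi);
}
```

If the data came from an arbitrary pointer instead, element_aligned would be the safe choice, at a possible cost in load throughput on some targets.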