std::simd — Portable SIMD (C++26)
std::simd<T, Abi> (since C++26)
std::simd<T, Abi> is a portable, type-safe abstraction over platform SIMD registers, standardized for C++26 by P1928 and declared in <simd>. You write arithmetic on width-agnostic vector types, and the compiler maps it to the instruction set selected at compile time (SSE2, AVX2, AVX-512, NEON, SVE), with no intrinsics and no #ifdef.
Why std::simd?
Before C++26, portable vectorization required one of:
| Approach | Problem |
|---|---|
| Raw intrinsics (_mm256_add_ps) | Platform-specific, unreadable, error-prone |
| Auto-vectorization | Unpredictable, breaks silently with aliasing |
| Highway / xsimd / VCL | Good but non-standard, divergent APIs |
| OpenMP SIMD pragmas | Portable but limited expressiveness |
std::simd standardizes the "write once, vectorize everywhere" model: the same source compiles to SSE2 on baseline x86-64, AVX2 with -mavx2, NEON on ARM, and SVE on AArch64 targets that enable it.
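As a minimal sketch of that model, the loop below never mentions a register width. It assumes the experimental implementation in <experimental/simd> (as shipped by libstdc++) rather than the final <simd> header, and the function name scale_in_place is hypothetical.

```cpp
#include <cstddef>
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;

// Hypothetical helper: multiplies every element of `data` by `factor`.
void scale_in_place(std::vector<float>& data, float factor) {
    using V = stdx::native_simd<float>;   // lane count is chosen by the target ISA
    constexpr std::size_t W = V::size();  // known at compile time, differs per target

    std::size_t i = 0;
    for (; i + W <= data.size(); i += W) {
        V v(data.data() + i, stdx::element_aligned);        // load W floats
        v *= V(factor);                                     // lane-wise multiply by a broadcast value
        v.copy_to(data.data() + i, stdx::element_aligned);  // store W floats
    }
    for (; i < data.size(); ++i) data[i] *= factor;         // scalar tail
}
```

Rebuilding the same file with a different target flag changes W and the generated instructions, not the source.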
Basic Types
#include <experimental/simd>
namespace stdx = std::experimental; // experimental (TS 2) implementation; the C++26 <simd> header may spell some names differently
// Fixed-width: always 4 floats regardless of hardware
stdx::simd<float, stdx::simd_abi::fixed_size<4>> v4;
// Native width: optimal register width for the target
stdx::native_simd<float> vn; // 4 floats on SSE2, 8 on AVX, 16 on AVX-512
// Common type aliases
using float4 = stdx::simd<float, stdx::simd_abi::fixed_size<4>>;
using double2 = stdx::simd<double, stdx::simd_abi::fixed_size<2>>;
using intv = stdx::simd<int, stdx::simd_abi::native<int>>; // lane count depends on the target
Construction and Load
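#include <experimental/simd>
namespace stdx = std::experimental;
void example() {
    using V = stdx::native_simd<float>;
    // Broadcast scalar
    V v1(3.14f); // all lanes = 3.14
    // From generator
    stdx::native_simd<int> v2([](int i) { return i * 2; }); // 0, 2, 4, 6, ...
    // Load from contiguous memory (buffer sized and aligned for one full vector)
    alignas(stdx::memory_alignment_v<V>) float buf[V::size()] = {};
    V v3(buf, stdx::vector_aligned); // aligned load: buf must satisfy memory_alignment_v<V>
    V v4(buf, stdx::element_aligned); // unaligned load: only alignof(float) is assumed
}
Arithmetic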
std::simd overloads all arithmetic operators — they operate lane-wise:
#include <experimental/simd>
namespace stdx = std::experimental;
void vector_ops() {
    stdx::native_simd<float> a(1.f), b(2.f), c(3.f);
    auto sum = a + b; // lane-wise add
    auto prod = a * b + c; // may be contracted into an FMA, depending on target and flags
    auto neg = -a; // negate
    auto abs_ = stdx::abs(a); // math functions are overloaded for simd types
    // Horizontal reduction
    float total = stdx::reduce(a + b); // sum all lanes
    // Comparison → mask
    auto mask = a < b; // stdx::native_simd_mask<float>
}
Masked Operations
Masking is how SIMD handles conditional logic without branches:
#include <experimental/simd>
namespace stdx = std::experimental;
void masked_example() {
    using V = stdx::fixed_size_simd<float, 4>;
    const float in[4] = {1.f, -2.f, 3.f, -4.f};
    V a(in, stdx::element_aligned);
    V zeros(0.f);
    // Mask: which lanes satisfy a > 0?
    auto positive = a > zeros; // simd_mask: true, false, true, false
    // Conditional select: start from a, then overwrite the lanes where the mask is false
    V clamped = a;
    stdx::where(!positive, clamped) = zeros;
    // result: {1, 0, 3, 0}
    // Masked store: only lanes where the mask is true are written
    float out[4] = {};
    stdx::where(positive, a).copy_to(out, stdx::element_aligned);
    // out[0]=1, out[1]=unchanged, out[2]=3, out[3]=unchanged
}
Real-World: Vectorized Dot Product
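#include <experimental/simd>
#include <cstddef>
#include <span>
namespace stdx = std::experimental;
// Precondition: a.size() == b.size()
float dot_product(std::span<const float> a, std::span<const float> b) {
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size(); // lane count, fixed at compile time for the target
    V acc(0.f);
    std::size_t i = 0;
    // Main SIMD loop
    for (; i + W <= a.size(); i += W) {
        V va(a.data() + i, stdx::element_aligned);
        V vb(b.data() + i, stdx::element_aligned);
        acc += va * vb;
    }
    // Horizontal reduce
    float result = stdx::reduce(acc);
    // Scalar tail
    for (; i < a.size(); ++i) result += a[i] * b[i];
    return result;
}
The loop body typically compiles to a vector multiply and add, or a fused multiply-add where the target and flags allow it (for example mulps/addps on SSE, vfmadd231ps on AVX2 with FMA, fmla on AArch64 NEON), followed by a single horizontal reduction after the loop. The exact instructions depend on the compiler and optimization flags.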
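The scalar tail can also be folded into the vector path with a masked load. The following is a sketch under assumptions, not the canonical form: it uses the experimental header, takes std::vector instead of std::span so that it builds as C++17, and the name dot_product_masked_tail is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;

// Hypothetical variant of dot_product that handles leftover elements with a masked load.
float dot_product_masked_tail(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    using V = stdx::native_simd<float>;
    constexpr std::size_t W = V::size();

    V acc(0.f);
    std::size_t i = 0;
    for (; i + W <= a.size(); i += W) {
        V va(a.data() + i, stdx::element_aligned);
        V vb(b.data() + i, stdx::element_aligned);
        acc += va * vb;
    }

    if (const std::size_t rem = a.size() - i; rem > 0) {
        // Lane indices 0, 1, 2, ... compared against the number of leftover elements.
        const V lane([](std::size_t k) { return static_cast<float>(k); });
        const auto active = lane < V(static_cast<float>(rem));

        V va(0.f), vb(0.f);  // inactive lanes stay zero and add nothing to the sum
        stdx::where(active, va).copy_from(a.data() + i, stdx::element_aligned);
        stdx::where(active, vb).copy_from(b.data() + i, stdx::element_aligned);
        acc += va * vb;
    }
    return stdx::reduce(acc);
}
```

Only the lanes selected by the mask are read from memory, so the load does not run past the end of the input. Whether this beats a scalar tail depends on the target; on short inputs the difference is usually small.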
simd_mask
#include <experimental/simd>
namespace stdx = std::experimental;
void mask_ops() {
    using V = stdx::fixed_size_simd<int, 4>;
    const int in[4] = {1, -2, 3, -4};
    V v(in, stdx::element_aligned);
    auto neg_mask = v < V(0); // {false, true, false, true}
    // Logical ops on masks
    auto pos_mask = !neg_mask;
    // Count matching lanes
    int n = stdx::popcount(neg_mask); // 2
    // any/all/none
    bool any_neg = stdx::any_of(neg_mask); // true
    bool all_pos = stdx::all_of(pos_mask); // false
}
Gather and Scatter
#include <experimental/simd>
namespace stdx = std::experimental;
void gather_scatter(float* base, const int* indices) {
    using V = stdx::fixed_size_simd<float, 4>;
    using I = stdx::fixed_size_simd<int, 4>;
    const I idx(indices, stdx::element_aligned);
    // Gather: load non-contiguous elements through the generator constructor
    const V gathered([&](int i) { return base[idx[i]]; });
    // Scatter: store non-contiguous elements lane by lane
    for (int i = 0; i < 4; ++i) base[idx[i]] = gathered[i];
}
ABI Tags and Portability
#include <experimental/simd>
namespace stdx = std::experimental;
namespace abi = stdx::simd_abi;
// Fixed-size: exact width, portable across platforms
stdx::simd<float, abi::fixed_size<4>> v4; // always 4 float lanes
// Native: best width for target ISA
stdx::simd<float, abi::native<float>> vn; // 4 on SSE2, 8 on AVX2
// Scalar fallback (always width 1)
stdx::simd<float, abi::scalar> vs;
Use fixed_size<N> when you have an algorithm designed around a specific width. Use the native ABI for maximum throughput.
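The lane counts behind each tag can be checked at compile time. This is a small sketch assuming the experimental header; native_lanes is a hypothetical helper, not part of the library.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;
namespace abi = stdx::simd_abi;

// fixed_size<N> and scalar have lane counts that hold on every target.
static_assert(stdx::simd<float, abi::fixed_size<4>>::size() == 4);
static_assert(stdx::simd<float, abi::scalar>::size() == 1);

// Hypothetical helper: reports the lane count the native ABI picked for T.
template <class T>
constexpr std::size_t native_lanes() {
    return stdx::simd<T, abi::native<T>>::size();
}

// The native lane count is a compile-time constant, but its value depends on the target.
static_assert(native_lanes<float>() >= 1);
```

Code that needs a particular width should assert it this way instead of assuming it.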
Compiler Support
| Compiler | Status (2026) |
|---|---|
| GCC 11+ | std::experimental::simd (libstdc++) |
| Clang 17+ | Partial via std::experimental |
| MSVC | In progress |
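Until your standard library ships the C++26 <simd> header, include <experimental/simd> and use namespace stdx = std::experimental. The experimental interface follows the Parallelism TS 2, so some names differ from the final C++26 wording; check your library's documentation before porting.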
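One way to keep a code base building on toolchains without the experimental header is to detect it and fall back to a scalar loop. The sketch below assumes a compiler that supports __has_include; the macro DEMO_HAVE_TS_SIMD and the function add_to_all are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

#if __has_include(<experimental/simd>)
#include <experimental/simd>
#define DEMO_HAVE_TS_SIMD 1  // hypothetical macro, local to this sketch
#else
#define DEMO_HAVE_TS_SIMD 0
#endif

// Hypothetical helper: adds `delta` to every element, vectorized when the header exists.
void add_to_all(std::vector<float>& data, float delta) {
    std::size_t i = 0;
#if DEMO_HAVE_TS_SIMD
    namespace stdx = std::experimental;
    using V = stdx::native_simd<float>;
    for (; i + V::size() <= data.size(); i += V::size()) {
        V v(data.data() + i, stdx::element_aligned);
        v += V(delta);
        v.copy_to(data.data() + i, stdx::element_aligned);
    }
#endif
    for (; i < data.size(); ++i) data[i] += delta;  // scalar fallback and tail
}
```

The scalar loop doubles as the tail handler, so both configurations share one code path for the remainder.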
Key Rules
- std::simd<T> size (::size()) is compile-time but hardware-dependent; don't hard-code it
- Prefer aligned loads (std::vector_aligned) for performance; use std::element_aligned when the alignment of the address is unknown
- Branch-free code with masks is often faster than scalar branching on unpredictable data; avoid if inside SIMD loops
- The where() helper replaces blend/select intrinsics; it typically lowers to blend or masked-move instructions such as VBLENDVPS
- SIMD values stay in registers across a function call only if the callee is inlined or the calling convention passes vectors in registers; otherwise they are spilled to memory
- For algorithmic code, fixed_size<N> is portable; for peak throughput, use native<T> and let the compiler choose the width
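The first two rules can be sketched together: size the storage from V::size() and let the type carry the alignment that vector_aligned requires. This assumes the experimental header; AlignedBlock, sum_aligned, and sum_unaligned are hypothetical names.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;
using V = stdx::native_simd<float>;

// Hypothetical storage type: one vector's worth of floats, sized from V::size()
// instead of a hard-coded constant, and aligned for vector_aligned loads.
struct AlignedBlock {
    alignas(stdx::memory_alignment_v<V>) float data[V::size()];
};

// The alignment is guaranteed by the type, so the stricter flag is safe here.
inline float sum_aligned(const AlignedBlock& block) {
    const V v(block.data, stdx::vector_aligned);
    return stdx::reduce(v);
}

// For an arbitrary pointer only alignof(float) is known, so use element_aligned.
// Precondition: at least V::size() floats are readable at `p`.
inline float sum_unaligned(const float* p) {
    const V v(p, stdx::element_aligned);
    return stdx::reduce(v);
}
```

Passing a pointer that does not meet memory_alignment_v<V> together with vector_aligned is undefined behavior, which is why the aligned variant takes the block type and not a raw pointer.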