Skip to content
C++
Library
since C++17
Intermediate

Execution Policies

Tag types and global constants that control whether STL algorithms run sequentially, in parallel, or with SIMD vectorization — introduced in C++17.

Execution Policysince C++17

An execution policy is a tag type passed as the first argument to a parallelism-enabled STL algorithm to specify whether execution must be sequential, may use multiple threads, or may additionally interleave instructions across SIMD lanes.

Overview

Prior to C++17, every STL algorithm executed sequentially on the calling thread. C++17 introduced a parallel algorithm overload family: pass an execution policy as the first argument and the algorithm may exploit hardware parallelism. The policy is not a hint — it constrains the guarantees the implementation must uphold and the guarantees the caller must provide.

The <execution> header defines four policy types and one global constant of each type:

TypeGlobal constantThread modelSIMDSince
std::execution::sequenced_policystd::execution::seqsingle threadnoC++17
std::execution::parallel_policystd::execution::parmultiple threadsnoC++17
std::execution::parallel_unsequenced_policystd::execution::par_unseqmultiple threadsyesC++17
std::execution::unsequenced_policystd::execution::unseqsingle threadyesC++20

Sequenced (seq) — identical to the classic non-policy overload. Operations on element i happen-before operations on element i+1. Useful to enforce sequential semantics explicitly, or when benchmarking against the parallel variants.

Parallel (par) — the implementation may distribute loop iterations across an unspecified number of threads. Each individual element operation executes without interleaving with other element operations, so per-element work is effectively serialized within itself. The caller is responsible for ensuring no data races across elements and no deadlocks (e.g., no mutex that the algorithm's functor also tries to acquire on the same thread).

Parallel-unsequenced (par_unseq) — the implementation may additionally vectorize: a single thread may interleave instructions from multiple element operations within a single SIMD execution unit. This means element operations may overlap even on one thread. Consequently, the functor must not acquire mutexes, call new/delete, or perform any operation that is not safe to re-enter or interleave.

Unsequenced (unseq, C++20) — SIMD vectorization on a single thread. Like par_unseq but restricted to one thread, which makes it safe when the algorithm itself is already inside a parallel region managed by the caller.

C++17 specifies 77 standard algorithms that accept an execution policy overload. These include std::sort, std::transform, std::reduce, std::for_each, std::find, std::count_if, std::copy_if, and many others. std::accumulate is notably absent; its parallel equivalent is std::reduce.

Syntax

cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

std::sort(std::execution::par_unseq, v.begin(), v.end());   // C++17
std::reduce(std::execution::par, v.begin(), v.end(), 0);    // C++17
std::for_each(std::execution::unseq, v.begin(), v.end(), f); // C++20

The policy argument must be the first argument and must satisfy std::is_execution_policy_v<T>. Passing a policy to an algorithm that has no parallel overload is a compile error.

Examples

Parallel sort with timing

cpp
#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<double> v(10'000'000);
    std::mt19937_64 rng{42};
    std::uniform_real_distribution<double> dist;
    std::generate(v.begin(), v.end(), [&] { return dist(rng); });

    auto data_par = v;
    auto data_seq = v;

    auto t0 = std::chrono::steady_clock::now();
    std::sort(std::execution::par_unseq, data_par.begin(), data_par.end()); // C++17
    auto t1 = std::chrono::steady_clock::now();
    std::sort(std::execution::seq, data_seq.begin(), data_seq.end());       // C++17
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "par_unseq: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
    std::cout << "seq:       "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}

Transform-reduce pipeline

std::transform_reduce (C++17) is the parallel-friendly substitute for the classic inner-product pattern:

cpp
#include <execution>
#include <numeric>
#include <vector>

double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(
        std::execution::par_unseq,          // C++17
        a.begin(), a.end(),
        b.begin(),
        0.0
    );
}

Correct accumulation into a shared result with par

When using par, concurrent writes to a shared variable are a data race. Use std::reduce (which requires commutativity and associativity) instead of manually accumulating:

cpp
// WRONG — data race: multiple threads write to `total` simultaneously
int total = 0;
std::for_each(std::execution::par, v.begin(), v.end(),
    [&](int x) { total += x; }); // undefined behaviour

// CORRECT — std::reduce handles parallelism internally
int total = std::reduce(std::execution::par, v.begin(), v.end(), 0); // C++17

Per-element mutation without sharing

par is safe when each functor invocation touches only its own element:

cpp
std::for_each(std::execution::par, v.begin(), v.end(),
    [](double& x) { x = std::sqrt(x); }); // safe: no shared state

Best Practices

Profile before committing to a policy. Parallel dispatch has non-trivial overhead (thread-pool wakeup, work stealing, SIMD lane alignment). For collections under roughly 10,000 elements, seq or no policy often wins. Measure with representative data.

Prefer par_unseq over par for pure transformations. When your functor has no locks and no heap allocation, par_unseq gives the implementation maximum freedom to vectorize, often yielding the best throughput.

Use std::reduce instead of std::accumulate when parallelizing sums. std::accumulate has no parallel overload. std::reduce requires the binary operation to be commutative and associative — floating-point addition technically violates strict associativity, but the difference in summation order is usually acceptable for practical workloads.

Reserve unseq (C++20) for code already running inside a parallel region. If you are inside an OpenMP parallel section or a thread-pool task and want vectorization without spawning additional threads, unseq is the right tool.

Check compiler and standard-library support. GCC links against Intel oneTBB or similar for par. MSVC ships its own parallel backend. Clang on macOS historically required explicit linking. Confirm <execution> is available before relying on it in portable code.

Common Pitfalls

Mutex acquisition inside a par_unseq functor causes undefined behaviour. The policy explicitly permits instruction interleaving within a single thread; trying to acquire a mutex from an interleaved context will deadlock or corrupt the mutex state. Restrict par_unseq functors to lock-free, allocation-free code.

Non-commutative binary operations with std::reduce produce indeterminate results. Unlike std::accumulate, std::reduce may reorder and group partial sums arbitrarily. Subtraction, string concatenation, and matrix multiplication are not safe binary operations for std::reduce without careful design.

Capturing mutable shared state in a par lambda without synchronization. A lambda that increments a shared counter without atomics or a mutex introduces a data race regardless of which policy is used — the policy controls scheduling, not synchronization.

Forgetting that par does not guarantee progress on the calling thread. The calling thread may or may not participate in the work. Waiting for a condition variable in the functor while holding a lock that the calling thread would otherwise release is a deadlock pattern specific to parallel execution.

Treating small workloads as trivially parallelizable. Thread-pool overhead can make par measurably slower than seq for small vectors. The crossover point depends on the per-element work; a square-root is cheap enough that you often need millions of elements before par pays off.

See Also

  • std::reduce — parallel-safe accumulation requiring commutativity and associativity (C++17)
  • std::transform_reduce — fused parallel map-reduce pipeline (C++17)
  • std::for_each parallel overload — parallel application of a side-effect functor (C++17)
  • std::is_execution_policy — trait to detect execution policy types (C++17)
  • reference/library/concurrency/execution — lower-level executor and scheduling primitives