Skip to content
C++
Library
since C++17
Intermediate

Parallel Algorithms (C++17)

C++17 execution policies for STL algorithms — par, par_unseq, reduce, transform_reduce, scan, thread safety, and platform requirements.

Parallel Algorithmssince C++17

C++17 adds execution policy overloads to 77 STL algorithms, allowing sequential, parallel, or parallel-plus-vectorized execution by passing std::execution::seq, std::execution::par, or std::execution::par_unseq as the first argument.

Overview

Before C++17, parallelizing STL algorithms required manual thread management or third-party libraries. C++17 standardizes execution policies as an orthogonal axis: the same algorithm, the same iterators, a different first argument. The implementation is free to use a thread pool, SIMD instructions, or both.

C++20 adds a fourth policy: std::execution::unseq — single-threaded vectorization without parallelism.

The policies are distinct types, not an enum, which allows overload resolution and if constexpr branching. All four live in <execution>.

cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

std::vector<double> v(5'000'000);
std::iota(v.begin(), v.end(), 0.0);

std::sort(std::execution::seq,        v.begin(), v.end()); // C++17 — sequential
std::sort(std::execution::par,        v.begin(), v.end()); // C++17 — thread pool
std::sort(std::execution::par_unseq,  v.begin(), v.end()); // C++17 — threads + SIMD
std::sort(std::execution::unseq,      v.begin(), v.end()); // C++20 — SIMD only

Syntax

Execution policy types

cpp
namespace std::execution {
    // C++17
    inline constexpr sequenced_policy            seq{};
    inline constexpr parallel_policy             par{};
    inline constexpr parallel_unsequenced_policy par_unseq{};

    // C++20
    inline constexpr unsequenced_policy          unseq{};
}

par guarantees that element operations may execute on separate threads but are sequenced within each thread. This means you can use mutexes (carefully), but shared mutable state without synchronization is a data race.

par_unseq additionally permits SIMD vectorization: operations on multiple elements may be interleaved within a single thread. This forbids acquiring locks (a vectorized operation may start on one SIMD lane before a prior lane has released), throwing exceptions, and calling memory-allocation functions.

The 77 parallel algorithm overloads

Every algorithm that accepts parallel policies follows the same pattern — insert the policy before the iterator arguments:

cpp
// seq/par/par_unseq overloads exist for all of these (C++17):
std::for_each          std::for_each_n
std::transform         std::copy         std::copy_if       std::copy_n
std::fill              std::fill_n       std::generate      std::generate_n
std::replace           std::replace_if   std::replace_copy  std::replace_copy_if
std::remove            std::remove_if    std::remove_copy   std::remove_copy_if
std::sort              std::stable_sort  std::partial_sort  std::partial_sort_copy
std::nth_element       std::merge        std::inplace_merge std::partition
std::stable_partition  std::partition_copy
std::find              std::find_if      std::find_if_not   std::find_end
std::find_first_of     std::search       std::search_n      std::adjacent_find
std::count             std::count_if
std::all_of            std::any_of       std::none_of
std::equal             std::mismatch     std::lexicographical_compare
std::unique            std::unique_copy  std::reverse       std::reverse_copy
std::rotate            std::rotate_copy  std::is_sorted     std::is_sorted_until
std::is_partitioned    std::is_permutation
std::min_element       std::max_element  std::minmax_element
std::reduce            std::transform_reduce
std::inclusive_scan    std::exclusive_scan
std::transform_inclusive_scan            std::transform_exclusive_scan
std::adjacent_difference
std::uninitialized_copy std::uninitialized_copy_n
std::uninitialized_fill std::uninitialized_fill_n
std::uninitialized_move std::uninitialized_move_n
std::destroy            std::destroy_n

Examples

std::reduce and std::transform_reduce

std::reduce (C++17, <numeric>) is the parallel-safe counterpart to std::accumulate. Unlike accumulate, it does not guarantee left-to-right evaluation — the operation must be associative and commutative for correct results. This matters for floating-point: you will get different answers depending on how the runtime partitions the work.

cpp
#include <numeric>
#include <execution>
#include <vector>

std::vector<double> v(1'000'000);
// populate ...

// accumulate — sequential, strictly left-to-right (C++98)
double s1 = std::accumulate(v.begin(), v.end(), 0.0);

// reduce — reordering permitted, parallel-friendly (C++17)
double s2 = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
// s1 == s2 for integers; may differ for doubles due to fp reordering

// Parallel dot product via transform_reduce (C++17)
std::vector<double> a(1'000'000), b(1'000'000);
double dot = std::transform_reduce(
    std::execution::par,
    a.begin(), a.end(),
    b.begin(),
    0.0                    // identity for addition
    // default ops: multiply then add
);

Custom reduction with transform_reduce

cpp
struct Stats {
    double sum    = 0.0;
    double sum_sq = 0.0;
    long   count  = 0;
};

// Single pass, parallel, computes mean and stddev simultaneously
Stats result = std::transform_reduce(
    std::execution::par,
    data.begin(), data.end(),
    Stats{},
    // Binary reduce: merge two partial Stats
    [](Stats a, const Stats& b) noexcept -> Stats {
        return {a.sum + b.sum, a.sum_sq + b.sum_sq, a.count + b.count};
    },
    // Unary transform: element → Stats
    [](double x) noexcept -> Stats {
        return {x, x * x, 1};
    }
);

double mean   = result.sum / result.count;
double stddev = std::sqrt(result.sum_sq / result.count - mean * mean);

Parallel prefix scan

std::inclusive_scan and std::exclusive_scan (C++17) compute prefix sums and generalizations. Unlike std::partial_sum (C++98), they accept execution policies and relax left-to-right ordering:

cpp
std::vector<int> input = {1, 2, 3, 4, 5};
std::vector<int> output(input.size());

// [1, 3, 6, 10, 15]
std::inclusive_scan(std::execution::par,
    input.begin(), input.end(), output.begin());

// [0, 1, 3, 6, 10]
std::exclusive_scan(std::execution::par,
    input.begin(), input.end(), output.begin(), 0);

Parallel image processing

cpp
struct Pixel { uint8_t r, g, b, a; };

std::vector<Pixel> image(width * height);
const float inv_gamma = 1.0f / 2.2f;

// par_unseq is safe here: pure arithmetic, no locks, no exceptions
std::transform(std::execution::par_unseq,
    image.begin(), image.end(), image.begin(),
    [inv_gamma](Pixel p) noexcept -> Pixel {
        auto gc = [inv_gamma](uint8_t c) noexcept -> uint8_t {
            return static_cast<uint8_t>(
                std::pow(c * (1.0f / 255.0f), inv_gamma) * 255.0f + 0.5f);
        };
        return {gc(p.r), gc(p.g), gc(p.b), p.a};
    });

Best Practices

Use reduce instead of accumulate for parallel sums. std::accumulate has no execution policy overload — it is always sequential. Switching to std::reduce is the correct path to parallelism.

Benchmark before committing to parallelism. Thread pool overhead (task scheduling, cache thrashing, false sharing) commonly dominates for small inputs. Rough empirical thresholds on modern desktop hardware:

AlgorithmSequential wins belowNotes
sort~50K elementsThread overhead per partition chunk
transform~500K elementsOnly when work/element > ~10ns
reduce~100K elementsMemory bandwidth saturates early
for_each~10K if expensiveDepends entirely on functor cost

Prefer par_unseq for pure numeric kernels. When your functor is arithmetic with no side effects, par_unseq enables SIMD vectorization on top of threading — often a 4–8× speedup over par alone on AVX2 hardware.

Keep functors small and cheap to copy. The runtime may copy functors across threads. Closures capturing large objects by value have hidden cost; capture by reference is fine for par (not par_unseq if the object has locking internals).

Common Pitfalls

Data race with shared mutable state under par

cpp
// WRONG — total += x is a data race; undefined behavior
int total = 0;
std::for_each(std::execution::par, v.begin(), v.end(),
    [&total](int x) { total += x; });

// CORRECT — use the dedicated reduction algorithm
int total = std::reduce(std::execution::par, v.begin(), v.end(), 0);

// CORRECT — atomic if you genuinely need for_each side effects
std::atomic<int> total{0};
std::for_each(std::execution::par, v.begin(), v.end(),
    [&total](int x) { total.fetch_add(x, std::memory_order_relaxed); });

Locking inside par_unseq

SIMD execution interleaves multiple elements' operations within a single thread. Acquiring a mutex inside a vectorized operation can deadlock because the lock-acquire for element N may be issued on a SIMD lane before the lock-release for element M (processed earlier in program order) has fired:

cpp
// WRONG — deadlock risk with par_unseq + mutex
std::mutex mtx;
std::for_each(std::execution::par_unseq, v.begin(), v.end(),
    [&](int& x) {
        std::lock_guard lock(mtx); // UB / deadlock under vectorization
        x *= 2;
    });

// CORRECT — no shared state needed here
std::transform(std::execution::par_unseq,
    v.begin(), v.end(), v.begin(),
    [](int x) noexcept { return x * 2; });

Exceptions inside par_unseq

Under par_unseq, throwing an exception from a functor is undefined behavior. Under par, the implementation calls std::terminate. If your functor can throw, use seq or restructure the computation:

cpp
// WRONG — exception under par_unseq is UB; under par calls std::terminate
std::transform(std::execution::par_unseq, v.begin(), v.end(), out.begin(),
    [](int x) -> int {
        if (x < 0) throw std::range_error{"negative"};
        return x * 2;
    });

// CORRECT — encode the error as a sentinel value, check afterward
std::transform(std::execution::par_unseq, v.begin(), v.end(), out.begin(),
    [](int x) noexcept -> int { return x < 0 ? INT_MIN : x * 2; });

Floating-point nondeterminism with par

std::reduce with par may produce different results on different runs due to nondeterministic operation ordering. This is not a bug — it is the documented behavior. For reproducible floating-point results, use std::accumulate (always sequential).

Platform Support

As of 2025, platform support is uneven:

bash
# GCC (9.1+) — requires Intel TBB for par/par_unseq policies
g++ -std=c++17 -O3 source.cpp -ltbb

# CMake
find_package(TBB REQUIRED)
target_link_libraries(myapp PRIVATE TBB::tbb)

# MSVC (VS 2017+) — built-in support, no extra library
cl /std:c++17 /O2 source.cpp

# Clang (15+) — requires PSTL or libc++ with TBB backend
clang++ -std=c++17 -O3 source.cpp -ltbb

GCC links the PSTL (Parallel STL) frontend against TBB at runtime. If TBB is not present, par and par_unseq silently fall back to sequential execution on some implementations — always confirm your build links -ltbb and the TBB runtime is available.

See Also

  • std::accumulate — sequential predecessor to std::reduce
  • std::transform — element-wise transformation
  • std::sort — sorting with optional execution policy
  • Ranges (C++20) — ranges algorithms do not yet accept execution policies as of C++23
  • Coroutines — cooperative concurrency, complementary to data parallelism