Parallel Algorithms (C++17)
C++17 execution policies for STL algorithms — par, par_unseq, reduce, transform_reduce, scan, thread safety, and platform requirements.
Parallel Algorithmssince C++17C++17 adds execution policy overloads to 77 STL algorithms, allowing sequential, parallel, or parallel-plus-vectorized execution by passing std::execution::seq, std::execution::par, or std::execution::par_unseq as the first argument.
Overview
Before C++17, parallelizing STL algorithms required manual thread management or third-party libraries. C++17 standardizes execution policies as an orthogonal axis: the same algorithm, the same iterators, a different first argument. The implementation is free to use a thread pool, SIMD instructions, or both.
C++20 adds a fourth policy: std::execution::unseq — single-threaded vectorization without parallelism.
The policies are distinct types, not an enum, which allows overload resolution and if constexpr branching. All four live in <execution>.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>
std::vector<double> v(5'000'000);
std::iota(v.begin(), v.end(), 0.0);
std::sort(std::execution::seq, v.begin(), v.end()); // C++17 — sequential
std::sort(std::execution::par, v.begin(), v.end()); // C++17 — thread pool
std::sort(std::execution::par_unseq, v.begin(), v.end()); // C++17 — threads + SIMD
std::sort(std::execution::unseq, v.begin(), v.end()); // C++20 — SIMD onlySyntax
Execution policy types
namespace std::execution {
// C++17
inline constexpr sequenced_policy seq{};
inline constexpr parallel_policy par{};
inline constexpr parallel_unsequenced_policy par_unseq{};
// C++20
inline constexpr unsequenced_policy unseq{};
}par guarantees that element operations may execute on separate threads but are sequenced within each thread. This means you can use mutexes (carefully), but shared mutable state without synchronization is a data race.
par_unseq additionally permits SIMD vectorization: operations on multiple elements may be interleaved within a single thread. This forbids acquiring locks (a vectorized operation may start on one SIMD lane before a prior lane has released), throwing exceptions, and calling memory-allocation functions.
The 77 parallel algorithm overloads
Every algorithm that accepts parallel policies follows the same pattern — insert the policy before the iterator arguments:
// seq/par/par_unseq overloads exist for all of these (C++17):
std::for_each std::for_each_n
std::transform std::copy std::copy_if std::copy_n
std::fill std::fill_n std::generate std::generate_n
std::replace std::replace_if std::replace_copy std::replace_copy_if
std::remove std::remove_if std::remove_copy std::remove_copy_if
std::sort std::stable_sort std::partial_sort std::partial_sort_copy
std::nth_element std::merge std::inplace_merge std::partition
std::stable_partition std::partition_copy
std::find std::find_if std::find_if_not std::find_end
std::find_first_of std::search std::search_n std::adjacent_find
std::count std::count_if
std::all_of std::any_of std::none_of
std::equal std::mismatch std::lexicographical_compare
std::unique std::unique_copy std::reverse std::reverse_copy
std::rotate std::rotate_copy std::is_sorted std::is_sorted_until
std::is_partitioned std::is_permutation
std::min_element std::max_element std::minmax_element
std::reduce std::transform_reduce
std::inclusive_scan std::exclusive_scan
std::transform_inclusive_scan std::transform_exclusive_scan
std::adjacent_difference
std::uninitialized_copy std::uninitialized_copy_n
std::uninitialized_fill std::uninitialized_fill_n
std::uninitialized_move std::uninitialized_move_n
std::destroy std::destroy_nExamples
std::reduce and std::transform_reduce
std::reduce (C++17, <numeric>) is the parallel-safe counterpart to std::accumulate. Unlike accumulate, it does not guarantee left-to-right evaluation — the operation must be associative and commutative for correct results. This matters for floating-point: you will get different answers depending on how the runtime partitions the work.
#include <numeric>
#include <execution>
#include <vector>
std::vector<double> v(1'000'000);
// populate ...
// accumulate — sequential, strictly left-to-right (C++98)
double s1 = std::accumulate(v.begin(), v.end(), 0.0);
// reduce — reordering permitted, parallel-friendly (C++17)
double s2 = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
// s1 == s2 for integers; may differ for doubles due to fp reordering
// Parallel dot product via transform_reduce (C++17)
std::vector<double> a(1'000'000), b(1'000'000);
double dot = std::transform_reduce(
std::execution::par,
a.begin(), a.end(),
b.begin(),
0.0 // identity for addition
// default ops: multiply then add
);Custom reduction with transform_reduce
struct Stats {
double sum = 0.0;
double sum_sq = 0.0;
long count = 0;
};
// Single pass, parallel, computes mean and stddev simultaneously
Stats result = std::transform_reduce(
std::execution::par,
data.begin(), data.end(),
Stats{},
// Binary reduce: merge two partial Stats
[](Stats a, const Stats& b) noexcept -> Stats {
return {a.sum + b.sum, a.sum_sq + b.sum_sq, a.count + b.count};
},
// Unary transform: element → Stats
[](double x) noexcept -> Stats {
return {x, x * x, 1};
}
);
double mean = result.sum / result.count;
double stddev = std::sqrt(result.sum_sq / result.count - mean * mean);Parallel prefix scan
std::inclusive_scan and std::exclusive_scan (C++17) compute prefix sums and generalizations. Unlike std::partial_sum (C++98), they accept execution policies and relax left-to-right ordering:
std::vector<int> input = {1, 2, 3, 4, 5};
std::vector<int> output(input.size());
// [1, 3, 6, 10, 15]
std::inclusive_scan(std::execution::par,
input.begin(), input.end(), output.begin());
// [0, 1, 3, 6, 10]
std::exclusive_scan(std::execution::par,
input.begin(), input.end(), output.begin(), 0);Parallel image processing
struct Pixel { uint8_t r, g, b, a; };
std::vector<Pixel> image(width * height);
const float inv_gamma = 1.0f / 2.2f;
// par_unseq is safe here: pure arithmetic, no locks, no exceptions
std::transform(std::execution::par_unseq,
image.begin(), image.end(), image.begin(),
[inv_gamma](Pixel p) noexcept -> Pixel {
auto gc = [inv_gamma](uint8_t c) noexcept -> uint8_t {
return static_cast<uint8_t>(
std::pow(c * (1.0f / 255.0f), inv_gamma) * 255.0f + 0.5f);
};
return {gc(p.r), gc(p.g), gc(p.b), p.a};
});Best Practices
Use reduce instead of accumulate for parallel sums. std::accumulate has no execution policy overload — it is always sequential. Switching to std::reduce is the correct path to parallelism.
Benchmark before committing to parallelism. Thread pool overhead (task scheduling, cache thrashing, false sharing) commonly dominates for small inputs. Rough empirical thresholds on modern desktop hardware:
| Algorithm | Sequential wins below | Notes |
|---|---|---|
sort | ~50K elements | Thread overhead per partition chunk |
transform | ~500K elements | Only when work/element > ~10ns |
reduce | ~100K elements | Memory bandwidth saturates early |
for_each | ~10K if expensive | Depends entirely on functor cost |
Prefer par_unseq for pure numeric kernels. When your functor is arithmetic with no side effects, par_unseq enables SIMD vectorization on top of threading — often a 4–8× speedup over par alone on AVX2 hardware.
Keep functors small and cheap to copy. The runtime may copy functors across threads. Closures capturing large objects by value have hidden cost; capture by reference is fine for par (not par_unseq if the object has locking internals).
Common Pitfalls
Data race with shared mutable state under par
// WRONG — total += x is a data race; undefined behavior
int total = 0;
std::for_each(std::execution::par, v.begin(), v.end(),
[&total](int x) { total += x; });
// CORRECT — use the dedicated reduction algorithm
int total = std::reduce(std::execution::par, v.begin(), v.end(), 0);
// CORRECT — atomic if you genuinely need for_each side effects
std::atomic<int> total{0};
std::for_each(std::execution::par, v.begin(), v.end(),
[&total](int x) { total.fetch_add(x, std::memory_order_relaxed); });Locking inside par_unseq
SIMD execution interleaves multiple elements' operations within a single thread. Acquiring a mutex inside a vectorized operation can deadlock because the lock-acquire for element N may be issued on a SIMD lane before the lock-release for element M (processed earlier in program order) has fired:
// WRONG — deadlock risk with par_unseq + mutex
std::mutex mtx;
std::for_each(std::execution::par_unseq, v.begin(), v.end(),
[&](int& x) {
std::lock_guard lock(mtx); // UB / deadlock under vectorization
x *= 2;
});
// CORRECT — no shared state needed here
std::transform(std::execution::par_unseq,
v.begin(), v.end(), v.begin(),
[](int x) noexcept { return x * 2; });Exceptions inside par_unseq
Under par_unseq, throwing an exception from a functor is undefined behavior. Under par, the implementation calls std::terminate. If your functor can throw, use seq or restructure the computation:
// WRONG — exception under par_unseq is UB; under par calls std::terminate
std::transform(std::execution::par_unseq, v.begin(), v.end(), out.begin(),
[](int x) -> int {
if (x < 0) throw std::range_error{"negative"};
return x * 2;
});
// CORRECT — encode the error as a sentinel value, check afterward
std::transform(std::execution::par_unseq, v.begin(), v.end(), out.begin(),
[](int x) noexcept -> int { return x < 0 ? INT_MIN : x * 2; });Floating-point nondeterminism with par
std::reduce with par may produce different results on different runs due to nondeterministic operation ordering. This is not a bug — it is the documented behavior. For reproducible floating-point results, use std::accumulate (always sequential).
Platform Support
As of 2025, platform support is uneven:
# GCC (9.1+) — requires Intel TBB for par/par_unseq policies
g++ -std=c++17 -O3 source.cpp -ltbb
# CMake
find_package(TBB REQUIRED)
target_link_libraries(myapp PRIVATE TBB::tbb)
# MSVC (VS 2017+) — built-in support, no extra library
cl /std:c++17 /O2 source.cpp
# Clang (15+) — requires PSTL or libc++ with TBB backend
clang++ -std=c++17 -O3 source.cpp -ltbbGCC links the PSTL (Parallel STL) frontend against TBB at runtime. If TBB is not present, par and par_unseq silently fall back to sequential execution on some implementations — always confirm your build links -ltbb and the TBB runtime is available.
See Also
std::accumulate— sequential predecessor tostd::reducestd::transform— element-wise transformationstd::sort— sorting with optional execution policy- Ranges (C++20) — ranges algorithms do not yet accept execution policies as of C++23
- Coroutines — cooperative concurrency, complementary to data parallelism