Multithreading & Job Systems in C++ Games

TL;DR

Modern games run CPU-bound logic on all cores via a job system — a work queue where threads pick up tasks and execute them. The game loop becomes a graph of dependent jobs rather than a sequential loop. C++17 parallel algorithms and std::jthread (C++20) are good starting points; production engines use custom fiber-based or work-stealing job schedulers.

The problem with naive multithreading

cpp

// Wrong: data races on shared state
std::vector<Entity> entities;

auto t1 = std::thread([&]{ for (auto& e : entities) updatePhysics(e); });
auto t2 = std::thread([&]{ for (auto& e : entities) updateAI(e); });
// Physics and AI write to the same entities concurrently → data race, UB
t1.join();  // joining avoids std::terminate but does not fix the race
t2.join();

The solution isn't locking everything — locks serialize execution and kill performance. The solution is partitioning work so threads don't share mutable state.

Partition-based parallelism

cpp

// Divide entities into N contiguous chunks, one per thread
void parallelUpdate(std::vector<Entity>& entities, int num_threads) {
    if (num_threads < 1) num_threads = 1;  // avoid division by zero
    const size_t n = static_cast<size_t>(num_threads);
    const size_t chunk = entities.size() / n;
    std::vector<std::jthread> threads;
    threads.reserve(n);

    for (size_t t = 0; t < n; ++t) {
        const size_t start = t * chunk;
        // The last thread also takes the remainder
        const size_t end = (t == n - 1) ? entities.size() : start + chunk;

        threads.emplace_back([&entities, start, end] {
            for (size_t i = start; i < end; ++i)
                updatePhysics(entities[i]);
        });
    }
    // std::jthread joins on destruction, so all chunks finish before returning
}

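This works when each entity's update touches only that entity: no writes to, and no reads from, entities owned by another chunk. Physics integration is embarrassingly parallel; collision detection is not, because every contact involves two entities.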
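The chunking above hands the whole remainder to the last thread. A minimal sketch of a more even split, assuming a hypothetical helper `chunkRange` (not part of the original code), spreads the remainder one element at a time over the first chunks:

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open range [first, second) of chunk `index`
// when `count` items are split into `parts` chunks (parts must be > 0).
// The first (count % parts) chunks receive one extra item.
constexpr std::pair<std::size_t, std::size_t>
chunkRange(std::size_t count, std::size_t parts, std::size_t index) {
    const std::size_t base  = count / parts;
    const std::size_t extra = count % parts;
    const std::size_t start = index * base + (index < extra ? index : extra);
    const std::size_t size  = base + (index < extra ? 1 : 0);
    return {start, start + size};
}
```

With 10 entities and 4 threads this yields chunks of 3, 3, 2 and 2 entities, whereas giving the remainder to the last thread yields 2, 2, 2 and 4.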

C++17 parallel algorithms

The simplest entry point to parallelism:

cpp

#include <algorithm>
#include <execution>

// Parallel transform; the backend is implementation-specific (libstdc++ uses TBB)
std::transform(std::execution::par_unseq,
    positions.begin(), positions.end(),
    velocities.begin(),
    positions.begin(),
    [dt](const Vec3& pos, const Vec3& vel) {
        return pos + vel * dt;
    });

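// Parallel for_each. par_unseq may interleave calls on one thread, so
// e.update() must not lock a mutex or allocate; use par if it does.
std::for_each(std::execution::par_unseq,
    entities.begin(), entities.end(),
    [](Entity& e) { e.update(); });

With GCC's libstdc++ the parallel policies are built on Intel TBB, so link with -ltbb; MSVC ships its own implementation. Good for non-game code; production games usually need finer control over scheduling.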

Simple job queue

cpp

#include <thread>
#include <functional>
#include <queue>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <vector>

class JobQueue {
public:
    using Job = std::function<void()>;

    explicit JobQueue(int num_workers) {
        for (int i = 0; i < num_workers; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~JobQueue() {
        {
            std::lock_guard lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
    }

    void submit(Job job) {
        {
            std::lock_guard lock(mutex_);
            queue_.push(std::move(job));
        }
        cv_.notify_one();
    }

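    void waitAll() {
        std::unique_lock lock(mutex_);
        // Waits on its own condition variable so that submit()'s notify_one
        // always wakes a worker, never this waiter.
        idle_cv_.wait(lock, [this] { return queue_.empty() && active_ == 0; });
    }

private:
    void workerLoop() {
        while (true) {
            Job job;
            {
                std::unique_lock lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty() || done_; });
                if (done_ && queue_.empty()) return;
                job = std::move(queue_.front());
                queue_.pop();
                ++active_;
            }
            job();
            {
                std::lock_guard lock(mutex_);
                --active_;
                if (queue_.empty() && active_ == 0)
                    idle_cv_.notify_all();  // wake waitAll()
            }
        }
    }

    std::queue<Job> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;       // workers: job available or shutdown
    std::condition_variable idle_cv_;  // waitAll(): queue drained, no job running
    int active_ = 0;
    bool done_ = false;
    // Declared last: members are destroyed in reverse order, so the threads
    // are joined before the mutex and condition variables they use go away.
    std::vector<std::jthread> workers_;
};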

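// Usage: leave one core for the main thread. hardware_concurrency() may
// return 0 when the core count is unknown, so fall back to one worker.
const unsigned hw = std::thread::hardware_concurrency();
JobQueue jobs(hw > 1 ? static_cast<int>(hw - 1) : 1);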

void gameLoop() {
    // Split physics update into jobs
    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        jobs.submit([chunk] { updatePhysicsChunk(chunk); });
    }
    jobs.waitAll();  // sync point before rendering

    render();        // render command recording (single-threaded in this sketch)
}

Dependency graph (job graph)

Real game engines use a dependency graph where jobs declare what they read/write:

Frame N:
  [Input]──────────────────┐
  [Physics]────────────────┤
  [AI Decisions]───────────┤──[Collision]──[Apply Forces]──[Render]
  [Animation Sampling]─────┘

cpp

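struct JobHandle { std::atomic<bool> done{false}; };

struct Job {
    std::function<void()> fn;
    std::vector<JobHandle*> dependencies;  // must outlive the job
    JobHandle* handle;                     // marked done when fn returns
};

// Caveat: a waiting job occupies its worker. Submit dependencies before
// their dependents (the queue is FIFO); otherwise every worker can end up
// spinning on jobs that are still queued behind them.
void submitWithDeps(JobQueue& q, Job job) {
    q.submit([j = std::move(job)] {
        // Spin until all dependencies have finished
        for (auto* dep : j.dependencies)
            while (!dep->done.load(std::memory_order_acquire))
                std::this_thread::yield();
        j.fn();
        if (j.handle)
            j.handle->done.store(true, std::memory_order_release);
    });
}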

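Production engines avoid parking a worker on an unfinished dependency. Naughty Dog's job system runs jobs on fibers, so a waiting job can be suspended while its worker picks up other work; Unreal's Task Graph only schedules a task once its prerequisites have completed. Both rely on low-contention queues rather than a single mutex-protected one.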
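Spinning inside a worker, as submitWithDeps does, ties up a thread per waiting job. A common alternative is dependency counting: each job tracks how many of its dependencies are unfinished, and a finishing job decrements the counters of its dependents, releasing those that reach zero. The sketch below is an illustration under assumptions, not any engine's actual API: `JobGraph`, `add` and `run` are hypothetical names, and ready jobs run on the calling thread where a real scheduler would push them to a worker queue and use atomic counters.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical single-threaded illustration of dependency counting.
class JobGraph {
public:
    using JobId = std::size_t;

    // Adds a job that may only run after every job in `deps` has finished.
    // Dependencies must already have been added, which rules out cycles.
    JobId add(std::function<void()> fn, const std::vector<JobId>& deps = {}) {
        const JobId id = nodes_.size();
        nodes_.push_back({std::move(fn), deps.size(), {}});
        for (JobId d : deps)
            nodes_[d].dependents.push_back(id);
        return id;
    }

    // Runs every job exactly once, never before its dependencies.
    void run() {
        std::vector<std::size_t> pending;  // unfinished dependencies per job
        std::queue<JobId> ready;
        for (JobId id = 0; id < nodes_.size(); ++id) {
            pending.push_back(nodes_[id].dep_count);
            if (nodes_[id].dep_count == 0) ready.push(id);
        }
        while (!ready.empty()) {
            const JobId id = ready.front();
            ready.pop();
            nodes_[id].fn();  // a real scheduler would hand this to a worker
            for (JobId next : nodes_[id].dependents)
                if (--pending[next] == 0) ready.push(next);
        }
    }

private:
    struct Node {
        std::function<void()> fn;
        std::size_t dep_count;          // number of declared dependencies
        std::vector<JobId> dependents;  // jobs waiting on this one
    };
    std::vector<Node> nodes_;
};
```

No job ever waits here: a job enters the ready queue only when its counter reaches zero, so a worker that picks it up can run it immediately.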

Thread-safe patterns

Read-many / write-once (frame flip)

cpp

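// Double-buffered state: the writer updates "back" while readers use "front"
struct DoubleBuffered {
    EntityState state[2];
    std::atomic<int> front{0};

    EntityState& write() { return state[front ^ 1]; }
    const EntityState& read() const { return state[front]; }

    // Single writer only: the load and the store are two separate steps.
    // Call at a frame boundary, when no reader still holds a reference
    // returned by read().
    void flip() { front.store(front ^ 1, std::memory_order_release); }
};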
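A minimal sketch of the frame flip in use, assuming a hypothetical `Frame` payload and a generic `FrameBuffer<T>` wrapper (both names are illustrative, not from the original). It spells out the memory orders and shows that data written to the back buffer only becomes visible to readers after `flip()`:

```cpp
#include <atomic>

// Hypothetical payload; stands in for EntityState.
struct Frame {
    int tick = 0;
};

template <typename T>
class FrameBuffer {
public:
    // Back buffer: touched only by the single writer thread, which is also
    // the only thread that modifies front_, so a relaxed load suffices.
    T& write() { return state_[front_.load(std::memory_order_relaxed) ^ 1]; }

    // Front buffer: may be read by any number of threads concurrently.
    const T& read() const { return state_[front_.load(std::memory_order_acquire)]; }

    // Publishes the back buffer. Call from the writer at a frame boundary,
    // when no reader still holds a reference obtained from read().
    void flip() { front_.fetch_xor(1, std::memory_order_release); }

private:
    T state_[2]{};
    std::atomic<int> front_{0};
};
```

A typical frame would call `write()` during simulation, `flip()` at the sync point, and `read()` from the render or audio threads during the next frame.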

Atomic counters for task completion

cpp

std::atomic<int> remaining_jobs{num_jobs};

auto job = [&] {
    doWork();
    if (remaining_jobs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        // Last job — signal completion
        completion_event.set();
    }
};
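The snippet above leaves `completion_event` undefined. One way to make the pattern runnable is sketched here, with a `std::promise<void>` standing in for the event (an assumption; engines typically use their own event or semaphore type) and a hypothetical `runJobs` helper:

```cpp
#include <atomic>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: runs `num_jobs` (must be > 0) jobs on their own
// threads and returns the sum of their results once the last job has
// signalled completion.
int runJobs(int num_jobs) {
    std::atomic<int> remaining{num_jobs};
    std::atomic<int> total{0};
    std::promise<void> all_done;  // stands in for completion_event
    std::future<void> done = all_done.get_future();

    std::vector<std::thread> threads;
    for (int i = 1; i <= num_jobs; ++i) {
        threads.emplace_back([&, i] {
            total.fetch_add(i, std::memory_order_relaxed);  // the "work"
            // acq_rel: the last job observes the writes of all earlier jobs
            if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
                all_done.set_value();  // reached by exactly one job
        });
    }

    done.wait();  // block until the last job has finished its work
    const int result = total.load(std::memory_order_relaxed);
    for (auto& t : threads) t.join();  // threads must be joined before destruction
    return result;
}
```

`fetch_sub` returns the value before the decrement, so the job that sees 1 is the last to finish and is the only one that signals.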

What to parallelize

System	Parallelizable?	Notes
Physics integration	Yes	Embarrassingly parallel
Collision detection broad phase	Yes	Spatial partitioning per thread
Collision response	Careful	May need write ordering
AI state machines	Yes	Independent per entity
Animation sampling	Yes	No shared state
Render command recording	Yes	Per-object draw calls
Audio mixing	Yes	Mix per-source, sum at end
Asset streaming I/O	Yes	Async I/O + thread pool
Gameplay scripting	Risky	Usually keep single-threaded