Profiler
Updated 2025-01-01T00:00:00.000Z
Linux perf Profiler
Using Linux perf — perf stat, perf record/report, flame graphs, hardware counters, cache misses, branch mispredictions, and perf annotate.
TL;DR
Use perf stat for an overall hardware counter summary and perf record + perf report for function-level hotspots. Flame graphs (via Brendan Gregg's scripts) give the most actionable view.
bash
# Quick summary
perf stat ./myapp
# Record and report
perf record -g ./myapp
perf report
# Flame graph
perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg
Installation
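Whether perf can sample without root depends on the `kernel.perf_event_paranoid` sysctl. The sketch below maps a value to its meaning, paraphrasing the kernel's sysctl documentation. The function name `describe_paranoid` is hypothetical, and some distributions patch in stricter levels above 2, which the sketch only labels generically.

```shell
#!/bin/sh
# describe_paranoid LEVEL: print what an unprivileged user may do at that level.
# (Hypothetical helper; wording paraphrases the kernel sysctl documentation.)
describe_paranoid() {
    case "$1" in
        -1) echo "no restrictions for unprivileged users" ;;
        0)  echo "system-wide CPU events allowed; raw tracepoints need privileges" ;;
        1)  echo "user and kernel profiling of own processes; system-wide CPU events need privileges" ;;
        2)  echo "user-space profiling of own processes only" ;;
        *)  echo "stricter distribution-specific level or unknown value" ;;
    esac
}

# Report the current setting when the sysctl file is readable.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    level=$(cat /proc/sys/kernel/perf_event_paranoid)
    echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
fi
```

Lower values are more permissive, so a value of 2 is enough for profiling your own user-space code but not for kernel stacks or system-wide recording.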
bash
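# Debian
sudo apt install linux-perf
# Ubuntu
sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
# Fedora/RHEL
sudo dnf install perf
# Relax the restriction on unprivileged profiling. Lower is more permissive:
# -1 = no restriction, 0 = also allows system-wide CPU events,
# 1 = user + kernel profiling of your own processes, 2 = user space only
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
perf stat: Hardware Counters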
bash
# Basic stats: cycles, instructions, cache misses
perf stat ./myapp
# Output example:
# 3,241,123,456 cycles
# 8,012,345,678 instructions # 2.47 insns per cycle
# 12,345,678 cache-references
# 123,456 cache-misses # 1.00% of all cache refs
# 1,234,567 branch-misses
# Specific events
perf stat -e cycles,instructions,L1-dcache-load-misses,LLC-load-misses ./myapp
# Compare multiple runs
perf stat -r 5 ./myapp # repeat 5 times, show std deviation
# All events
perf list # list all available events
perf stat -e 'cache-misses,cache-references,branch-misses,page-faults' ./myapp
perf record / report: Sampling
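The sampling frequency (`-F`) sets how many samples perf aims to take per second on each CPU it monitors, which gives a rough ceiling on how many samples a recording can contain. A small arithmetic sketch; the function name `expected_samples` and the 8-CPU machine are assumptions for illustration:

```shell
#!/bin/sh
# expected_samples FREQ SECONDS CPUS: rough ceiling on samples for one event.
expected_samples() {
    echo $(( $1 * $2 * $3 ))
}

# e.g. `perf record -F 99 -a -- sleep 10` on a hypothetical 8-CPU machine
expected_samples 99 10 8   # prints 7920
```

Idle CPUs contribute few or no samples for the default cycles event, so real counts are usually lower. Using 99 Hz rather than 100 Hz avoids sampling in lockstep with periodic timer activity.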
bash
# Record with call graph (frame pointer)
perf record -g ./myapp
# Record with DWARF call graph (more accurate, more data)
perf record --call-graph dwarf ./myapp
# Record at specific frequency
perf record -F 1000 -g ./myapp # 1000 samples/sec
# Record a running process
perf record -F 99 -g -p <PID> -- sleep 30
# Record all CPUs
perf record -F 99 -g -a -- sleep 10
# Interactive report
perf report
# Report to stdout
perf report --stdio
# Show top functions
perf report --sort=dso,sym --stdio | head -50
perf report Navigation
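text
# Interactive TUI keys:
Enter: zoom into the DSO/thread or annotate the current symbol (via a menu)
+: expand/collapse one call chain level
E / C: expand / collapse all call chains
a: annotate the current symbol with source/assembly
d: zoom into the current DSO
t: zoom into the current thread
/: filter symbols by name
Esc: zoom out or go back
h or ?: help
q: quit
perf annotate: Source + Assembly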
bash
# Annotate source with hot instructions
perf annotate --stdio -s my_function
# Or in report TUI: press 'a' on a function
text
# Output shows the share of samples per instruction:
0.52 : 400ca0: mov (%rdi),%eax
12.38 : 400ca3: add %eax,%esi ← hot: samples often land on the instruction waiting for a result (here, the load above)
0.11 : 400ca5: mov %esi,(%rdi)
perf diff: Compare Two Profiles
bash
perf record -o before.data ./myapp_before
perf record -o after.data ./myapp_after
perf diff before.data after.data
# Shows the change in overhead (percentage points) per symbol
Flame Graphs
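Flame graph tooling works on "folded" stacks: one line per unique call stack, frames joined with semicolons from root to leaf, followed by a sample count. The sketch below shows only that aggregation step on already-flattened input. It is a toy stand-in for what `stackcollapse-perf.pl` produces, not a reimplementation, and the function name `fold_stacks` and the sample frames are made up.

```shell
#!/bin/sh
# fold_stacks: read one semicolon-joined stack per line (one line per sample)
# and print each unique stack followed by the number of samples that hit it.
fold_stacks() {
    awk '{ n[$0]++ } END { for (s in n) print s, n[s] }' | sort
}

# Three hypothetical samples: two in parse(), one in render().
printf '%s\n' 'main;run;parse' 'main;run;render' 'main;run;parse' | fold_stacks
# prints:
# main;run;parse 2
# main;run;render 1
```

Each output line is the form `flamegraph.pl` consumes: the width of a frame in the SVG is proportional to the summed counts of the stacks that pass through it.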
bash
# Install Brendan Gregg's flamegraph scripts
git clone https://github.com/brendangregg/FlameGraph.git
export PATH="$PATH:$PWD/FlameGraph"
# Record
perf record -F 99 -g -- ./myapp args
# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
firefox flame.svg
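# Off-CPU flame graph (approximation: counts where threads are switched out,
# not how long they wait; system-wide tracing usually needs root)
perf record -e sched:sched_switch -a -g -- ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl --colors=io > offcpu.svg
Specific Use Cases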
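Most of the use cases in this section reduce to a ratio of two counters: misses over loads, mispredicted over total branches, instructions over cycles. A minimal sketch, assuming perf stat's CSV mode (`-x,`), where the first field is the count and the third is the event name. The helper name `counter_ratio` is hypothetical, and the sample counts are the ones from the perf stat output example earlier on this page.

```shell
#!/bin/sh
# counter_ratio NUM DEN: read `perf stat -x,` CSV lines on stdin and print
# the count of event NUM divided by the count of event DEN.
# A trailing modifier such as ":u" on the event name is stripped first.
counter_ratio() {
    awk -F, -v num="$1" -v den="$2" '
        { name = $3; sub(/:[ukhpGH]+$/, "", name); count[name] = $1 }
        END {
            if (count[den] + 0 > 0) printf "%.4f\n", count[num] / count[den]
            else print "n/a"   # missing or unsupported counter
        }'
}

# Sample CSV lines (count,unit,event,...); run-time fields are placeholders.
sample='3241123456,,cycles,1000000,100.00
8012345678,,instructions,1000000,100.00
12345678,,cache-references,1000000,100.00
123456,,cache-misses,1000000,100.00'

printf '%s\n' "$sample" | counter_ratio instructions cycles             # IPC
printf '%s\n' "$sample" | counter_ratio cache-misses cache-references   # miss rate
```

With a real program, perf stat writes its counters to stderr, so one way to feed the helper is `perf stat -x, -e cycles,instructions ./myapp 2>&1 >/dev/null | counter_ratio instructions cycles`.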
Cache Miss Analysis
bash
perf stat -e \
L1-dcache-loads,L1-dcache-load-misses,\
L1-icache-loads,L1-icache-load-misses,\
LLC-loads,LLC-load-misses \
./myapp
Branch Misprediction
bash
perf stat -e branches,branch-misses ./myapp
perf record -e branch-misses -g ./myapp
perf report
Memory Bandwidth
bash
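# Intel uncore memory-controller events. Uncore counters are per socket, so the
# counts cover the whole system while ./myapp runs; names vary by CPU
# (check `perf list | grep uncore_imc`)
perf stat -a -e \
uncore_imc/data_reads/,\
uncore_imc/data_writes/ \
./myapp
Syscall Tracing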
bash
# Count syscalls
perf stat -e 'syscalls:sys_enter_*' ./myapp
# Trace specific syscall
perf trace -e mmap ./myapp
# All system calls with latency
perf trace --summary ./myapp
Build Flags for Profiling
bash
# Keep symbols, enable call graphs
g++ -O2 -g -fno-omit-frame-pointer -o myapp myapp.cpp
# For accurate DWARF call graphs
g++ -O2 -g -gdwarf-4 -fno-optimize-sibling-calls -o myapp myapp.cpp
cmake
# CMake
add_compile_options(-fno-omit-frame-pointer)
set(CMAKE_BUILD_TYPE RelWithDebInfo) # O2 + debug info
perf vs Other Profilers
| Tool | Mechanism | Overhead | Granularity |
|---|---|---|---|
| perf | Hardware PMU sampling | Very low | Function + line |
| Valgrind/callgrind | Instrumentation | ~10-50x | Every instruction |
| gprof | Instrumented + sampling | Low | Function |
| Heaptrack | Heap allocation tracking | Low | Allocation site |
| Intel VTune | Hardware PMU | Very low | Microarchitecture |
| Tracy | Manual instrumentation | Very low | Custom zones |
Common Workflows
bash
# 1. Find CPU hotspot
perf record -g ./myapp && perf report
# 2. Check cache efficiency
perf stat -e cache-misses,cache-references ./myapp
# 3. Understand where time goes (flame graph)
perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# 4. Compare before/after optimization
perf record -o v1.data ./myapp_v1
perf record -o v2.data ./myapp_v2
perf diff v1.data v2.data
# 5. Identify kernel lock contention (needs lock tracepoints in the running kernel)
perf lock record ./myapp
perf lock report
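Workflow 3 depends on three executables being on `PATH`, and a missing one only shows up as a broken pipeline after the recording has finished. A small pre-flight sketch; the function name `missing_tools` is hypothetical:

```shell
#!/bin/sh
# missing_tools CMD...: print each command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the flame graph toolchain used in workflow 3 before recording.
missing=$(missing_tools perf stackcollapse-perf.pl flamegraph.pl)
if [ -n "$missing" ]; then
    echo "not on PATH:" $missing
else
    echo "flame graph toolchain found"
fi
```

Running the check first avoids discovering after a long recording that the FlameGraph scripts were never added to `PATH`.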