Linux perf Profiler

TL;DR

perf stat for overall hardware counter summary. perf record + perf report for function-level hotspots. Flame graphs (via Brendan Gregg's scripts) give the most actionable view.

bash

# Quick summary
perf stat ./myapp

# Record and report
perf record -g ./myapp
perf report

# Flame graph
perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg

Installation

bash

# Ubuntu/Debian
sudo apt install linux-perf linux-tools-common linux-tools-$(uname -r)

# Fedora/RHEL
sudo dnf install perf

# Allow non-root profiling (set to -1 for no restriction, 0 for root only)
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

perf stat — Hardware Counters

bash

# Basic stats: cycles, instructions, cache misses
perf stat ./myapp

# Output example:
#  3,241,123,456  cycles
#  8,012,345,678  instructions              #  2.47  insns per cycle
#     12,345,678  cache-references
#        123,456  cache-misses              #    1.00% of all cache refs
#      1,234,567  branch-misses

# Specific events
perf stat -e cycles,instructions,L1-dcache-misses,LLC-misses ./myapp

# Compare multiple runs
perf stat -r 5 ./myapp  # repeat 5 times, show std deviation

# All events
perf list                # list all available events
perf stat -e 'cache-misses,cache-references,branch-misses,page-faults' ./myapp

perf record / report — Sampling

bash

# Record with call graph (frame pointer)
perf record -g ./myapp

# Record with DWARF call graph (more accurate, more data)
perf record --call-graph dwarf ./myapp

# Record at specific frequency
perf record -F 1000 -g ./myapp  # 1000 samples/sec

# Record a running process
perf record -F 99 -g -p <PID> -- sleep 30

# Record all CPUs
perf record -F 99 -g -a -- sleep 10

# Interactive report
perf report

# Report to stdout
perf report --stdio

# Show top functions
perf report --sort=dso,sym --stdio | head -50

cpp

Godbolt

# Interactive TUI keys:
Enter     — expand/collapse call tree
d         — annotate with source/assembly
?         — help
q         — quit
A         — toggle annotation
s         — sort order
z/Z       — zoom in/out

perf annotate — Source + Assembly

bash

# Annotate source with hot instructions
perf annotate --stdio -s my_function

# Or in report TUI: press 'a' on a function

cpp

Godbolt

# Output shows percentage of time per instruction:
     0.52 :  400ca0:  mov    (%rdi),%eax
    12.38 :  400ca3:  add    %eax,%esi    ← hot instruction
     0.11 :  400ca5:  mov    %esi,(%rdi)

perf diff — Compare Two Profiles

bash

perf record -o before.data ./myapp_before
perf record -o after.data ./myapp_after

perf diff before.data after.data
# Shows % change per symbol

Flame Graphs

bash

# Install Brendan Gregg's flamegraph scripts
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:./FlameGraph

# Record
perf record -F 99 -g ./myapp -- args

# Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
firefox flame.svg

# Off-CPU flame graph (time waiting, not running)
perf record -e sched:sched_switch -ag ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl --color=io > offcpu.svg

Specific Use Cases

Cache Miss Analysis

bash

perf stat -e \
  L1-dcache-loads,L1-dcache-load-misses,\
  L1-icache-loads,L1-icache-load-misses,\
  LLC-loads,LLC-load-misses \
  ./myapp

Branch Misprediction

bash

perf stat -e branches,branch-misses ./myapp
perf record -e branch-misses -g ./myapp
perf report

Memory Bandwidth

bash

perf stat -e \
  uncore_imc/data_reads/,\
  uncore_imc/data_writes/ \
  ./myapp

Syscall Tracing

bash

# Count syscalls
perf stat -e 'syscalls:sys_enter_*' ./myapp

# Trace specific syscall
perf trace -e mmap ./myapp

# All system calls with latency
perf trace --summary ./myapp

Build Flags for Profiling

bash

# Keep symbols, enable call graphs
g++ -O2 -g -fno-omit-frame-pointer -o myapp myapp.cpp

# For accurate DWARF call graphs
g++ -O2 -g -gdwarf-4 -fno-optimize-sibling-calls -o myapp myapp.cpp

cmake

# CMake
add_compile_options(-fno-omit-frame-pointer)
set(CMAKE_BUILD_TYPE RelWithDebInfo)  # O2 + debug info

perf vs Other Profilers

Tool	Mechanism	Overhead	Granularity
perf	Hardware PMU sampling	Very low	Function + line
Valgrind/callgrind	Instrumentation	~10-50x	Every instruction
gprof	Instrumented + sampling	Low	Function
Heaptrack	Heap allocation tracking	Low	Allocation site
Intel VTune	Hardware PMU	Very low	Microarchitecture
Tracy	Manual instrumentation	Very low	Custom zones

Common Workflows

bash

# 1. Find CPU hotspot
perf record -g ./myapp && perf report

# 2. Check cache efficiency
perf stat -e cache-misses,cache-references ./myapp

# 3. Understand where time goes (flame graph)
perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# 4. Compare before/after optimization
perf record -o v1.data ./myapp_v1
perf record -o v2.data ./myapp_v2
perf diff v1.data v2.data

# 5. Identify lock contention
perf lock record ./myapp
perf lock report