You want to optimize your code. You make a change, run the benchmark, and… the numbers are different every time. Sometimes faster, sometimes slower, even when you changed nothing. How can you tell if your optimization actually worked if the measurement itself is unreliable?
This is the fundamental problem of performance engineering: before you can optimize, you need to be able to measure. And on a modern Linux system, there is a shocking amount of noise that contaminates your measurements.
I ran a series of experiments on an AMD Ryzen 9 5950X (16 cores, 32 threads) to understand where this noise comes from and how to eliminate it. The test system runs Linux 6.8 with the ondemand frequency governor and turbo boost enabled — a typical default configuration.
The Workload
The benchmark is deliberately simple: sum a 4MB array of 1M uint32_t elements, 100 times, measuring each iteration independently with rdtsc (the CPU timestamp counter). This workload touches memory (so it exercises the cache hierarchy) but has a completely predictable access pattern (so the prefetcher handles it well). Any variance we see is from the system, not the algorithm.
```c
#include <stdint.h>

static volatile uint32_t sink;  // keeps the compiler from optimizing the sum away

void __attribute__((noinline)) workload(uint32_t *arr, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}
```
Baseline: The State of Things
With no mitigations at all — just compile and run — the coefficient of variation (CV = stddev/mean) is consistently 9-14% across runs:
```
baseline (no mitigations)   CV=9.29%  min=907,528  max=1,299,344  range=391,816
```
A ~9% CV means that if your optimization yields a 3% improvement, you literally cannot tell. The noise is 3x larger than the signal.
The Sources of Noise
Using perf stat, we can see what’s going on under the hood. Here’s what a baseline run looks like:
```
   12   context-switches   ( +- 2.64% )
    1   cpu-migrations     ( +- 54.77% )   <-- huge variance!
1,090   page-faults        ( +- 0.06% )
```
The biggest culprits:
1. CPU Frequency Scaling (The Biggest Offender)
The system runs with the ondemand governor and AMD Precision Boost enabled. This means the CPU dynamically adjusts its frequency between ~3.4 GHz and ~4.7 GHz depending on load, thermal headroom, and what other cores are doing. During the benchmark, I observed actual frequencies varying between 4.625–4.675 GHz within a single run.
The TSC (timestamp counter) ticks at a constant rate, but the CPU does different amounts of work per tick depending on frequency. perf stat reported 4.354–4.451 GHz across different runs — that’s ~2% variation in effective clock speed, which directly becomes ~2% variance in cycle counts.
The fix: Set the governor to performance and disable boost:
```sh
sudo cpupower frequency-set --governor performance
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
```
This locks the CPU at a fixed frequency. On this system, that requires root.
2. CPU Migration (The Scheduler Bouncing You Around)
Without CPU pinning, the scheduler can move your process from one core to another. Each migration means:
- Cold L1/L2 caches (your 4MB working set has to be re-fetched)
- Possible NUMA effects (different memory controllers)
- The migration itself takes time
perf stat showed cpu-migrations varying by ±54.77% — the most variable counter by far.
The fix: Pin to a specific core with taskset:
```sh
taskset -c 15 ./bench
```
Results:
```
CPU pin (core 0)    CV=1.62%  min=861,458  max=951,626  range=90,168
CPU pin (core 15)   CV=1.30%  min=920,958  max=981,104  range=60,146
```
CPU pinning alone dropped the CV from ~13% to ~1.3%. This is the single most impactful mitigation.
Core 15 slightly outperformed core 0 — core 0 typically handles more system interrupts (it’s the BSP — bootstrap processor), so higher-numbered cores tend to be quieter.
3. Page Faults (The Memory Manager Interrupting You)
Without mlockall, the kernel may evict your pages under memory pressure. When you access them again, you get a minor page fault — the kernel has to update page tables, which takes microseconds. With mlockall:
```
Without mlockall:  1,090 page-faults
With mlockall:       183 page-faults
```
The fix: Call mlockall(MCL_CURRENT | MCL_FUTURE) early in your program:
```c
#include <sys/mman.h>

mlockall(MCL_CURRENT | MCL_FUTURE);
```
This tells the kernel: “keep all my pages in physical RAM, don’t even think about swapping them.”
4. Address Space Layout Randomization (ASLR)
ASLR randomizes the base addresses of your stack, heap, and shared libraries on every execution. This changes cache line alignments and can shift hot loops across page boundaries, affecting iTLB behavior.
The fix:
```sh
setarch $(uname -m) -R ./bench   # per-process

# or globally (not recommended for production):
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
```
5. Context Switches and Interrupts
Even on a pinned core, timer interrupts and context switches still happen. perf stat showed ~12–15 context switches per run even with pinning. Each context switch pollutes caches and takes microseconds.
With root, you could go further:
```sh
# Real-time scheduling — preempt everything else
sudo chrt -f 80 ./bench

# Isolate CPUs from the scheduler entirely (kernel boot param)
# isolcpus=15 nohz_full=15 rcu_nocbs=15
```
Combining Mitigations
Here’s what happens as we stack mitigations:
| Configuration | CV (cycles) | Min | Max | Range |
|---|---|---|---|---|
| Baseline | 9.29% | 907,528 | 1,299,344 | 391,816 |
| CPU pin (core 15) | 1.30% | 920,958 | 981,104 | 60,146 |
| mlockall only | 5.91% | 865,674 | 1,197,786 | 332,112 |
| No ASLR only | 4.93% | 900,558 | 1,086,674 | 186,116 |
| Pin + mlock + no ASLR | 2.14% | 900,150 | 993,786 | 93,636 |
| Pin + mlock + no ASLR (run 2) | 1.11% | 939,420 | 1,004,054 | 64,634 |
The best configuration — CPU pinning + mlockall + ASLR disabled — consistently achieves CV around 1-2%, compared to the baseline’s ~9-14%. That’s a 7-10x improvement in measurement stability.
The perf stat View
Here’s how the hardware counters compare between baseline and our best configuration:
```
                          Baseline (±%)    Best Config (±%)
task-clock                 +- 1.40%         +- 0.46%
context-switches           +- 2.13%         +- 1.04%
cpu-migrations             +- 54.77%        +- 0.00%   (zero migrations!)
page-faults                +- 0.03%         +- 0.10%
cycles                     +- 0.55%         +- 0.37%
stalled-cycles-frontend    +- 4.75%         +- 4.53%
instructions               +- 0.31%         +- 0.00%
branch-misses              +- 2.92%         +- 0.08%
cache-misses               +- 0.69%         +- 0.39%
```
Notice: stalled-cycles-frontend has ~4.5% variance in both configurations. This is noise from contention on shared CPU resources (L3 cache, memory controller) that we can’t eliminate without isolating the core at the kernel level.
A Note on Estimators: Use the Minimum, Not the Mean
Here’s a crucial insight: noise only ever makes your program slower, never faster. A context switch adds time. A cache miss adds time. A frequency drop adds time. None of these make your program run faster than its true speed.
This means the minimum across multiple runs is a better estimator of the true execution time than the mean. The mean includes all the noise; the minimum gets closest to the noise-free execution.
When I took the minimum of each run (100 iterations) and compared the variance of those minimums across 10 runs:
```
                        CV of minimums
baseline                2.24%
CPU pin (core 15)       2.59%
mlockall                1.48%
no ASLR                 2.07%
pin + mlock + noASLR    2.45%
```
Using the minimum as your estimator, the CV drops to ~2% even without any mitigations. Combine this with the mitigations above, and you can reliably detect optimizations of just 1-2%.
Practical Checklist
Here’s the order of operations, from highest impact to lowest, for setting up a stable benchmarking environment:
- Pin to a core (`taskset -c N`): eliminates CPU migrations, the single largest source of variance
- Set performance governor (`cpupower frequency-set --governor performance`): locks clock speed (requires root)
- Disable boost/turbo (`echo 0 > /sys/.../boost`): removes frequency variation (requires root)
- Call `mlockall()`: prevents page faults from memory pressure
- Disable ASLR (`setarch -R`): removes address layout randomization
- Use minimum of N runs: better estimator than mean since noise is one-directional
- Warm up first: run the workload a few times before measuring to populate caches
- Use `perf stat -r N`: built-in statistical repeating with variance reporting
If you have root access, the additional high-impact options:
- `chrt -f 80`: real-time FIFO scheduling, preempts everything
- `isolcpus=N`: kernel boot parameter to fully isolate a core from the scheduler
- `nohz_full=N`: stops timer ticks on isolated cores
- Disable SMT: eliminates sibling thread interference (BIOS or `echo 0 > /sys/devices/system/cpu/cpuN/online` for sibling threads)
The Code
The full benchmark code and experiment scripts are in performance-playground. The key files:
- `bench.c` — array-sum workload with rdtsc timing, supports `-c CPU` pinning and `-m` mlockall flags
- `bench_tight.c` — tight compute-only loop for measuring overhead of the mitigations themselves
- `run_experiments.sh` — runs all configurations and computes statistics
References
- How to get consistent results when benchmarking on Linux — Denis Bakhvalov’s excellent guide
- Reducing Variance — Google Benchmark documentation
- Low Latency Tuning Guide — Erik Rigtorp’s comprehensive guide
- Linux perf Examples — Brendan Gregg’s perf reference