You want to optimize your code. You make a change, run the benchmark, and… the numbers are different every time. Sometimes faster, sometimes slower, even when you changed nothing. How can you tell if your optimization actually worked if the measurement itself is unreliable?

This is the fundamental problem of performance engineering: before you can optimize, you need to be able to measure. And on a modern Linux system, there is a shocking amount of noise that contaminates your measurements.

I ran a series of experiments on an AMD Ryzen 9 5950X (16 cores, 32 threads) to understand where this noise comes from and how to eliminate it. The test system runs Linux 6.8 with the ondemand frequency governor and turbo boost enabled — a typical default configuration.

The Workload

The benchmark is deliberately simple: sum a 4MB array of 1M uint32_t elements, 100 times, measuring each iteration independently with rdtsc (the CPU timestamp counter). This workload touches memory (so it exercises the cache hierarchy) but has a completely predictable access pattern (so the prefetcher handles it well). Any variance we see is from the system, not the algorithm.

static volatile uint32_t sink;  // volatile global defeats dead-code elimination

void __attribute__((noinline)) workload(uint32_t *arr, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    sink = sum;
}

Baseline: The State of Things

With no mitigations at all — just compile and run — the coefficient of variation (CV = stddev/mean) is consistently 9-14% across runs:

baseline (no mitigations)    CV=9.29%   min=907,528  max=1,299,344  range=391,816

A ~9% CV means that if your optimization yields a 3% improvement, you literally cannot tell. The noise is 3x larger than the signal.

The Sources of Noise

Using perf stat, we can see what’s going on under the hood. Here’s what a baseline run looks like:

 12      context-switches          ( +-  2.64% )
  1      cpu-migrations            ( +- 54.77% )   <-- huge variance!
1,090    page-faults               ( +-  0.06% )

The biggest culprits:

1. CPU Frequency Scaling (The Biggest Offender)

The system runs with the ondemand governor and AMD Precision Boost enabled. This means the CPU dynamically adjusts its frequency between ~3.4 GHz and ~4.7 GHz depending on load, thermal headroom, and what other cores are doing. During the benchmark, I observed actual frequencies varying between 4.625–4.675 GHz within a single run.

The TSC (timestamp counter) ticks at a constant rate, but the CPU does different amounts of work per tick depending on frequency. perf stat reported 4.354–4.451 GHz across different runs — that’s ~2% variation in effective clock speed, which directly becomes ~2% variance in cycle counts.

The fix: Set the governor to performance and disable boost:

sudo cpupower frequency-set --governor performance
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

This locks the CPU at a fixed frequency. On this system, that requires root.

2. CPU Migration (The Scheduler Bouncing You Around)

Without CPU pinning, the scheduler can move your process from one core to another. Each migration means:

  • Cold L1/L2 caches (your 4MB working set has to be re-fetched)
  • Possible NUMA effects (different memory controllers)
  • The migration itself takes time

perf stat showed cpu-migrations varying by ±54.77% — the most variable counter by far.

The fix: Pin to a specific core with taskset:

taskset -c 15 ./bench

Results:

CPU pin (core 0)     CV=1.62%   min=861,458  max=951,626  range=90,168
CPU pin (core 15)    CV=1.30%   min=920,958  max=981,104  range=60,146

CPU pinning alone dropped the CV from ~13% to ~1.3%. This is the single most impactful mitigation.

Core 15 slightly outperformed core 0 — core 0 typically handles more system interrupts (it’s the BSP — bootstrap processor), so higher-numbered cores tend to be quieter.

3. Page Faults (The Memory Manager Interrupting You)

Without mlockall, the kernel may reclaim your pages under memory pressure. Touching them again triggers a page fault — a minor fault if the page is still resident, a major fault if it was pushed to swap — and either way the kernel has to update page tables, which takes microseconds. With mlockall:

Without mlockall:  1,090 page-faults
With mlockall:       183 page-faults

The fix: Call mlockall(MCL_CURRENT | MCL_FUTURE) early in your program:

#include <sys/mman.h>

if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    perror("mlockall");   // can fail if RLIMIT_MEMLOCK is too low

This tells the kernel: “keep all my pages in physical RAM, don’t even think about swapping them.”

4. Address Space Layout Randomization (ASLR)

ASLR randomizes the base addresses of your stack, heap, and shared libraries on every execution. This changes cache line alignments and can shift hot loops across page boundaries, affecting iTLB behavior.

The fix:

setarch $(uname -m) -R ./bench    # per-process
# or globally (not recommended for production):
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

5. Context Switches and Interrupts

Even on a pinned core, timer interrupts and context switches still happen. perf stat showed ~12–15 context switches per run even with pinning. Each context switch pollutes caches and takes microseconds.

With root, you could go further:

# Real-time scheduling — preempt everything else
sudo chrt -f 80 ./bench
 
# Isolate CPUs from the scheduler entirely (kernel boot param)
# isolcpus=15 nohz_full=15 rcu_nocbs=15

Combining Mitigations

Here’s what happens as we stack mitigations:

Configuration                      CV (cycles)   Min        Max          Range
Baseline                             9.29%       907,528    1,299,344    391,816
CPU pin (core 15)                    1.30%       920,958      981,104     60,146
mlockall only                        5.91%       865,674    1,197,786    332,112
No ASLR only                         4.93%       900,558    1,086,674    186,116
Pin + mlock + no ASLR                2.14%       900,150      993,786     93,636
Pin + mlock + no ASLR (run 2)        1.11%       939,420    1,004,054     64,634

The best configuration — CPU pinning + mlockall + ASLR disabled — consistently achieves CV around 1-2%, compared to the baseline’s ~9-14%. That’s a 7-10x improvement in measurement stability.

The perf stat View

Here’s how the hardware counters compare between baseline and our best configuration:

                          Baseline (±%)     Best Config (±%)
task-clock                  +-  1.40%          +-  0.46%
context-switches            +-  2.13%          +-  1.04%
cpu-migrations              +- 54.77%          +-  0.00%   (zero migrations!)
page-faults                 +-  0.03%          +-  0.10%
cycles                      +-  0.55%          +-  0.37%
stalled-cycles-frontend     +-  4.75%          +-  4.53%
instructions                +-  0.31%          +-  0.00%
branch-misses               +-  2.92%          +-  0.08%
cache-misses                +-  0.69%          +-  0.39%

Notice: stalled-cycles-frontend has ~4.5% variance in both configurations. This is noise from contention on shared CPU resources (L3 cache, memory controller) that we can’t eliminate without isolating the core at the kernel level.

A Note on Estimators: Use the Minimum, Not the Mean

Here’s a crucial insight: noise only ever makes your program slower, never faster. A context switch adds time. A cache miss adds time. A frequency drop adds time. None of these make your program run faster than its true speed.

This means the minimum across multiple runs is a better estimator of the true execution time than the mean. The mean includes all the noise; the minimum gets closest to the noise-free execution.

When I took the minimum of each run (100 iterations) and compared the variance of those minimums across 10 runs:

                              CV of minimums
baseline                         2.24%
CPU pin (core 15)                2.59%
mlockall                         1.48%
no ASLR                          2.07%
pin + mlock + noASLR             2.45%

Using the minimum as your estimator, the CV drops to ~2% even without any mitigations. Combine this with the mitigations above, and you can reliably detect optimizations of just 1-2%.

Practical Checklist

Here’s the order of operations, from highest impact to lowest, for setting up a stable benchmarking environment:

  1. Pin to a core (taskset -c N): eliminates CPU migrations, the single largest source of variance
  2. Set performance governor (cpupower frequency-set --governor performance): locks clock speed (requires root)
  3. Disable boost/turbo (echo 0 > /sys/.../boost): removes frequency variation (requires root)
  4. Call mlockall(): prevents page faults from memory pressure
  5. Disable ASLR (setarch -R): removes address layout randomization
  6. Use minimum of N runs: better estimator than mean since noise is one-directional
  7. Warm up first: run the workload a few times before measuring to populate caches
  8. Use perf stat -r N: built-in statistical repeating with variance reporting

If you have root access, the additional high-impact options:

  • chrt -f 80: real-time FIFO scheduling, preempts everything
  • isolcpus=N: kernel boot parameter to fully isolate a core from the scheduler
  • nohz_full=N: stops timer ticks on isolated cores
  • Disable SMT: eliminates sibling thread interference (BIOS or echo 0 > /sys/devices/system/cpu/cpuN/online for sibling threads)

The Code

The full benchmark code and experiment scripts are in performance-playground. The key files:

  • bench.c — array-sum workload with rdtsc timing, supports -c CPU pinning and -m mlockall flags
  • bench_tight.c — tight compute-only loop for measuring overhead of mitigations themselves
  • run_experiments.sh — runs all configurations and computes statistics
