
My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd expect getting rid of frequency scaling effects to reduce variance, but for some reason on my Zen 5 it increases it!

$ echo; for cycle_type in cycles ref-cycles; do echo -n "$cycle_type stddev: "; for i in {1..100}; do perf stat -e $cycle_type true 2>&1 | grep -E "(cycles|ref-cycles)" | grep -v "seconds"; done | awk '{gsub(/,/,""); print $1}' | awk '{sum+=$1; sumsq+=$1*$1; n++} END {mean=sum/n; print sqrt(sumsq/n - mean*mean)}'; done

cycles stddev: 146434
ref-cycles stddev: 353483

On my Haswell server I get the expected behavior:

$ echo; for cycle_type in cycles ref-cycles; do echo -n "$cycle_type stddev: "; for i in {1..100}; do perf stat -e $cycle_type true 2>&1 | grep -E "(cycles|ref-cycles)" | grep -v "seconds"; done | awk '{gsub(/,/,""); print $1}' | awk '{sum+=$1; sumsq+=$1*$1; n++} END {mean=sum/n; print sqrt(sumsq/n - mean*mean)}'; done

cycles stddev: 64606.4
ref-cycles stddev: 46084.4

Zen 5 CPU model: AMD Ryzen AI 9 HX 370 w/ Radeon 890M

Haswell CPU model: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz

Perf version: 6.14.2

EDIT: Peter Cordes rightly questions whether this benchmark is misleading because it mostly measures process startup overhead. Even when I use perf_event_open to configure the counters and then use rdpmc myself to sample in process on a simple microbench, I see the same phenomenon.
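(For context, a rough sketch of that kind of setup: opening both counters with perf_event_open as an event group and reading them around a busy loop. This is an illustration of the general pattern, not the actual microbenchmark from the edit, and the rdpmc fast path is omitted; error handling is simplified.)

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_open(struct perf_event_attr *attr, int group_fd)
{
    /* glibc has no wrapper for perf_event_open; use the raw syscall.
     * pid=0: this thread; cpu=-1: any CPU. */
    return syscall(SYS_perf_event_open, attr, 0, -1, group_fd, 0);
}

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = (group_fd == -1);   /* only the group leader starts disabled */
    attr.exclude_kernel = 1;
    return (int)perf_open(&attr, group_fd);
}

int measure(void)
{
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    if (cyc < 0) { perror("perf_event_open"); return 0; }  /* e.g. perf_event_paranoid */
    int ref = open_counter(PERF_COUNT_HW_REF_CPU_CYCLES, cyc);

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);   /* enabling the leader schedules the group */

    volatile uint64_t sink = 0;             /* stand-in for the microbench */
    for (uint64_t i = 0; i < 10 * 1000 * 1000; i++) sink += i;

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, refcycles = 0;
    if (read(cyc, &cycles, sizeof cycles) != sizeof cycles) cycles = 0;
    if (ref >= 0 && read(ref, &refcycles, sizeof refcycles) != sizeof refcycles)
        refcycles = 0;
    printf("cycles=%llu ref-cycles=%llu\n",
           (unsigned long long)cycles, (unsigned long long)refcycles);
    return 0;
}
```

(Counting both events in one group like this also removes any run-to-run skew between two separate perf invocations.)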

1 Answer

It's common for tasks to take a somewhat-constant number of core clock cycles, especially if they aren't too sensitive to DRAM latency / bandwidth. By measuring /bin/true, you're timing process-startup overhead, so a lot of the time is spent in the kernel making system calls, and we'd expect a fair amount of variance compared to something that spends a good fraction of a second in user-space hitting in L1d cache, like awk 'BEGIN{for(i=0;i<10000000;i++){}}' (0.4 secs on my 3.9GHz Skylake).

A core clock cycles event like ..._CPU_CYCLES factors out variations in core clock speed from idle to max boost.

PERF_COUNT_HW_REF_CPU_CYCLES indeed is like rdtsc reference cycles (except that it only counts when the core isn't halted), so it's basically equivalent to wall-clock time. With the CPU at idle frequency, the count for a given task will be 4x to 8x higher than with the core at max boost.

You have it exactly backwards: you were expecting REF_CPU_CYCLES to "get rid of" CPU frequency variations, which would be true if you were timing the number of cycles it takes for a disk read or for a 100 ms timer, but you're timing work running on a CPU at variable frequency.
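(To make the "backwards" part concrete, here's a sketch of the scaling; the frequencies below are illustrative, not measured. A task that takes a fixed number of core cycles gets reported in ref-cycles scaled by tsc_freq / core_freq, so a lower core clock means more reference cycles for the same work:)

```c
#include <math.h>

/* A task taking a fixed number of core clock cycles is reported in
 * ref-cycles scaled by the frequency ratio: the same work at a lower
 * core clock takes more wall time, hence more reference ticks. */
double ref_cycles_for_work(double core_cycles, double core_hz, double tsc_hz)
{
    double seconds = core_cycles / core_hz;   /* wall time the work takes */
    return seconds * tsc_hz;                  /* reference ticks in that time */
}
```

So any jitter in core frequency shows up directly in the ref-cycles count, while the core-cycles count for the same work stays put.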


Zen 5 is new enough to have its frequency managed by the hardware itself (hardware P-state), so it can change frequency more often. And AMD CPUs don't just have a fixed max boost that they always run at given thermal headroom; they tend to vary their frequency a lot to stay right at the limit of what thermals / power allows. (Up to some fixed max frequency if thermal/power limits aren't a factor, though.) https://en.wikipedia.org/wiki/AMD_Turbo_Core#Features mentions that frequency can change in 25MHz increments on Ryzen CPUs.

Your CPU is a laptop CPU with 4 high-performance cores (full Zen 5 with a max boost clock of 5.1 GHz) and 8 efficiency cores (Zen 5c with a max boost clock of 3.3 GHz). If Linux is running some of your processes on E-cores, that introduces another wrinkle. The reference clock probably still ticks at the same speed across all cores (so it's easy to use for gettimeofday), but at max boost the same work will take many more reference cycles. You could use taskset -c 1 perf stat -r 1000 whatever to always run on core 1, with -r 1000 running the workload 1000 times and letting perf do its own statistics. Also, you could count both events in the same perf run: use -e multiple times, or give -e a comma-separated list of events, like -e task-clock,instructions,cycles.

Also, being a laptop, it's possible you might run into thermal or power limits at some point.


Your Xeon v3 Haswell CPU has P-state managed by the OS, although turbo above the highest P-state is still managed by the hardware itself. (Hardware P-state management was new in Skylake). It's a server, so if other cores are idle then it has plenty of thermal headroom to always run at max turbo.

Or if your Haswell isn't idle, the first run might not have as much wakeup / warmup work to do.


100 iterations of /bin/true from a shell loop is something I'd expect to have significant variations, especially in the first couple of runs, depending on the previous state of the OS's data structures, although the shell's own forking and the startup of perf itself could provide some warmup before we get to the workload being measured, /bin/true.


6 Comments

Ah, I think I simplified too much in trying to make a reproducible question. Even when I use perf_event_open to configure the counters and then use rdpmc myself to sample in-process on a simple microbench, I see the same phenomenon.
@JosephGarvin: The first half of my answer applies there, too, even moreso with a simple microbench that runs at a constant number of core clock cycles, without all the noise of the workload being mostly process-startup. And the point about AMD CPUs often not running at a fixed frequency, but rather boosting as high as they can, which changes frequently. Or sometimes running on Zen5c cores.
I see, IIUC the main point here is that since REF ticks at a constant rate regardless of the actual CPU frequency, it introduces the kind of variation I was trying to get rid of: an operation that takes 10 real CPU cycles will be reported as more than 10 ref cycles when the core frequency is low (below the TSC rate) or fewer than 10 when it's high. I think the reason it looks inverted for Haswell vs Zen 5 might be that Haswell uses base freq for ref but Zen 5 uses peak freq?
@JosephGarvin: Oh, I hadn't noticed that you had more variance in core clocks on Haswell. Maybe Haswell is more bottlenecked on memory latency or bandwidth, so things are more often a constant number of nanoseconds instead of a constant number of core clocks. Zen 5 has large fast L3 caches and big L2; Haswell Xeon probably has higher L3 latency, and probably much higher DRAM latency. And much lower single-core DRAM bandwidth, being a many-core Intel (16c per socket) with a large ring bus. And if that's a multi-socket system (otherwise why use an E5), there's inter-socket effects.
@JosephGarvin: But yes, it's also possible that stddev as a fraction of the total is a similar ratio, but you're only reporting absolute stddev. Your Haswell is old enough for me to assume it will run its TSC at its base frequency, 2.30 GHz. Your Zen 5's TSC frequency could well be much higher than its base clock of 2 GHz. Perhaps 5.1 GHz, or maybe 3.3 GHz (max boost for the Zen 5c cores, so they don't have to handle anything with higher frequency than that). Your kernel boot logs should say something like Refined TSC clocksource calibration: 4008.057 MHz (on my i7-6700k 4.0 / 4.2 GHz)
