It's common for tasks to take a roughly constant number of core clock cycles, especially if they aren't too sensitive to DRAM latency / bandwidth. You're timing process startup overhead by measuring /bin/true, so a lot of the time is spent in the kernel handling system calls, and we'd expect a fair amount of variance compared to something that spends a good fraction of a second in user-space hitting in L1d cache, like awk 'BEGIN{for(i=0;i<10000000;i++){}}' (0.4 secs on my 3.9GHz Skylake).
A core clock cycles event like ..._CPU_CYCLES factors out variations in core clock speed from idle to max boost.
PERF_COUNT_HW_REF_CPU_CYCLES is indeed like rdtsc reference cycles (except that it only counts when the core isn't halted), so it's basically equivalent to wall-clock time. With the CPU at idle frequency, the count for the same work will be 4x to 8x higher than with the core at max boost.
You have it exactly backwards: you were expecting REF_CPU_CYCLES to "get rid of" CPU frequency variations, which would be true if you were timing the number of cycles it takes for a disk read or for a 100 ms timer, but you're timing work running on a CPU at variable frequency.
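To make the arithmetic concrete, here is a toy model of a CPU-bound task that needs a fixed number of core clock cycles. The frequencies and the cycle count are illustrative assumptions, not measurements from either machine, and `counts` is a hypothetical helper. It only shows that the core-cycle count stays constant while the reference-cycle count scales with wall-clock time.

```python
# Toy model: a CPU-bound task that needs a fixed number of core clock cycles.
# All numbers below are illustrative assumptions, not measurements.

def counts(core_cycles_needed, core_hz, ref_hz):
    """Return (core_cycles, ref_cycles, seconds) for one run at a fixed core frequency."""
    seconds = core_cycles_needed / core_hz   # wall-clock time depends on core frequency
    ref_cycles = seconds * ref_hz            # reference cycles track wall-clock time
    return core_cycles_needed, ref_cycles, seconds

WORK = 1_560_000_000   # core cycles of work (assumed)
REF_HZ = 3.0e9         # fixed reference clock frequency (assumed)

for label, hz in [("idle", 0.8e9), ("boost", 3.9e9)]:
    core, ref, secs = counts(WORK, hz, REF_HZ)
    print(f"{label:>5}: core cycles={core:.3e}  ref cycles={ref:.3e}  time={secs:.3f}s")
```

In this model the ratio of reference cycles between the two runs is just the ratio of the two core frequencies, which is why a core that wanders between idle and boost clocks gives a noisy reference-cycle count for the same work.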
Zen 5 is new enough to have its frequency managed by the hardware itself (hardware P-state), so it can change frequency more often. And AMD CPUs don't just have a fixed max boost that they always run at given thermal headroom; they tend to vary their frequency a lot to stay right at the limit of what thermals / power allows. (Up to some fixed max frequency if thermal/power limits aren't a factor, though.) https://en.wikipedia.org/wiki/AMD_Turbo_Core#Features mentions that frequency can change in 25MHz increments on Ryzen CPUs.
Your CPU is a laptop CPU with 4 high-performance cores (full Zen 5 with a max boost clock of 5.1GHz) and 8 efficiency cores (Zen 5c with a max boost clock of 3.3GHz). If Linux is running some of your processes on E-cores, that introduces another wrinkle. The reference clock probably still ticks at the same speed across all cores (so it's easy to use for gettimeofday), but even at max boost the same work will take many more reference cycles on an E-core than on a P-core. You could use taskset -c 1 perf stat -r 1000 whatever to always run on core 1 (with -r 1000 running the workload 1000 times and perf doing its own statistics). Also, you could count both events in the same perf run: you can use -e multiple times, or pass a comma-separated list of events like -e task-clock,instructions,cycles,etc,etc.
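As a sketch of the pinned, repeated, multi-event measurement described above, the snippet below assembles one such command line. `build_perf_command` is a hypothetical helper, the default event list is my assumption (in particular, `ref-cycles` may not be supported on every CPU), and the snippet only prints the command rather than running it.

```python
import shlex

def build_perf_command(workload, core=1, repeats=1000,
                       events=("task-clock", "cycles", "ref-cycles", "instructions")):
    """Build an argv list: taskset pins to one core, perf stat repeats the run and counts events."""
    return [
        "taskset", "-c", str(core),    # always run on the chosen core
        "perf", "stat",
        "-r", str(repeats),            # repeat the workload; perf reports mean and variance
        "-e", ",".join(events),        # comma-separated list of events in one run
        "--", *workload,
    ]

cmd = build_perf_command(["/bin/true"])
print(shlex.join(cmd))
# To actually run it (needs Linux with perf and taskset installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Counting core cycles and reference cycles in the same run lets you compare them for exactly the same executions instead of across separate runs with different frequency behaviour.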
Also, being a laptop, it's possible you might run into thermal or power limits at some point.
Your Xeon v3 Haswell CPU has its P-state managed by the OS, although turbo above the highest P-state is still managed by the hardware itself. (Hardware P-state management was new in Skylake.) It's a server, so if other cores are idle then it has plenty of thermal headroom to always run at max turbo.
Or, if your Haswell isn't idle, the first run might not have as much wakeup / warmup work to do.
100 iterations of /bin/true from a shell loop is something I'd expect to have significant variations, especially in the first couple of runs, depending on the previous state of the OS's data structures. That said, the shell's own forking and the startup of perf itself could provide some warmup before we get to the workload being measured, /bin/true.