First of all, as @user555045 commented, very short intervals where this overhead would matter are very hard to microbenchmark. If this overhead is a problem, your measurements probably won't mean much even if you do avoid it, unless you know exactly what you're doing and what you're trying to learn about the CPU you're running this on. It's not usually a useful way to answer a "which is faster" question.
If you can use just the low 32 bits of _rdtsc (e.g. for measuring intervals much less than 2^32 reference cycles), not using the high half of the result at all does let the shift/or optimize away. If end - start fits in 32 bits, it doesn't change the result if end and start are truncated to 32 bits first. (Carry/borrow only propagates from low to high.)
Use wrapping subtraction on the u32 values, end.wrapping_sub(start), since the full TSC could have crossed a 32-bit boundary between start and end. (There's a usage sketch after the tick_low example below.)
It seems _rdtsc() uses a fixed recipe to produce a single u64 which LLVM isn't able to optimize, other than dropping it entirely if the result is unused. It's as if the shift/OR were a non-volatile asm statement which the optimizer doesn't understand except as a black box whose result is either used or not. The LLVM-IR for it is %x = tail call noundef i64 @llvm.x86.rdtsc(), producing a single 64-bit integer return value.
use std::arch::x86_64::_rdtsc;
pub fn tick_low() -> u32 {
let x = unsafe { _rdtsc() };
x as u32
}
(Godbolt, for this and the later examples)
# rustc 1.89 -O
example::tick_low:
rdtsc
ret
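As a usage sketch of the wrapping-sub advice above (the function name short_interval and the closure parameter are just for illustration, and this ignores the fencing discussed further down):

// Time a short piece of work with the truncated counter. Wrapping subtraction
// gives the right answer even if the full TSC crossed a 32-bit boundary,
// as long as the interval itself fits in 32 bits.
pub fn short_interval(work: impl FnOnce()) -> u32 {
    let start = tick_low();
    work();
    let end = tick_low();
    end.wrapping_sub(start)
}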
If you ask for it zero-extended to 64-bit with x as u32 as u64 or x & 0xFFFFFFFF, the compiler unfortunately adds a mov eax, eax instruction to zero-extend EAX into RAX, even though it's already zero-extended because rdtsc writes EAX.
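For reference, that zero-extended variant looks like this (the name tick_low_u64 is just for illustration):

use std::arch::x86_64::_rdtsc;

pub fn tick_low_u64() -> u64 {
    let x = unsafe { _rdtsc() };
    x as u32 as u64 // or x & 0xFFFFFFFF; either way, expect the extra mov eax, eax
}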
Using u32 halves instead of u64 in your tick_count function can avoid the mov eax, edx you got, but otherwise is the same mess:
pub fn tick_count() -> (u32, u32) {
let x = unsafe { _rdtsc() };
(x as u32, (x >> 32) as u32) // low 32 in EAX, high 32 in EDX
}
example::tick_count:
rdtsc
shl rdx, 32
or rdx, rax
# mov eax, edx for the low half is *not* present, unlike with u64
shr rdx, 32
ret
There doesn't seem to be much hope for doing anything with the high half without having LLVM pack and unpack it, instead of just emitting rdtsc; mov eax, edx:
pub fn tick_high() -> u32 {
let x = unsafe { _rdtsc() };
(x>>32) as u32
}
example::tick_high:
rdtsc
shl rdx, 32
or rax, rdx
shr rax, 32
ret
LLVM has no problem with this optimization normally, so it's definitely something about the canned sequence of operations that the optimizer doesn't see into at the stages of compilation where this kind of optimization gets looked for.
#[inline(never)]
pub fn combine_and_unpack(lo: u32, hi: u32) -> u64 {
    let combined = lo as u64 | ((hi as u64) << 32);
    combined >> 32 // LLVM can do this optimization normally
}
example::combine_and_unpack:
mov eax, esi
ret
Microbenchmarking with rdtsc
You at least need lfence to block instruction reordering of the work wrt. rdtsc (or use rdtscp at the end). And what you get is basically the time for the instructions to retire, which is neither throughput nor latency, the two things that are normally useful to measure about a tiny fragment of code.
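A minimal sketch of that fencing pattern, assuming x86-64, with a caller-supplied closure standing in for the code under test (the names time_once and work_under_test are made up for illustration):

use std::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};

// lfence makes earlier instructions finish before the start timestamp;
// rdtscp waits for the work before reading the stop timestamp; the final
// lfence keeps later instructions from starting before rdtscp has finished.
pub fn time_once(work_under_test: impl FnOnce()) -> u64 {
    unsafe {
        _mm_lfence();
        let start = _rdtsc();
        work_under_test();
        let mut aux = 0u32;
        let stop = __rdtscp(&mut aux);
        _mm_lfence();
        stop.wrapping_sub(start)
    }
}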
Usually much better to put something in a repeat loop either with or without a data dependency from previous output to next input, and time across many iterations. (Or count core clock cycles with perf stat, especially for code which doesn't do I/O or spend a lot of its time stalled waiting for memory, so usually runs in the same number of core clocks regardless of clock speed. TSC reference cycles are wall-clock time on CPUs from the past couple decades, not core clocks.)
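A rough sketch of the repeat-loop approach (names are illustrative; the dependency chain measures latency, and dropping it would measure throughput instead):

use std::hint::black_box;
use std::time::Instant;

pub fn bench_latency(iters: u64) -> f64 {
    let mut x: u64 = black_box(1);
    let t0 = Instant::now();
    for _ in 0..iters {
        // operation under test; its result feeds the next iteration (latency chain)
        x = black_box(x.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    }
    let secs = t0.elapsed().as_secs_f64();
    black_box(x);
    secs / iters as f64 // seconds per iteration
}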
See also
(I'm not sure whether the shift/OR packing is inherent to llvm.x86.rdtsc itself, which only returns a single i64; you can look at the LLVM source if you want.) If true, there's no way to avoid it without inline assembly.
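If that turns out to be the case, a minimal inline-asm sketch (untested) that asks for the two halves in separate registers, so there's nothing for LLVM to pack or unpack, might look like:

use std::arch::asm;

pub fn rdtsc_halves() -> (u32, u32) {
    let lo: u32;
    let hi: u32;
    unsafe {
        // rdtsc writes EDX:EAX; request them as two separate outputs
        asm!("rdtsc", out("eax") lo, out("edx") hi, options(nomem, nostack));
    }
    (lo, hi)
}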