
I want to try optimizing this line of code:

for i in 0..len {
    slots[values[i].slot].count += 1;
}

(both of these lists are extremely long and the same size)

I have already optimized it like this (2.5x faster), prefetching so the access hits cache:

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

const PREFETCH_DISTANCE: usize = 32;
for i in (0..len).rev() {
    if i >= PREFETCH_DISTANCE {
        let pfs = values[i - PREFETCH_DISTANCE].slot;
        let fetch = &slots[pfs] as *const Slot as *const i8;
        unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
    }
    slots[values[i].slot].count += 1;
}

However, it is still not as fast as I was expecting. For example, this code is more than twice as fast for the same list length, just because it doesn't index anything:

for val in &*arr {
    let ptr = val as *const T;
    let val = unsafe { ptr.read() };
    let dif = integrated_expected_distribution(&val) - start_val;
    debug_assert!(dif >= 0.);
    let slot = (dif * end_mult) as usize;
    debug_assert!(slot < len);
    values.push(Slottable { value: val, slot });
}

Am I missing something else that I need to preload for that indexing? Or am I missing something that causes performance issues?

Here is a reproducible example:

    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    use std::time::Instant;
    use rand::{Rng, SeedableRng};

    let amount = 100_000_000;
    let highest_possible_num = 1_000_000_000;
    let seed = 44;

    let mut rng = rand::rngs::StdRng::seed_from_u64(seed);

    let mut vec2: Vec<usize> = (0..amount)
        .map(|_| rng.gen_range(0..=highest_possible_num))
        .collect();
    let vec1: Vec<usize> = (0..amount).map(|_| rng.gen_range(0..amount)).collect();
    let len = amount;
    const PREFETCH_DISTANCE: usize = 32;
    let loop_ = Instant::now();
    for i in (0..len).rev() {
        if i >= PREFETCH_DISTANCE {
            let pfs = vec1[i - PREFETCH_DISTANCE];
            let fetch = &vec2[pfs] as *const usize as *const i8;
            unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
        }
        vec2[vec1[i]] += 1;
    }
    let loop_duration = loop_.elapsed();
    println!("Loop time: {:?}", loop_duration);

The removed count and slot are each just part of an 8-byte data structure, so the change shouldn't have any effect on the example.

Edit: I think instruction-level parallelism isn't possible on this loop, so it becomes slower.

Edit2: I've tested some more and Edit1 was wrong. This code is way faster than my prefetched code:

    let mut test = 0;
    for i in (0..len).rev() {
        test += values[i].slot;
    }
    println!("test: {}", test);
  • Please provide a Minimal Reproducible Example. For example, does LLVM know the lengths are equal and are len? Commented Jul 28, 2024 at 21:05
  • Also, are there any patterns in your data? Commented Jul 28, 2024 at 21:06
    Prefetching does not improve it 2x for me, only marginally (but it does improve). I tried to unroll the loop but no avail. Perhaps with scatter and gather instructions you can do better, but that requires AVX-512. Commented Jul 30, 2024 at 11:07
    Changing prefetch distance makes it (slightly) worse, 32 is the best value. Overall prefetching gives a ~60ms advantage on my machine (I doubled the number of items 10x, total time ~750ms with prefetching). Commented Jul 30, 2024 at 17:39
    So it does seem to be related to the CPU, I'm running laptop i9-13900HX, a bit more powerful than yours ;P Commented Jul 31, 2024 at 15:29
