
I wrote a program to measure the cache and cache line size of my computer, but I got a result that I can't explain. Could anyone help me explain it?

Here is my program. access_array() traverses the array with different step sizes, and I measure the execution time for each step size.

// Program to estimate the L1 cache line size; compile with g++ -O1

#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;

#define ARRAY_SIZE (256 * 1024) // arbitrary array size; must be a power of 2 so the index mask works

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = (idx + steps) & (ARRAY_SIZE - 1); // using % instead adds too much latency to see the gap
    }
}

int main(int argc, char** argv){
    double cpu_us_used;
    struct timeval  start, end;

    for(int step = 1 ; step <= ARRAY_SIZE ; step *= 2){
        char* arr = new char[ARRAY_SIZE];
        for(int i = 0 ; i < ARRAY_SIZE ; i++){
            arr[i] = 0;
        }

        gettimeofday(&start, NULL); // get start clock

        access_array(arr, step);

        gettimeofday(&end, NULL);   // get end clock

        cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);

        cout << step << " , " << cpu_us_used << endl;

        delete[] arr;
    }
    return 0;
}

Result (plot of execution time vs. step size)

My question is:

From 64 to 512, I can't explain why the execution time is almost the same, and why there is a linear growth from 1K to 4K.

Here are my assumptions.

  • For step = 1, every 64 iterations cause one cache line miss. After 32K iterations the L1 cache is full, so we get an L1 collision & capacity miss every 64 iterations.

  • For step = 64, every iteration causes one cache line miss. After 512 iterations the L1 cache is full, so we get an L1 collision & capacity miss every iteration.

As a result, there is a gap between step = 32 and 64.

By observing the first gap, I can conclude that the L1 cache line size is 64 bytes.

  • For step = 512, every iteration causes one cache line miss. After 64 iterations, sets 0, 8, 16, 24, 32, 40, 48, and 56 of the L1 cache are full, so we get an L1 collision miss every iteration.

  • For step = 4K, every iteration causes one cache line miss. After 8 iterations, set 0 of the L1 cache is full, so we get an L1 collision miss every iteration.

In all of the 128 to 4K cases we get L1 collision misses; the difference is that with a larger step, the collision misses start earlier.

The only idea I can come up with is that there are other mechanisms (maybe paging, TLB, etc.) impacting the execution time.

Here is the cache size & CPU info of my workstation. By the way, I have run this program on my PC as well, and I got similar results.

Platform : Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz

LEVEL1_ICACHE_SIZE                 32768
LEVEL1_ICACHE_ASSOC                8
LEVEL1_ICACHE_LINESIZE             64
LEVEL1_DCACHE_SIZE                 32768
LEVEL1_DCACHE_ASSOC                8
LEVEL1_DCACHE_LINESIZE             64
LEVEL2_CACHE_SIZE                  262144
LEVEL2_CACHE_ASSOC                 8
LEVEL2_CACHE_LINESIZE              64
LEVEL3_CACHE_SIZE                  15728640
LEVEL3_CACHE_ASSOC                 20
LEVEL3_CACHE_LINESIZE              64
LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC                 0
LEVEL4_CACHE_LINESIZE              0
  • Was your Sandybridge Xeon in a multi-socket system? Snooping the other core can matter a lot vs. a single-core system. (I haven't looked at the details of this to see if it would be relevant). Commented Aug 4, 2019 at 4:43
  • I am not sure, could you explain in more detail? Commented Aug 6, 2019 at 13:26
  • could this be due to lazy page allocation? Commented Dec 29, 2022 at 23:25
  • just wondering did you try on other platforms Commented Dec 29, 2022 at 23:27

2 Answers


This CPU probably has:

  • a hardware cache line prefetcher, that detects linear access patterns within the same physical 4 KiB page and prefetches them before an access is made. This stops prefetching at 4 KiB boundaries (because the physical address is likely to be very different and unknown).

  • a hardware TLB prefetcher, that detects linear access patterns in TLB usage and prefetches TLB entries.

From 1 to 16 the cache line prefetcher is doing its job, fetching cache lines before you access them, so execution time remains the same (unaffected by cache misses).

At 32, the cache line prefetcher starts to struggle (due to the "stop at 4 KiB page boundary" thing).

From 64 to 512 the TLB prefetcher is doing its job, fetching TLB entries before you access them, so execution time remains the same (unaffected by TLB misses).

From 512 to 4096 the TLB prefetcher is failing to keep up. The CPU stalls waiting for TLB info once every "4096/step" accesses, and these stalls cause the "linear-ish" growth in execution time.

From 4096 to 131072, I'd assume that "new char[ARRAY_SIZE];" allocates so much space that the library and/or OS decided to give you 2 MiB and/or 1 GiB pages, eliminating some TLB misses and improving execution time as the number of pages being accessed decreases.

For "larger than 131072"; I'd assume you start to see the effects of "1 GiB page TLB miss".

Note that it's probably easier (and less error prone) to get the cache characteristics (size, associativity, how many logical CPUs are sharing it, ..) and cache line size from the CPUID instruction. The approach you're using is more suited to measuring cache latency (how long it takes to get data from one of the caches).

Also, to reduce TLB interference the OS might allow you to explicitly ask for 1 GiB pages (e.g. mmap(..., MAP_POPULATE | MAP_HUGE_1GB, ...) on Linux); and you can "pre-warm" the TLB by doing a "touch then CLFLUSH" warm-up loop before you start measuring. The hardware cache line prefetcher can be disabled via a flag in an MSR (if you have permission), or defeated by using a random (unpredictable) access pattern.


2 Comments

Thanks! do you have any suggestion about how can I verify your assumption?
@TzuYangChien: To verify assumptions, selectively enable/disable causes of interference and check if the results change in ways that the assumptions predict. For a normal program running under a normal OS, there aren't many ways to avoid/disable causes of interference (and I mentioned all of them I could think of in the last paragraph). For CPU specific code running on bare metal (e.g. booted without any OS) there's a lot more flexibility (e.g. being able to disable stuff using the CPU's MSRs/model specific registers)

Finally I found the answer.

I tried setting the array size to 16 KiB and smaller, but it was still slow at step = 4 KiB.

On the other hand, I tried changing the step offset from multiplying by 2 each iteration to adding 1 each iteration, and it still slowed down at step = 4 KiB.

Code

#define ARRAY_SIZE (4200) 

void access_array(char* arr, int steps)
{
    const int loop_cnt = 1024 * 1024 * 32; // arbitrary loop count
    int idx = 0;
    for (int i = 0; i < loop_cnt; i++)
    {
        arr[idx] += 10;
        idx = idx + steps;
        if(idx >= ARRAY_SIZE)
            idx = 0;
    }

}

for(int step = 4090 ; step <= 4100 ; step ++){
    char* arr = new char[ARRAY_SIZE];
    for(int i = 0 ; i < ARRAY_SIZE ; i++){
        arr[i] = 0;
    }

    gettimeofday(&start, NULL); // get start clock

    access_array(arr, step);

    gettimeofday(&end, NULL);   // get end clock

    cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);

    cout << step << " , " << cpu_us_used << endl;

    delete[] arr;
}

Result

4090 , 48385
4091 , 48497
4092 , 48136
4093 , 48520
4094 , 48090
4095 , 48278
4096 , 51818   <-- slower
4097 , 48196
4098 , 48600
4099 , 48185
4100 , 63149

As a result, I suspected it's not related to any cache / TLB / prefetch mechanism.

With more and more googling about the relationship between performance and the magic number "4K", I found the 4K aliasing problem on Intel platforms, which slows down loads.

This occurs when a load is issued after a store and their memory addresses are offset by 4K. When this is processed in the pipeline, the issue of the load matches the previous store (the full address is not used at this point), so the pipeline tries to forward the result of the store and avoid doing the load (this is store forwarding). Later, when the address of the load is fully resolved, it does not match the store, so the load has to be re-issued from a later point in the pipe. This has a 5-cycle penalty in the normal case, but can be worse in certain situations, such as unaligned loads that span two cache lines.

1 Comment

4k or other large powers of 2 are also bad because all your data will alias into one set of L1d cache, so you'll get conflict misses. But even 4095 and 4097 would still be almost as bad (moving to a new set only every 64 iterations). So I think you're right, by far the biggest problem is 4k aliasing breaking memory disambiguation. With arr[idx] += 10; you will get a store and then a load from a 4k-aliased address which will have a false dependency. (What should be an independent load instead waits for the store for some amount of time.)
