I wrote a program to get the cache & cache line size for my computer, but I got the result that I can't explain, could anyone help me to explain it for me?
Here is my program, the access_array() traverses through array with different step size, and I measure the execution time for these step size.
// Program to calculate L1 cache line size, compile in g++ -O1
#include <iostream>
#include <string>
#include <sys/time.h>
#include <cstdlib>
using namespace std;
#define ARRAY_SIZE (256 * 1024) // arbitary array size, must in 2^N to let module work
void access_array(char* arr, int steps)
{
const int loop_cnt = 1024 * 1024 * 32; // arbitary loop count
int idx = 0;
for (int i = 0; i < loop_cnt; i++)
{
arr[idx] += 10;
idx = (idx + steps) & (ARRAY_SIZE - 1); // if use %, the latency will be too high to see the gap
}
}
int main(int argc, char** argv){
double cpu_us_used;
struct timeval start, end;
for(int step = 1 ; step <= ARRAY_SIZE ; step *= 2){
char* arr = new char[ARRAY_SIZE];
for(int i = 0 ; i < ARRAY_SIZE ; i++){
arr[i] = 0;
}
gettimeofday(&start, NULL); // get start clock
access_array(arr, step);
gettimeofday(&end, NULL); // get end clock
cpu_us_used = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
cout << step << " , " << cpu_us_used << endl;
delete[] arr;
}
return 0;
}
Result
My question is:
From 64 to 512, I can't explain why the execution time is almost the same, any why there is a linear growth from 1K to 4K?
Here are my assumptions.
For step = 1, every 64 iterations cause 1 cache line miss. And after 32K iterations, the L1 Cache is full, so we have L1 collision & capacity miss every 64 iterations.
For step = 64, every 1 iteration cause 1 cache line miss. And after 512 iterations, the L1 Cache is full, so we have L1 collision & capacity miss every 1 iterations.
As a result, there is a gap between step = 32 and 64.
By observing the first gap, I can conclude that the L1 cache line size is 64 bytes.
For step = 512, every 1 iteration cause 1 cache line miss. And after 64 iterations, the Set 0,8,16,24,32,40,48,56 of L1 Cache is full, so we have L1 collision miss every 1 iterations.
For step = 4K, every 1 iteration cause 1 cache line miss. And after 8 iterations, the set 0 of L1 Cache is full, so we have L1 collision miss every 1 iterations.
For 128 to 4K cases, they all happened L1 collision miss, and the difference is that with the more steps, we start the collision miss earlier.
The only idea that I can come up with is that there are other mechanism (maybe page, TLB, etc.) to impact the execution time.
Here is the cache size & CPU info of my workstation. By the way, I have run this program on my PC either, and I got the similar results.
Platform : Intel Xeon(R) CPU E5-2667 0 @ 2.90GHz
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 8
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 15728640
LEVEL3_CACHE_ASSOC 20
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0