Consider the following code:
#include <fcntl.h>   /* open, O_DIRECT, O_RDONLY; O_DIRECT needs _GNU_SOURCE (see compile command below) */
#include <stdlib.h>  /* malloc, free */
#include <string.h>  /* memcpy */
#include <unistd.h>  /* read, close */

int main(int argc, char** argv) {
    int buf_size = 1024*1024*1024;            /* 1 GiB */
    char* buffer = malloc(buf_size);
    char* buffer2 = malloc(buf_size);
    for (int i = 0; i < 10; i++) {
        int fd = open(argv[1], O_DIRECT | O_RDONLY);
        read(fd, buffer, buf_size);           /* return value not checked; see note below */
        memcpy(buffer2, buffer, buf_size);
        close(fd);
    }
    free(buffer);
    free(buffer2);
    return 0;
}
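As an aside, a single read() call is not guaranteed to transfer the full buf_size in one go; a minimal, purely illustrative sketch of a loop that would consume the whole file on every iteration (not the code that was measured below) would look like this:

    ssize_t total = 0;
    while (total < buf_size) {
        ssize_t n = read(fd, buffer + total, buf_size - total);
        if (n <= 0)          /* 0 = EOF, -1 = error */
            break;
        total += n;
    }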
I get the following result using perf stat when I run the program on a 1 GiB input file:
# perf stat -B -e l2_request_g1.all_no_prefetch:k,l2_request_g1.l2_hw_pf:k,cache-references:k ./main sample.txt
Performance counter stats for './main sample.txt':
651,263,793 l2_request_g1.all_no_prefetch:k
600,476,712 l2_request_g1.l2_hw_pf:k
1,251,740,542 cache-references:k
When I comment out the read(fd, buffer, buf_size) call, I get the following:
36,037,824 l2_request_g1.all_no_prefetch:k
33,416,410 l2_request_g1.l2_hw_pf:k
69,454,244 cache-references:k
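For scale, that baseline corresponds to roughly 64 B x 36,037,824 / (1024^3 x 10) ≈ 0.21 GiB of kernel-mode L2 requests per iteration.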
Looking at the cache line size, I get the following (the same for index 0-3):
# cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64
Transparent HugePage Support (THP) is enabled:
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
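Whether the process actually gets huge pages can be confirmed while it runs from the AnonHugePages field of its smaps (the <pid> below is a placeholder):

# grep AnonHugePages /proc/<pid>/smaps_rollup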
I've checked that huge pages are actually allocated at runtime. Since all three events carry the :k modifier, only kernel-mode accesses are counted, so the user-space memcpy shouldn't show up in these numbers. Putting hardware-prefetch accesses aside, it therefore seems that read is responsible for more than 3 GiB of cache references per iteration:
64 B x (651,263,793 - 36,037,824) / (1024^3 x 10) = 3.6 GiB
I'm now wondering how reading a 1 GiB file can generate 3.6 GiB of memory traffic.
[Update] More Info About the System:
This is running on a dual-socket server with AMD EPYC 7H12 64-core processors. The Linux kernel version is 6.8.0-41, and the distribution is Ubuntu 24.04.1 LTS. I compile the code using the following command:
# gcc -D_GNU_SOURCE main.c -o main
The filesystem is ZFS:
# df -Th
Filesystem Type Size Used Avail Use% Mounted on
...
home zfs x.yT xyzG x.yT xy% /home
When I remove O_DIRECT, I get the following results (which are not significantly different from when it's included):
650,395,869 l2_request_g1.all_no_prefetch:k
599,548,912 l2_request_g1.l2_hw_pf:k
1,249,944,793 cache-references:k
Finally, if I replace malloc with valloc, I get the following results (again, not much different from the original values):
651,092,248 l2_request_g1.all_no_prefetch:k
558,542,553 l2_request_g1.l2_hw_pf:k
1,209,634,821 cache-references:k
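Note that valloc is obsolete; the non-obsolete way to get a page-aligned buffer would be posix_memalign. A minimal sketch (illustrative only, assuming <stdlib.h> is included as in the listing above):

    char *buffer = NULL;
    /* 4096-byte alignment; O_DIRECT typically also wants the length and file
       offset aligned to the device's logical block size, not just the buffer. */
    if (posix_memalign((void **)&buffer, 4096, buf_size) != 0)
        return 1;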
Comments:

O_DIRECT? What about ensuring you're reading into a page-aligned buffer? malloc() is not guaranteed to provide a page-aligned buffer and you should be able to simply replace it with valloc(), which is obsolete but still in glibc.

copy_from_user into a user-space buffer, especially if the copy doesn't use NT stores, so it has to pull the destination cache lines into L2. So I suspect @AndrewHenle is onto something. XFS should support it. FSes that support compression (like BTRFS) probably also couldn't do it for compressed files, and ZFS / BTRFS checksum data, which they have to verify at some point...

copy_from_user creates 2 GiB of accesses (one for reading and one for writing). Copying to ARC creates another 2 GiB of accesses (again reading and writing). I don't think perf counts DMA accesses. They don't seem to go through the last-level cache (even if they do, they are initiated by the device, not the CPU).
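If that accounting is right, the numbers are at least in the right ballpark: each 1 GiB copy touches roughly 2 GiB (1 GiB read plus 1 GiB written), so two copies per iteration (one into the ARC, one into the user buffer) would amount to about 4 GiB, compared with the ~3.6 GiB per iteration computed above.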