
Consider the following code:

#include <fcntl.h>     /* open, O_DIRECT, O_RDONLY (O_DIRECT needs _GNU_SOURCE; see the compile command below) */
#include <stdlib.h>    /* malloc, free */
#include <string.h>    /* memcpy */
#include <unistd.h>    /* read */

int main(int argc, char** argv) {
  int buf_size = 1024*1024*1024;
  char* buffer = malloc(buf_size);
  char* buffer2 = malloc(buf_size);
  for (int i = 0; i < 10; i++){
    int fd = open(argv[1], O_DIRECT | O_RDONLY);
    read(fd, buffer, buf_size);
    memcpy(buffer2, buffer, buf_size);
  }
  free(buffer);
  free(buffer2);
  return 0;
}

I get the following result using perf stat when I run the program on a 1 GiB input file:

# perf stat -B -e l2_request_g1.all_no_prefetch:k,l2_request_g1.l2_hw_pf:k,cache-references:k ./main sample.txt 

 Performance counter stats for './main sample.txt':

       651,263,793      l2_request_g1.all_no_prefetch:k                                       
       600,476,712      l2_request_g1.l2_hw_pf:k                                              
     1,251,740,542      cache-references:k                                                    

When I comment out read(fd, buffer, buf_size);, I get the following:

        36,037,824      l2_request_g1.all_no_prefetch:k                                       
        33,416,410      l2_request_g1.l2_hw_pf:k                                              
        69,454,244      cache-references:k                                                    

Looking at the cache line size, I get the following (the same for index 0-3):

# cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size
64

Transparent HugePage Support (THP) is enabled:

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

I've checked that huge pages are allocated at runtime. Putting hardware prefetch accesses aside, it seems to me that read is responsible for more than 3 GiB of cache references:

64 x (651,263,793 - 36,037,824) / (1024^3 x 10) = 3.6 GiB

I'm now wondering how reading a 1 GiB file generates 3.6 GiB of memory traffic.

[Update] More Info About the System:

This is running on a dual-socket server with AMD EPYC 7H12 64-core processors. The Linux kernel version is 6.8.0-41, and the distribution is Ubuntu 24.04.1 LTS. I compile the code using the following command:

# gcc -D_GNU_SOURCE main.c -o main

The filesystem is ZFS:

# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
...
home           zfs       x.yT  xyzG  x.yT  xy% /home

When I remove O_DIRECT, I get the following results (which are not significantly different from when it's included):

       650,395,869      l2_request_g1.all_no_prefetch:k                                       
       599,548,912      l2_request_g1.l2_hw_pf:k                                              
     1,249,944,793      cache-references:k 

Finally, if I replace malloc with valloc, I get the following results (again, not much different from the original values):

       651,092,248      l2_request_g1.all_no_prefetch:k                                       
       558,542,553      l2_request_g1.l2_hw_pf:k                                              
     1,209,634,821      cache-references:k
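
For reference, here is a minimal sketch of that valloc variant; the question doesn't show the exact change, so the details are an assumption:

#define _GNU_SOURCE   /* matches the -D_GNU_SOURCE compile flag; valloc is a glibc extension */
#include <stdlib.h>

int main(void) {
  size_t buf_size = 1024UL * 1024 * 1024;
  /* valloc returns page-aligned memory, unlike glibc malloc, which for an
   * allocation this large returns a pointer 16 bytes into a fresh mmap'd region */
  char* buffer = valloc(buf_size);
  char* buffer2 = valloc(buf_size);
  if (!buffer || !buffer2) return 1;
  free(buffer);
  free(buffer2);
  return 0;
}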
  • What CPU is this on? O_DIRECT is forcing DMA directly into user-space pages, if I understand correctly, rather than copying from the pagecache (after populating it if necessary). IDK if the kernel would still zero out those pages on first access, although you're rewriting the same buffer 10 times so that amortizes the cost. Transparent-hugepage defrag runs in a different kernel thread so counts for that shouldn't be part of this even if it's happening. Commented Sep 25, 2024 at 5:34
  • AMD EPYC 7H12 64-Core Processor Commented Sep 25, 2024 at 5:38
  • IME support for direct IO on Linux is spotty (assuming you're using a Linux system...), and it usually requires page-aligned buffers. And some file systems just ignore it. Can you specify more details, such as specific Linux version and what file system you're reading from? What's your result if you remove O_DIRECT? What about ensuring you're reading into a page-aligned buffer? malloc() is not guaranteed to provide a page-aligned buffer and you should be able to simply replace it with valloc(), which is obsolete but still in glibc. Commented Sep 25, 2024 at 9:42
  • Does ZFS actually support O_DIRECT? phoronix.com/news/OpenZFS-Direct-IO says OpenZFS only merged support for it 5 days ago. Your ~4x numbers would I think make sense for DMA into the pagecache plus copy_from_user into a user-space buffer, especially if the copy doesn't use NT stores so it has to pull the destination cache lines into L2. So I suspect @AndrewHenle is onto something. XFS should support it. FSes that support compression (like BTRFS) probably also couldn't do it for compressed files, and ZFS / BTRFS checksum data, which they have to verify at some point... Commented Sep 25, 2024 at 20:32
  • OK, seems like we have an answer here. copy_from_user creates 2 GiB of accesses (one for reading and one for writing). Copying to ARC creates another 2 GiB of accesses (again reading and writing). I don't think perf counts DMA accesses. They don't seem to go through the last level cache (even if they do, they are initiated by the device, not the CPU). Commented Sep 25, 2024 at 21:46

1 Answer


You're using ZFS, but your kernel's ZFS module almost certainly doesn't support O_DIRECT. https://www.phoronix.com/news/OpenZFS-Direct-IO says OpenZFS only merged support for it into mainline five days before this question was asked (a patch series dating back to 2020), so unless distro builds picked up that work early, O_DIRECT is probably being silently ignored.

That probably explains your result of L2 traffic about 4x the size of your read: two copies (ZFS's ARC to the pagecache, and the pagecache to user space), each reading and writing the whole data. Alternatively, a single copy_to_user that doesn't avoid MESI RFO (Read For Ownership) has to read the destination lines into cache before updating them with the newly stored values, so its total traffic is 3x the copy size; the extra 0.6 of a copy could come from the initial copy into the pagecache, plus other L2 traffic while your program runs.
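
As a rough sanity check, here is a back-of-the-envelope version of that accounting; the copy counts and RFO behaviour are assumptions based on the discussion above, not measurements:

#include <stdio.h>

int main(void) {
  const double GiB = 1024.0 * 1024 * 1024;
  const double line_size = 64.0;      /* coherency_line_size reported in the question */
  const double read_gib = 1.0;        /* one 1 GiB read() per loop iteration */

  /* A single copy_to_user with ordinary stores: read the source, RFO-read the
   * destination, then write the destination back => ~3x the copy size. */
  double one_copy_with_rfo = 3.0 * read_gib;

  /* Two full copies (ARC -> pagecache, pagecache -> user space), each reading
   * the source and writing the destination => ~4x the copy size. */
  double two_copies = 4.0 * read_gib;

  printf("one copy + RFO: %.1f GiB (%.0f cache lines)\n",
         one_copy_with_rfo, one_copy_with_rfo * GiB / line_size);
  printf("two copies:     %.1f GiB (%.0f cache lines)\n",
         two_copies, two_copies * GiB / line_size);
  /* The measured 3.6 GiB per iteration falls between these two estimates. */
  return 0;
}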

There are potentially also extra reads for ZFS to verify checksums of the data (not just metadata). Hopefully that work is cache-blocked enough to get L1d or at least L2 hits, but I don't know. That verification only has to happen after reading from the actual disk, though, and with O_DIRECT being ignored the data probably just stays hot in the pagecache and/or the ARC. I also don't know whether any of that checksumming happens in a kernel thread rather than in your own process, where perf stat (without -a) would count it.

Filesystems like XFS and ext4 definitely support O_DIRECT. You will need valloc or aligned_alloc: glibc's malloc for big allocations gets fresh pages with mmap and uses the first 16 bytes of them for its bookkeeping metadata, so big allocations are misaligned for every alignment of 32 and larger, including the page size.
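
For example, here is a minimal sketch of an O_DIRECT read into a suitably aligned buffer; the 4096-byte alignment is an assumption, and the real requirement depends on the device and filesystem:

/* needs gcc -D_GNU_SOURCE (or #define _GNU_SOURCE before the includes) for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char** argv) {
  size_t buf_size = 1024UL * 1024 * 1024;
  /* aligned_alloc needs the size to be a multiple of the alignment; 1 GiB is. */
  char* buffer = aligned_alloc(4096, buf_size);
  if (argc < 2 || !buffer) return 1;
  int fd = open(argv[1], O_DIRECT | O_RDONLY);
  if (fd < 0) return 1;
  /* O_DIRECT also wants the length and file offset aligned to the block size. */
  ssize_t n = read(fd, buffer, buf_size);
  (void)n;
  close(fd);
  free(buffer);
  return 0;
}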

FSes that support compression (like BTRFS) also can't do O_DIRECT for compressed files, and ZFS / BTRFS checksum data, which they have to verify at some point. XFS only checksums metadata.


DMA shouldn't touch L2 except perhaps to evict cache lines it's overwriting, and it can happen while your process isn't current on a CPU core because it's asleep, blocked on I/O. So if O_DIRECT worked, you'd actually expect no counts from that I/O unless you used system-wide mode (perf stat -a), and maybe only if you counted events for DRAM or L3. Or, with some of the data hot in L2 from memcpy, those lines would have to be evicted before the next DMA.

x86 DMA is cache-coherent (since early CPUs didn't have cache and a requirement for software to invalidate before DMA wouldn't have been backwards-compatible). Intel Xeons can even DMA directly into L3, instead of just writing back and invalidating any cached data. I don't know if AMD Zen does anything similar. With each core-cluster (CCX) having its own L3, it would have to know which L3 to target to be most useful.


2 Comments

  • (Does data really get cached twice, in the pagecache and ZFS's ARC? That seems wasteful. I'm not familiar with OpenZFS performance tuning or that level of detail on its internals; the explanation of how this could explain the 3.6x L2 traffic factor is very hand-wavy. I've also never worked with performance counters on AMD Zen-family CPUs. I wonder if L3 being a victim cache exclusive of L2, on Zen 2 at least, could lead to extra transfers as data is evicted back to L3? EPYC 7H12 is a Zen 2.)
  • Side note: I'd swap that recommendation of valloc for posix_memalign; valloc is nonstandard and obsolete (see the sketch below).
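
A minimal sketch of that swap, assuming a 1 GiB buffer like the one in the question:

#include <stdlib.h>
#include <unistd.h>

int main(void) {
  size_t buf_size = 1024UL * 1024 * 1024;
  void* buffer = NULL;
  long page_size = sysconf(_SC_PAGESIZE);   /* typically 4096 */
  /* posix_memalign returns 0 on success (an errno value on failure, not -1) */
  if (posix_memalign(&buffer, (size_t)page_size, buf_size) != 0)
    return 1;
  /* buffer is now page-aligned and can be released with free() as usual */
  free(buffer);
  return 0;
}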
