
I have a loop that's running slower than I expected. I measured how long it takes per collection it processes and noticed that each collection takes twice as long when I use 8 cores (although the run is 4x faster overall). There's no data overlap between cores. I suspect that either I'm bound by how much data I'm writing (2 GB in 600 ms; I was expecting <250 ms), or the cores are evicting each other's items from the L3 cache, making each core effectively L1/L2 only. I don't suppose a core can evict another core L1/L2 data by evicting its spot from the l3?
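As a rough sanity check on those numbers, here is a minimal sketch of the implied write throughput. It assumes "2 GB" means 2 GiB and that the 600 ms and 250 ms figures cover the same total amount of data; the function name is made up for illustration.

```python
# Rough arithmetic only: implied write throughput for the figures in the question.
# Assumption: "2 GB" is treated as 2 GiB (2 * 1024**3 bytes).
GIB = 1024 ** 3


def throughput_gib_per_s(bytes_written: int, seconds: float) -> float:
    """Return throughput in GiB/s for a given byte count and duration."""
    return bytes_written / seconds / GIB


bytes_written = 2 * GIB
observed = throughput_gib_per_s(bytes_written, 0.600)  # measured: 600 ms
expected = throughput_gib_per_s(bytes_written, 0.250)  # hoped for: 250 ms

print(f"observed: {observed:.2f} GiB/s")
print(f"expected: {expected:.2f} GiB/s")
print(f"shortfall factor: {expected / observed:.2f}x")
```

Comparing the observed figure with the memory bandwidth the machine can actually sustain would help tell a bandwidth limit apart from a cache-eviction effect.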

I wanted to measure this with perf, but I can't figure out which events to use. I tried looking at write-related events and found the ones below, which I don't understand:

$ perf list | grep -iP "(write|wcb|wt)"
    l2_wcb_req.cl_zero
        [LS to L2 WCB cache line zeroing requests. LS (Load/Store unit) to L2
            WCB (Write Combining Buffer) cache line zeroing requests]
    l2_wcb_req.wcb_close
        [LS to L2 WCB close requests. LS (Load/Store unit) to L2 WCB (Write
    l2_wcb_req.wcb_write
        [LS to L2 WCB write requests. LS (Load/Store unit) to L2 WCB (Write
            Combining Buffer) write requests]
    l2_wcb_req.zero_byte_store
        [LS to L2 WCB zero byte store requests. LS (Load/Store unit) to L2 WCB
            (Write Combining Buffer) zero byte store requests]
    ls_st_commit_cancel2.st_commit_cancel_wcb_full

I tried looking at TLB events and the counts don't seem high. I also looked at 'l2_cache_accesses_from_dc_misses', which shows I'm getting a lot of hits in L2 from L1 misses, but I can't seem to get information on L3 (I'm on an AMD CPU).

What events should I try?

  • I don't suppose a core can evict another core L1/L2 data by evicting its spot from the l3? - On CPUs with inclusive L3 cache, that can in fact happen. (Intel i7 series, and Xeons before Skylake aka Scalable Xeon). L3 tags are the only way those CPUs keep track of which cores have exclusive ownership of a cache line, and which might have MESI Shared copies so need to get notified if another core wants to write it. (i.e. as the directory for tracking MESI state, because broadcasting and snooping a shared bus doesn't scale.) So evicting a line requires flushing it from any core's L2/L1. Commented Aug 28, 2023 at 20:50
  • Oh, you're on AMD. Zen 2's L3 caches are non-inclusive / non-exclusive. Their private L2 caches are inclusive of L1i/d. The coherence directory is separate: "The L3 cache maintains shadow tags for all cache lines of each L2 cache in the CCX" (en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core_Complex) - IDK if those shadow tags impose a limit on how many different lines the different L2 caches in a CCX can cache. Oh, and those shadow tags are also snoop filters for requests to other CCXs in CPUs with lots of cores. Commented Aug 28, 2023 at 20:57
