I have a loop that's running slower than I expected. I measure the time taken per collection it processes, and each collection takes twice as long when I use 8 cores (so the run is only 4x faster overall). There's no data overlap between cores. I suspect the problem is either that I'm bound by how much data I'm writing (2 GB in 600 ms; I was expecting under 250 ms), or that the cores are evicting each other's items from the L3 cache, leaving each core with effectively only its L1/L2. I don't suppose a core can evict another core's L1/L2 data by evicting its spot from the L3?
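To put the write numbers in perspective, here is a rough sketch of the arithmetic. It assumes "2 GB" means 2e9 bytes and that this is the total written during the 600 ms window; the function name is just for illustration.

```python
def bandwidth_gb_per_s(bytes_written: float, seconds: float) -> float:
    """Return sustained write bandwidth in GB/s (assuming 1 GB = 1e9 bytes)."""
    return bytes_written / seconds / 1e9


# Assumption: 2 GB is decimal gigabytes and is the total for the whole window.
BYTES_WRITTEN = 2e9

measured = bandwidth_gb_per_s(BYTES_WRITTEN, 0.600)  # what I actually see
expected = bandwidth_gb_per_s(BYTES_WRITTEN, 0.250)  # what I was hoping to beat

print(f"measured: {measured:.2f} GB/s")
print(f"expected: more than {expected:.2f} GB/s")
```

So the measured rate works out to roughly 3.3 GB/s, against the 8 GB/s or more I was expecting.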
I wanted to measure this with perf, but I can't figure out which events to use. I tried looking at write-related events and found the ones below, which I can't understand:
$ perf list | grep -iP "(write|wcb|wt)"
l2_wcb_req.cl_zero
[LS to L2 WCB cache line zeroing requests. LS (Load/Store unit) to L2
WCB (Write Combining Buffer) cache line zeroing requests]
l2_wcb_req.wcb_close
[LS to L2 WCB close requests. LS (Load/Store unit) to L2 WCB (Write
l2_wcb_req.wcb_write
[LS to L2 WCB write requests. LS (Load/Store unit) to L2 WCB (Write
Combining Buffer) write requests]
l2_wcb_req.zero_byte_store
[LS to L2 WCB zero byte store requests. LS (Load/Store unit) to L2 WCB
(Write Combining Buffer) zero byte store requests]
ls_st_commit_cancel2.st_commit_cancel_wcb_full
I also tried looking at TLB events, and the numbers don't seem high. 'l2_cache_accesses_from_dc_misses' shows that I'm getting a lot of hits in L2 on L1 misses, but I can't seem to get any information on L3 (I'm on an AMD CPU).
What events should I try?