While doing performance benchmarking on CPLEX, I ran across a strange issue. The details are written up in a blog post: https://community.ibm.com/community/user/ai-datascience/blogs/xavier-nodet1/2021/07/08/performance-considerations-for-cplex-on-multiproce. It's only about two pages, but I don't want to copy it all here, so here are the highlights:
On a multi-socket machine we need to worry about NUMA access times. If the code's cache-hit ratio is low, it runs significantly faster when its memory is allocated in the memory bank of the CPU it is running on than when the memory ends up in the other bank. The allocator of course prefers the local bank, but if that one is full it falls back to the other bank, even when the local bank is occupied only by caches. Now, for whatever reason, the dentry/inode cache sits exclusively in CPU0's bank. So if my code happens to get scheduled on a core of CPU0, its performance differs noticeably depending on whether that cache occupies a little or a lot of bank 0. Not good for benchmarking...
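Just to make the local-vs-remote difference concrete, here is a minimal probe I put together (not from the blog post); it assumes a two-socket box with libnuma installed, and the buffer size and node numbers are arbitrary:

```c
/* Minimal sketch: pin the thread to node 0 with libnuma, then time a walk
 * over a buffer allocated on node 0 vs. node 1.  That is roughly the
 * local-vs-remote access difference described above.
 * Build with: gcc numa_probe.c -o numa_probe -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* volatile so the stores aren't optimized away */
static double walk(volatile char *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 64)      /* touch one byte per cache line */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    const size_t len = 1UL << 30;             /* 1 GiB test buffer */

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    numa_run_on_node(0);                      /* schedule ourselves on node 0 */

    for (int node = 0; node <= 1; node++) {
        char *buf = numa_alloc_onnode(len, node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        memset(buf, 0, len);                  /* fault the pages in on that node */
        printf("memory on node %d: %.3f s\n", node, walk(buf, len));
        numa_free(buf, len);
    }
    return 0;
}
```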
I can force the kernel to drop the caches, but that has disadvantages as well. It might not even be sufficient: sometimes I'm reading in huge files, so the page cache might fill up bank 0 again before I even get to processing the data.
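One mitigation for that last part is to tell the kernel it may drop the page-cache pages of a big input file as soon as I'm done reading it, via posix_fadvise(POSIX_FADV_DONTNEED). Again just a sketch under my assumptions: the file name is a placeholder, it only covers the page cache of that file (not the dentry/inode slab), and the kernel is free to ignore the advice:

```c
/* Sketch: read a large input file, then advise the kernel to drop its
 * page-cache pages so they don't keep occupying bank 0 for the rest of
 * the run.  Real processing goes where the read loop is. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/huge_input.bin", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[1 << 20];
    while (read(fd, buf, sizeof buf) > 0)
        ;                                              /* ... parse the data ... */

    /* offset 0, len 0 means "from the start to the end of the file" */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return 0;
}
```

(Opening the file with O_DIRECT would keep the reads out of the page cache altogether, at the cost of alignment requirements and a different I/O pattern.)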
So... I'm wondering: is it possible to tell the allocator to prefer dropping caches in the local bank before it falls back to allocating memory from the other bank?
PS: In the referenced blog post I did mention that vfs_cache_pressure could be used to mitigate the problem, but I have since found that it is not a silver bullet either.