
I'm going through some Linux kernel documentation on per-CPU variables. I understand that having a separate variable for each CPU helps prevent cache-line invalidation and makes things faster.

But on a multiprocessor system, a task can get scheduled on any processor, so cache and TLB invalidations (or cleans) keep happening for plenty of other kernel data anyway. How does having a few per-CPU variables increase the performance of the system, unless those variables are used very frequently?

  • For variables that are read (really) very often (and that do not need to be protected by a lock), it is better to use one shared variable, because separate variables would take more space in CPU caches, which takes time to reload after cache misses (especially due to context switches). Commented Mar 21 at 1:11
  • In high-performance computing (HPC) systems, users are strongly encouraged to pin the threads of HPC applications to cores so as to avoid thread migration from one core to another. When this is not done, performance is often bad anyway. Migration still matters in some use cases, though: for example, during builds, or even in PC games (where thread pinning is not only hard but also often detrimental to performance, due to the inherently dynamic nature of the computations and to there often being more threads to schedule than available cores). Commented Mar 21 at 1:18
  • Note that if a thread switches from one core to another in the same NUMA node, its data can still lie in the shared cache (L3 on mainstream CPUs) and in the shared TLB. Fetches from the L3 are not so expensive on modern CPUs, especially thanks to out-of-order and speculative execution avoiding stalls (not to mention that multiple loads can be in flight at the same time). The overhead of a context switch is significantly bigger than the L3 latency on most machines. It still matters a bit, though. Commented Mar 21 at 1:26
  • Also, please note that on many-core systems (e.g. microprocessors with >64 cores, possibly running >128 hardware threads), scalability is really critical. The overhead of cache-line bouncing is HUGE due to the serialization of the operation across a large number of cores, especially between NUMA nodes (e.g. with a clustered L3), which increases the latency of such an operation. The best choice certainly depends on the target hardware and the kind of application run. For example, per-CPU variables are generally a good idea on supercomputers, but not necessarily on embedded systems... Commented Mar 21 at 1:36
  • @JérômeRichard I really appreciate your answers. So in cases where some per-CPU variables bounce between two different clusters of a NUMA node, that can cause performance issues due to serialization. But on an embedded system with few cores, per-CPU variables may not be very useful? Commented Mar 22 at 17:19
