When you split an algorithm or function to run as separate threads, say I launch 8 threads, you don't know that each thread will run on a separate one of my 8 cores; it's the scheduler's job to decide which threads are given to which cores. Still, if I have 8 cores and I split a job into 8 threads, I pretty much expect that to happen: each of my 8 cores will (roughly) take one eighth of the workload. On Intel processors that have P and E cores (performance and efficiency cores), the P cores might be clocked at 5.4 GHz and the E cores at 4.2 GHz. Does a processor with two different types of cores make multithreaded programming more unpredictable or less desirable? The two-tier design is common in other devices like smartphones and Apple CPUs, and the same question applies there. As a programmer, how are you supposed to account for the fact that when you run something on another thread, whether you spawn a new thread or a thread in a pool picks up the job, it may run on a performance core or an efficiency core? Do you have any choice?

  • I think Intel oneTBB is the most optimal in this regard: you distribute work based on a chunk size, say 1000 work items per chunk. If you have 1 million items, those are 1000 tasks, but it has a lower limit on task size of, say, 500 items, so 1800 items become 2 tasks of 900 items each, and for < 1000 items you get only 1 worker. You control the chunk size. I have found this chunk-size-based approach much more repeatable than a thread-count-based one, as you can maximize temporal coherence of the cache (see the sketch after these comments). Commented Mar 30 at 21:08
  • For most algorithms you can try different chunk sizes and plot a "bathtub curve" to find the optimal range for the chunk size. Commented Mar 30 at 21:20
  • It is also not a given that increasing the number of virtual CPUs running in parallel always increases performance. Once you saturate memory bandwidth or disk I/O capacity (whichever happens first), adding extra workers can even slow things down, and it certainly generates more heat. You have to experiment to find the performance sweet spot for your problem. Commented Mar 31 at 7:48
  • @AhmedAEK OK, I hadn't heard of Intel oneTBB. I also hadn't heard of OpenMP. To tell you the truth, I'm tired of learning APIs and libraries, because you have to relearn everything when you change library/API. I would prefer to stick to raw code. Would the equivalent of what you do with Intel oneTBB or OpenMP be doable in plain C++ code? C++ has the facilities to start threads and manipulate atomic variables; you can build anything with that, can't you? Also, at this level I think I'm learning something valuable. Commented Mar 31 at 12:05
  • @Zebrafish Yes, you can write that in pure C++, but at least learn why those people designed things the way they did and what the pros/cons of their approach are, so you can build something of comparable quality instead of repeating old mistakes. Also, according to SteamDB, Intel oneTBB is among the most used C++ SDKs; it provides concurrent containers, parallel versions of common algorithms like prefix sum and sort, good graph schedulers, and good task distribution. It is the go-to for utilizing multiple cores efficiently in complex applications. Commented Mar 31 at 12:11
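
A minimal sketch of the chunk-size-based distribution described in the comments above, using oneTBB's parallel_for. The grain size of 1000 plays the role of the chunk size; with a simple_partitioner the range is split into chunks of at most the grain size (and roughly no smaller than half of it). The scale function and its loop body are illustrative, not part of the library:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Multiply every element by `factor`, letting oneTBB hand out chunks
// of at most 1000 (and roughly at least 500) items to its workers.
void scale(std::vector<float>& v, float factor) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, v.size(), /*grainsize=*/1000),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;
        },
        tbb::simple_partitioner{});  // split strictly down to the grain size
}
```

Whichever worker (on a P-core or an E-core) finishes a chunk simply grabs the next one, so faster cores naturally end up processing more chunks.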

1 Answer


If you divide your workload into equal-size parts, the P-cores will finish first. But if you divide it into smaller parts and have threads grab another chunk when they're done with their current one, like OpenMP's schedule(dynamic) instead of schedule(static), you can keep all cores busy until all the work is done.
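
As a concrete illustration, here's what that looks like with OpenMP (compile with -fopenmp); the loop body is a hypothetical stand-in for real per-item work, and the chunk size of 1000 is just an example:

```cpp
#include <cmath>
#include <vector>

void run(long n) {
    std::vector<double> out(n);
    // schedule(static) would give each thread a fixed n/num_threads slice;
    // schedule(dynamic, 1000) lets each thread grab 1000-iteration chunks
    // as it finishes, so faster P-cores naturally take more chunks.
    #pragma omp parallel for schedule(dynamic, 1000)
    for (long i = 0; i < n; ++i)
        out[i] = std::sqrt((double)i);  // stand-in for real per-item work
}
```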

Or if there are lots of parallelizable tasks to be done, and later ones can start while some threads of the first are still finishing, that makes it easy to send work to a thread pool.
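
A minimal sketch of that pattern using OpenMP tasks; stage_a and stage_b are hypothetical stand-ins for independent pieces of work, and any idle worker, whether on a P-core or an E-core, picks up the next task:

```cpp
#include <cstdio>

// Hypothetical stand-ins for independent pieces of work.
static void stage_a(int i) { std::printf("a %d\n", i); }
static void stage_b(int i) { std::printf("b %d\n", i); }

int main() {
    #pragma omp parallel      // start the worker team (compile with -fopenmp)
    #pragma omp single        // one thread creates the tasks...
    for (int i = 0; i < 8; ++i) {
        #pragma omp task      // ...and whichever worker is idle runs each one
        stage_a(i);
        #pragma omp task      // later tasks can start while earlier ones still run
        stage_b(i);
    }
}
```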

Dividing your work into 8 equal-sized parts for an 8-core CPU can be bad even on a homogeneous CPU if there's any other load: if some threads are descheduled for a while they won't finish as early as threads that ran the whole time. (Especially if the total time is on the same scale as a scheduling granule, e.g. 10 ms for Linux with HZ=100.)

So there's already reason to divide up work into moderate-size chunks for threads to consume, especially if you're using a sophisticated thread-pool system like OpenMP which can do that for you without having to write a lot of extra code.


5 Comments

Hmm, OK, my understanding was that you should use as many cores as you have available; I've seen people use something like C++'s std::thread::hardware_concurrency() - 1. But what you're saying makes sense: if I split the work into 8 segments with 4 P-cores and 4 E-cores, the P-cores will finish first and you'll have to wait for the E-cores, whereas if I split it into 32 segments, then whenever a P-core finishes before the E-cores it just picks up another job, and so on, so the work is more spread out that way?
It's strange that on Intel they don't seem to offer more than 8 performance cores, but up to 16 efficiency cores, for a total of 24 cores. In this example, you might divide the work into what? Like 30 or 40 threads?
@Zebrafish: You'd typically want std::thread::hardware_concurrency() threads working at once, each claiming medium-sized chunks of work until there are no chunks left. You don't start a separate thread for every chunk, especially not all at once. If you're doing a dot product, for example, the outer loop of a worker thread might do float *chunkstart = chunkpos.fetch_add(chunk_size); and compare against an end pointer to see whether this chunk is past the end of the array. (If so, other threads have already claimed all the chunks.) See the sketch after these comments for a fleshed-out version of this pattern.
The thing about hardware_concurrency is that, for example, my CPU has 4 cores but hardware_concurrency returns 8; I'm not sure if it would be better off sticking to 4 threads in that case.
@Zebrafish: yeah, it normally returns the number of logical cores. With some workloads (especially ones tuned for cache sizes, such as matmul), it's best to run only one thread per physical core, regardless of SMT (such as hyperthreading). This is kind of an unsolved problem in terms of OS APIs for reporting hardware capabilities; it usually requires manual work or OS-specific or CPU-vendor-specific tools to figure out the number of physical cores.
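
For reference, a plain-C++ sketch of the chunk-claiming pattern described in the comments above, using an atomic index rather than an atomic pointer; all names here are illustrative, and the chunk size is just an example:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dot product where a fixed team of threads repeatedly claims
// `chunk`-sized pieces of work until none are left.
float dot(const float* a, const float* b, size_t n, size_t chunk = 4096) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<size_t> next{0};               // start of the next unclaimed chunk
    std::vector<float> partial(nthreads, 0.0f);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            for (;;) {
                size_t start = next.fetch_add(chunk);  // claim a chunk
                if (start >= n) break;                 // all chunks already claimed
                size_t end = std::min(start + chunk, n);
                float sum = 0.0f;
                for (size_t i = start; i < end; ++i)
                    sum += a[i] * b[i];
                partial[t] += sum;                     // per-thread, no contention
            }
        });
    }
    for (auto& th : pool) th.join();

    float total = 0.0f;
    for (float p : partial) total += p;                // combine partial sums
    return total;
}
```

A P-core that finishes its chunk early just loops around and claims another, so no thread sits idle until the whole array has been consumed.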
