When you split an algorithm or function to run as separate threads, say I launch 8 threads, you don't know that each thread will run on a separate one of my 8 cores; it's the scheduler's job to decide which threads are given to which cores. Still, if I have 8 cores and I split a job into 8 threads, I pretty much expect that to happen: each of my 8 cores will (roughly) take one eighth of the workload. On Intel processors that have P and E cores (performance and efficiency cores), the P cores might be clocked at 5.4 GHz and the E cores at 4.2 GHz. Does a processor with two different types of cores make multithreaded programming more unpredictable or less desirable? The two-tier design is common in other devices like smartphones and Apple CPUs, and the same question applies there. As a programmer, how are you supposed to account for the fact that when you run something on another thread, whether you spawn a new thread or a thread in a pool picks up the job, it may run on a performance core or an efficiency core? Do you have any choice?

  • I think Intel oneTBB is the most optimal in this regard: you distribute work based on a chunk size, say 1000 work items per chunk. If you have 1 million items, those are 1000 tasks, but it has a lower limit on task size of, say, 500 items, so 1800 items become 2 tasks of 900 items each, and for < 1000 items you get only 1 worker. You control the chunk size. I have found this chunk-size-based approach much more repeatable than a thread-count-based one, as you can maximize temporal coherence of the cache (see the sketch after these comments). Commented Mar 30 at 21:08
  • For most algorithms you can try different chunk sizes and plot a "bathtub curve" to find the optimal range for the chunk size. Commented Mar 30 at 21:20
  • It is also not a given that increasing the number of virtual CPUs running in parallel always increases performance. Once you saturate memory bandwidth or disk I/O capacity (whichever happens first), adding extra workers can even slow things down, and it certainly generates more heat. You have to experiment to find the performance sweet spot for your problem. Commented Mar 31 at 7:48
  • @AhmedAEK OK, I hadn't heard of Intel oneTBB. I also hadn't heard of OpenMP. To tell you the truth, I'm tired of learning APIs and libraries, because you have to relearn everything when you change library/API. I would prefer to stick to raw code. Would the equivalent of what you do with Intel oneTBB or OpenMP be doable in plain C++ code? C++ has the facilities to start threads and manipulate atomic variables; you can build anything with that, can't you? Also, at this level I think I'm learning something valuable. Commented Mar 31 at 12:05
  • @Zebrafish Yes, you can write that in pure C++, but at least learn why those people designed things the way they did and what the pros/cons of their approach are, so you can build something of comparable quality instead of repeating old mistakes. Also, according to SteamDB, Intel oneTBB is among the most used C++ SDKs; it provides concurrent containers, parallel versions of common algorithms like prefix sum and sort, good graph schedulers, and good task distribution. It is the go-to for utilizing multiple cores efficiently in complex applications. Commented Mar 31 at 12:11
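
A minimal sketch of the chunk-size-based distribution described in the comments above, using oneTBB's parallel_for. The grain size of 1000 plays the role of the chunk size; with a simple_partitioner the range is split into chunks of at most the grain size (and roughly no smaller than half of it). The scale function and its loop body are illustrative, not part of the library:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Multiply every element by `factor`, letting oneTBB hand out chunks
// of at most 1000 (and roughly at least 500) items to its workers.
void scale(std::vector<float>& v, float factor) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, v.size(), /*grainsize=*/1000),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;
        },
        tbb::simple_partitioner{});  // split strictly down to the grain size
}
```

Whichever worker (on a P-core or an E-core) finishes a chunk simply grabs the next one, so faster cores naturally end up processing more chunks.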

1 Answer


If you divide your workload into equal-size parts, the P-cores will finish first. But if you divide it into smaller parts and have threads grab another chunk when they're done with their current one, like OpenMP's schedule(dynamic) instead of schedule(static), you can keep all cores busy until all the work is done.
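
As a concrete illustration, here's what that looks like with OpenMP (compile with -fopenmp); the loop body is a hypothetical stand-in for real per-item work, and the chunk size of 1000 is just an example:

```cpp
#include <cmath>
#include <vector>

void run(long n) {
    std::vector<double> out(n);
    // schedule(static) would give each thread a fixed n/num_threads slice;
    // schedule(dynamic, 1000) lets each thread grab 1000-iteration chunks
    // as it finishes, so faster P-cores naturally take more chunks.
    #pragma omp parallel for schedule(dynamic, 1000)
    for (long i = 0; i < n; ++i)
        out[i] = std::sqrt((double)i);  // stand-in for real per-item work
}
```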

Or if there are lots of parallelizable tasks to be done, and later ones can start while some threads of the first are still finishing, that makes it easy to send work to a thread pool.
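
A minimal sketch of that pattern using OpenMP tasks; stage_a and stage_b are hypothetical stand-ins for independent pieces of work, and any idle worker, whether on a P-core or an E-core, picks up the next task:

```cpp
#include <cstdio>

// Hypothetical stand-ins for independent pieces of work.
static void stage_a(int i) { std::printf("a %d\n", i); }
static void stage_b(int i) { std::printf("b %d\n", i); }

int main() {
    #pragma omp parallel      // start the worker team (compile with -fopenmp)
    #pragma omp single        // one thread creates the tasks...
    for (int i = 0; i < 8; ++i) {
        #pragma omp task      // ...and whichever worker is idle runs each one
        stage_a(i);
        #pragma omp task      // later tasks can start while earlier ones still run
        stage_b(i);
    }
}
```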

Dividing your work into 8 equal-sized parts for an 8-core CPU can be bad even on a homogeneous CPU if there's any other load: if some threads are descheduled for a while they won't finish as early as threads that ran the whole time. (Especially if the total time is on the same scale as a scheduling granule, e.g. 10 ms for Linux with HZ=100.)

So there's already reason to divide up work into moderate-size chunks for threads to consume, especially if you're using a sophisticated thread-pool system like OpenMP which can do that for you without having to write a lot of extra code.


5 Comments

Hmm, OK, my understanding was that you should use as many cores as you have available; I've seen people use something like C++'s std::thread::hardware_concurrency() - 1. But what you're saying makes sense: if I split the work into 8 segments with 4 P-cores and 4 E-cores, the P-cores will finish first and you'll have to wait for the E-cores, whereas if I split it into 32 segments, then whenever a P-core finishes before the E-cores it just picks up another job, and so on, so the work is more spread out that way?
It's strange that on Intel they don't seem to offer more than 8 performance cores, but up to 16 efficiency cores, for a total of 24 cores. In this example, you might divide the work into what? Like 30 or 40 threads?
@Zebrafish: You'd typically want std::thread::hardware_concurrency() threads working at once, each claiming medium-sized chunks of work until there are no chunks left. You don't start a separate thread for every chunk, especially not all at once. If you're doing a dot product, for example, the outer loop of a worker thread might do float *chunkstart = chunkpos.fetch_add(chunk_size); and compare against an end pointer to see whether this chunk is past the end of the array. (If so, other threads have already claimed all the chunks.) See the sketch after these comments for a fleshed-out version of this pattern.
The thing about hardware_concurrency is that, for example, my CPU has 4 cores but hardware_concurrency returns 8; I'm not sure if it would be better off sticking to 4 threads in that case.
@Zebrafish: yeah, it normally returns the number of logical cores. With some workloads (especially ones tuned for cache sizes, such as matmul), it's best to run only one thread per physical core, regardless of SMT (such as hyperthreading). This is kind of an unsolved problem in terms of OS APIs for reporting hardware capabilities; it usually requires manual work or OS-specific or CPU-vendor-specific tools to figure out the number of physical cores.
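
For reference, a plain-C++ sketch of the chunk-claiming pattern described in the comments above, using an atomic index rather than an atomic pointer; all names here are illustrative, and the chunk size is just an example:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Dot product where a fixed team of threads repeatedly claims
// `chunk`-sized pieces of work until none are left.
float dot(const float* a, const float* b, size_t n, size_t chunk = 4096) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<size_t> next{0};               // start of the next unclaimed chunk
    std::vector<float> partial(nthreads, 0.0f);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            for (;;) {
                size_t start = next.fetch_add(chunk);  // claim a chunk
                if (start >= n) break;                 // all chunks already claimed
                size_t end = std::min(start + chunk, n);
                float sum = 0.0f;
                for (size_t i = start; i < end; ++i)
                    sum += a[i] * b[i];
                partial[t] += sum;                     // per-thread, no contention
            }
        });
    }
    for (auto& th : pool) th.join();

    float total = 0.0f;
    for (float p : partial) total += p;                // combine partial sums
    return total;
}
```

A P-core that finishes its chunk early just loops around and claims another, so no thread sits idle until the whole array has been consumed.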
