3

I was kind of bored, so I wanted to try out std::thread and eventually measure the performance of a single-threaded versus a multithreaded console application. This is a two-part question. I started with a single-threaded sum of a large vector of ints (800,000 ints).

int sum = 0;
auto start = chrono::high_resolution_clock::now();

for (int i = 0; i < 800000; ++i)
    sum += ints[i];

auto end = chrono::high_resolution_clock::now();
auto diff = end - start;

Then I added a range-based and an iterator-based for loop and measured them the same way with chrono::high_resolution_clock.

for (auto& val : ints)
    sum += val;

for (auto it = ints.begin(); it != ints.end(); ++it)
    sum += *it;

At this point the console output looked like:

index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms

This was a debug build, so I switched to release, and the difference was ~1 ms in favor of the index-based for loop. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a 1 ms difference in release mode?

I moved on to thread creation and tried to do a parallel sum of the array with this lambda (capturing everything by reference so I could use the vector of ints and a previously declared mutex), using an index-based for loop.

auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;

    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;

    m.lock();
    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
    sum += partial_sum;
    m.unlock();
};

for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));

Basically, every thread summed 1/8 of the total array, and the final console output was:

thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms

So I guess the second part of this question is: what's going on here? A solution with 2 threads ended up at ~30 ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.

EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler, or whatever it's called.

EDIT2: Here's the whole main function: http://pastebin.com/HyZUYxSY

10
  • What platform, compiler, etc? Commented Feb 18, 2015 at 16:13
  • Edited the question. I forgot to include that information. Commented Feb 18, 2015 at 16:16
  • Did you also try to use some native thread implementation without using those std::thread facilities? Commented Feb 18, 2015 at 16:19
  • Isn't std::thread an abstraction over native threads and using whatever native thread implementation is available on a platform? Commented Feb 18, 2015 at 16:25
  • It would be great if you posted a MCVE that others could just run on their machine. Commented Feb 18, 2015 at 16:27

5 Answers

2

With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to stop all of the code from being taken literally. For example, your iterator code will likely call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator==, and so on and so on.
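To make that concrete, here is a small sketch (not the OP's actual code) of where those calls hide, and how hoisting end() out of the loop condition trims at least one call per iteration in a debug build; the optimizer does this for you in a release build:

#include <vector>

// Debug build: begin(), end(), operator!=, operator* and operator++ are all
// real function calls on every pass through the loop.
int sum_plain(const std::vector<int>& ints)
{
    int sum = 0;
    for (auto it = ints.begin(); it != ints.end(); ++it)
        sum += *it;
    return sum;
}

// Hoisting end() out of the condition removes at least one of those calls
// per iteration, even without optimization.
int sum_hoisted(const std::vector<int>& ints)
{
    int sum = 0;
    for (auto it = ints.begin(), e = ints.end(); it != e; ++it)
        sum += *it;
    return sum;
}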

For the multithreaded code: processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, or switch into heat-protection mode. And rounding the times to milliseconds isn't really helpful. One thread could take 6.49 milliseconds and another 6.51, and they got rounded differently.


1 Comment

Yes, I know that. I checked and the CPU was at 4.5GHz while this was running.
2

should there be a difference this big in debug mode between these three for loops?

Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls, and function calls have a certain overhead.

Or even a difference in 1ms in release mode?

Your test code:

    start = ...
    for (auto& val : ints)
            sum += val;
    end = ...
    diff = end - start;
    sum = 0;

Doesn't use the result of the loop at all, so when optimizing, the compiler should simply throw the code away, resulting in something like:

    start = ...
    // do nothing...
    end = ...
    diff = end - start;

For all your loops.
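To see non-trivial numbers in an optimized build, the simplest guard is to actually use the result, for example by printing it after the measurement. A minimal sketch (serious benchmark harnesses use dedicated sinks such as benchmark::DoNotOptimize, but this is enough to keep the loop observable):

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> ints(800000, 1);
    long long sum = 0;

    auto start = std::chrono::high_resolution_clock::now();
    for (int v : ints)
        sum += v;
    auto end = std::chrono::high_resolution_clock::now();

    // Printing sum makes the loop's result observable, so the compiler
    // cannot simply discard it.
    std::cout << "sum = " << sum << "\n"
              << "range loop: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << "ms\n";
}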

The difference of 1 ms may be produced by the coarse granularity of high_resolution_clock in the standard library implementation used, and by differences in process scheduling during execution. I measured the index-based for loop as being 0.04 ms slower, but that result is meaningless.

Comments

1

Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and to context switching.

An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units. So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with the several hundred more threads from the OS (admittedly, most of which are asleep) that also need time, and you're left with a high degree of uncertainty in your measurements.

With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
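If you want the thread count to follow the hardware instead of a hard-coded 8, std::thread::hardware_concurrency() reports the number of logical processors (8 on a hyper-threaded quad core). A small sketch, keeping in mind the value may legally be 0 when unknown:

#include <algorithm>
#include <thread>

// Size the worker pool from what the hardware reports.
unsigned worker_count()
{
    unsigned n = std::thread::hardware_concurrency(); // logical processors, may be 0
    return std::max(1u, n);                           // fall back to 1 if unknown
}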

As for your debugging questions: you should check whether Visual Studio has iterator checking enabled in debugging. When it's enabled, every time an iterator is used it is bounds-checked, and so on. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
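If you want to benchmark a Debug build without that checking, one option is to lower MSVC's iterator debug level. A sketch (the macro must be defined before any standard header is included and kept consistent across all translation units you link together):

// Debug builds default to _ITERATOR_DEBUG_LEVEL 2 (full iterator debugging);
// 0 turns the checks off.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>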

Lastly: have a look at the -openmp switch. If you enable that and apply the OpenMP #pragmas to your for loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool. :) ) and OpenMP's performance is pretty damn good.
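A minimal sketch of what that looks like (compile with /openmp in Visual Studio, -fopenmp with GCC/Clang); the reduction clause gives every thread its own partial sum and combines them when the parallel region ends:

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> ints(800000, 1);
    const int n = static_cast<int>(ints.size());
    int sum = 0;

    // Each thread gets a private copy of sum; the copies are added together
    // at the end of the parallel for.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += ints[i];

    std::printf("sum = %d\n", sum);
}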

5 Comments

Did you use std::thread and does OpenMP offer better performance? I just tested std::thread and native Windows threads, and it showed a huge difference.
Based on your code I made a version of my own which tests the Index For, Iterator For, Ranged For, StdThread and BoostThread. I haven't gotten round yet to putting in POSIX/Windows threads. And the results matched up with yours. On an i7-860 with 100,000,000 ints and 8 threads, release build, VS2013: of the three single-threaded loops, all are <1 ms; std::thread around 120 ms, boost 130 ms, OpenMP 3800 ms. OpenMP is disappointing, but all three show the impact of thread switching. Though, VS2013 uses OpenMP 2.0 while 4.0 is out. Here's a link to my code: dl.dropboxusercontent.com/u/6108803/thread_tester.cpp
I just tested it on a Linux box with an Intel G1820 (dual core, dual threaded). I compiled it with GCC 4.9.1, which has OpenMP 4, and hoped for a better outcome for OpenMP. Sadly enough, the results were in line with the Windows box. The single-threaded loops were an order of magnitude faster than the threaded versions. std::thread still beats boost::thread. And OpenMP still lags behind as if it tried to run without legs. Man, what a let-down. The OpenMP version was so much easier to make than the other threaded versions! The results: 0, 0, 0 (index, ranged, iterator), 18 (std), 23 (boost), 4098 ms (OpenMP).
Okay, the OpenMP results galled me so I figured I must have done something wrong. Which I did. I updated my code with a new test show-casing a much better OMP construction. The results on my office machine with an i7-4770 on Win8.1 with VS2013 are now: Index/Ranged/Iterator: <1ms. std: 117-152 ms, boost: 127-156ms, OpenMP old: ~1900ms, OpenMP new: 77-136ms. OpenMP is now consistently faster than the other thread tests.
Thanks for sharing your results. I'll have to try OpenMP and play around with it once I have some free time. I refactored the initial code, and got the multithreaded test to be faster than the singlethreaded, using std::thread. Results were a lot different when everything measuring related was moved out of the lambda used by threads.
1

For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much of what would normally be inlined may not be.

However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.

For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a Core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per-test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
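Roughly what that repetition looks like (a sketch, not the exact code used; repeating the work and dividing averages out timer granularity and scheduling jitter):

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const std::vector<int> ints(800000, 1);
    const int runs = 1000;
    long long sink = 0;

    auto start = std::chrono::high_resolution_clock::now();
    for (int run = 0; run < runs; ++run)
        for (int v : ints)
            sink += v;   // sink keeps the work observable
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "per-run: "
              << std::chrono::duration<double, std::milli>(end - start).count() / runs
              << " ms (sink = " << sink << ")\n";
}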

The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?

I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.

Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing and by the 8-thread totals being so much greater, which, as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?

1 Comment

With optimized release build and 800,000 ints all I got were zeroes, so I increased it to 80,000,000 and got this: index loop: 28.0016ms, 8 threads total: 87.0051ms
1

Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:

The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, during the time the first thread instance becomes active, the main thread might (or might not) be preempted. There is also no guarantee that the threads get spawned on individual cores (thread affinity).
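(A tiny sketch of that first point: the callable starts running as soon as the std::thread object is constructed; only joining is explicit.)

#include <iostream>
#include <thread>

int main()
{
    // Execution of the lambda begins here, at construction time.
    std::thread t([] { std::cout << "already running\n"; });
    t.join(); // there is no separate start() step
}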

If you have a closer look at the POSIX APIs, there is the notion of "application context" and "system context", which basically implies that there might be an OS policy in place which would not use all cores for one application.

On Windows (this is where you were testing), maybe the threads are not spawned directly but via a thread pool, maybe with some extra std::thread functionality (such as completion ports etc.), which could produce overhead/delay.

Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, it starts to pay off to go parallel when the computation times are well beyond the length of a time slice (rule of thumb).

Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.

#include <stdafx.h>
#include <nativethreadTest.h>

#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>

struct Range
{
    Range( const int32_t *p, size_t l)
        : data(p)
        , length(l)
        , result(0)
    {}
    const int32_t *data;
    size_t length;
    int32_t result;
};

static int32_t Sum(const int32_t * data, size_t length)
{
    int32_t sum = 0;
    const int32_t *end = data + length;
    for (; data != end; data++)
    {
        sum += *data;
    }
    return sum;
}

static int32_t TestSingleThreaded(const Range& range)
{
    return Sum(range.data, range.length);
}

DWORD 
WINAPI 
CalcThread
(_In_  LPVOID lpParameter
)
{
    Range * myRange = reinterpret_cast<Range*>(lpParameter);
    myRange->result = Sum(myRange->data, myRange->length);
    return 0;
}

static int32_t TestWithNCores(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    // Create exactly ncores equal-sized chunks; leftover elements are handled
    // by remainderRange below.
    while (ranges.size() < ncores)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);

    std::vector<HANDLE> threadHandles;
    threadHandles.reserve(ncores);
    for (size_t i = 0; i < ncores; ++i)
    {
        threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
    }
    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
    DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
    if (WAIT_OBJECT_0 == waitResult)
    {
        for (auto& r : ranges)
        {
            result += r.result;
        }
        result += remainderResult;
    }
    else
    {
        throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
    }
    for (auto& h : threadHandles)
    {
        ::CloseHandle(h);
    }
    return result;
}

static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    // Create exactly ncores equal-sized chunks; leftover elements are handled
    // by remainderRange below.
    while (ranges.size() < ncores)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);

    std::vector<std::thread> threads;
    for (size_t i = 0; i < ncores; ++i)
    {
        threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
    }

    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
    for (auto& t : threads)
    {
        t.join();
    }
    for (auto& r : ranges)
    {
        result += r.result;
    }
    result += remainderResult;
    return result;
}

void TestNativeThreads()
{
    const size_t DATA_SIZE = 800000000ULL;
    typedef std::vector<int32_t> DataVector;
    DataVector data;
    data.reserve(DATA_SIZE);

    for (size_t i = 0; i < DATA_SIZE; ++i)
    {
        data.push_back(static_cast<int32_t>(i));
    }

    Range r = { data.data(), data.size() };
    auto singleThreadedStart = std::chrono::high_resolution_clock::now();
    int32_t result = TestSingleThreaded(r);
    auto singleThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "Single threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;

    auto multiThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithNCores(r, 8);
    auto multiThreadedEnd = std::chrono::high_resolution_clock::now();

    std::cout 
        << "Multi threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;

    auto stdThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithSTLThreads(r, 8);
    auto stdThreadedEnd = std::chrono::high_resolution_clock::now();

    std::cout
        << "std::thread sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;
}

Here the output on my machine of this code:

Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..

Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O benchmark than a core CPU computation benchmark. Better computation benchmarks would use small amounts of data that are local, fit into CPU caches, etc.

Maybe it would be interesting to experiment with how the data is split into ranges. What if each thread "jumped" over the data from start to end with a stride of ncores? Thread 1: 0, 8, 16, ...; thread 2: 1, 9, 17, ...; etc. Maybe then the "locality" of the memory could gain extra speed.
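A sketch of that strided split (the names here are illustrative, not taken from the code above); note that whether this helps or hurts depends on how cache lines end up being shared between the threads:

#include <cstddef>
#include <cstdint>

// Thread `index` sums elements index, index + ncores, index + 2*ncores, ...
static int32_t StridedSum(const int32_t* data, size_t length,
                          size_t index, size_t ncores)
{
    int32_t sum = 0;
    for (size_t i = index; i < length; i += ncores)
        sum += data[i];
    return sum;
}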

3 Comments

So, is there a way to force std::threads to be spawned on individual cores? Or maybe native Windows threads?
Windows: SetThreadAffinityMask(). The point of this answer is that there are many more degrees of freedom in what std::thread might do than is good for using it in benchmarks. Another example: does it call CoInitializeEx(..)?
For 8,000,000: Single threaded sum: 13ms (Result = -1805322496), Multi threaded sum: 2ms (Result = -1805322496). For 80,000,000: Single threaded sum: 133ms (Result = 216376832), Multi threaded sum: 22ms (Result = 216376832). And I couldn't run it for 800,000,000. It was all with the unoptimized debug build. This basically destroys the std::thread performance in my original post.
