I am sorry to post this question again with some updates; the previous one was closed. I am trying to measure the performance speedup from AVX instructions. Below is the example code I am running:

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <cstdlib>
#include <algorithm>
#include <immintrin.h>
#include <chrono>
#include <complex>
//using Type = std::complex<double>;
using Type = double;

int main()
{
    size_t b_size = 1;
    b_size = (1ul << 30) * b_size;
    Type *d_ptr = (Type*)malloc(sizeof(Type) * b_size);
    // Zero-initialize the buffer before the timed region.
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = 0;
    }
    std::cout << "malloc finishes!" << std::endl;
#ifndef AVX512
    // Scalar version: left to the compiler to auto-vectorize.
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i++)
    {
        d_ptr[i] = i * 0.1;
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "No avx takes " << diff << std::endl;
#else
    // Manually vectorized version using 256-bit AVX intrinsics.
    auto a = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < b_size; i += 4)
    {
        /* __m128d tmp1 = _mm_load_pd(reinterpret_cast<double*>(&d_ptr[i]));
           __m128d tmp2 = _mm_set_pd((i+1)*0.1, 0.1*i);
           __m128d tmp3 = _mm_add_pd(tmp1, tmp2);
           _mm_store_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3); */
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        __m256d tmp2 = _mm256_set_pd(0.1*(i+3), 0.1*(i+2), 0.1*(i+1), 0.1*i);
        __m256d tmp3 = _mm256_add_pd(tmp1, tmp2);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp3);
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "avx takes " << diff << std::endl;
#endif
}

I have run the above code on a Haswell machine. The results are surprising:

Without AVX and compiled with -O3:

~$ ./test_avx512_auto_noavx 
malloc finishes!
1.07374e+08
No avx takes 3824740

With AVX and compiled without any optimization flags:

~$ ./test_avx512_auto
malloc finishes!
1.07374e+08
avx takes 2121917

With AVX and compiled with -O3:

~$ ./test_avx512_auto_o3 
malloc finishes!
1.07374e+08
avx takes 6307190

This is the opposite of what I expected.

Also, I have implemented a vectorized version (similar to Add+Mul become slower with Intrinsics - where am I wrong?); see the code below:

#else
    // Manually vectorized version with the index computation strength-reduced:
    // a running "base" vector is advanced by 4 each iteration instead of
    // rebuilding it with _mm256_set_pd inside the loop.
    auto a = std::chrono::high_resolution_clock::now();
    __m256d tmp2 = _mm256_set1_pd(0.1);
    __m256d base = _mm256_set_pd(-1.0, -2.0, -3.0, -4.0);
    __m256d tmp3 = _mm256_set1_pd(4.0);
    for (int i = 0; i < b_size; i += 4)
    {
        __m256d tmp1 = _mm256_loadu_pd(reinterpret_cast<double*>(&d_ptr[i]));
        base = _mm256_add_pd(base, tmp3);   // base now holds {i+3, i+2, i+1, i}
        __m256d tmp5 = _mm256_mul_pd(base, tmp2);
        tmp1 = _mm256_add_pd(tmp1, tmp5);
        _mm256_storeu_pd(reinterpret_cast<double*>(&d_ptr[i]), tmp1);
    }
    std::cout << d_ptr[b_size - 1] << std::endl;
    auto b = std::chrono::high_resolution_clock::now();
    long long diff = std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    std::cout << "avx takes " << diff << std::endl;

#endif

On the same machine, this gives me:

With AVX and without any optimization flags:

~$ ./test_avx512_manual 
malloc finishes!
1.07374e+08
avx takes 2151390

With AVX and with -O3:

~$ ./test_avx512_manual_o3 
malloc finishes!
1.07374e+08
avx takes 5965288

I am not sure where the problem is. Why does -O3 give worse performance?


Editor's note: in the executable names,

  • _avx512_ seems to mean -march=native, even though Haswell only has AVX2.
  • _manual vs. _auto seems to indicate -DAVX512, i.e. whether the binary runs the manually-vectorized AVX1 intrinsics or the compiler's auto-vectorization of the scalar code, which only writes with = instead of += the way the intrinsics do.
  • Without optimization, the init loop actually touches the memory and gets the page faults done before the timed region. So even though the actual loop is far less efficient, not having to pay for page faults makes it an overall win. See Idiomatic way of performance evaluation? regarding page faults on memory you touch for the first time. Initializing with something non-zero would be the easiest way to avoid this here (see the first sketch after these comments).
  • 1
    Still looking for a Q&A about GCC compiling malloc+memset(0) (or equivalent loop) into calloc; that's the key here. If you edit out the fluff that's a duplicate of Did not get expected performance speed up and just focus on the part that's faster with -O0 (default) than -O3, maybe here is the right place to post that as an answer. But with the current question having auto vs. manual vectorization, and confusing AVX512 macro which you're calling "auto" vs. "manual" separate from your -march= options I think, it's not a good place to put an answer. Commented Mar 10, 2021 at 5:07
  • 1
    Also related: Why vectorizing the loop does not have performance improvement - here, gains from smarter vectorization will be hard to see because of the memory bandwidth bottleneck. If you looped repeatedly over a small array that fits in L2 cache or even L1d, you'd have room to beat the compiler's auto-vectorization. (And page faults wouldn't be dominating your run-time.) Commented Mar 10, 2021 at 5:14
