
I'm writing code that must perform a dot product between a vector b and a matrix C, where the dot product is taken between b and each row of C.

I made two implementations, one in C++ and one in Python. The latter performs better: I'd like to obtain the same speed in C++.

There is one constraint though: improvements can't change the way the data is stored, only the way the computation is performed.

The C++ code is the following:

#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>

using namespace std;
using namespace std::chrono;

int main() {
    
    size_t jmax = 700;
    size_t kmax = 100000;

    vector<vector<double>> C(jmax + 1, vector<double>(kmax));
    vector<double> b(jmax + 1);
    
    random_device rd;
    mt19937 gen(rd());
    uniform_real_distribution<double> dist_b(5.0, 10.0);
    uniform_real_distribution<double> dist_C(-1.0, 1.0);

    // fake b and C
    for (size_t j = 0; j <= jmax; ++j) {
        b[j] = dist_b(gen);
        for (size_t k = 0; k < kmax; ++k) {
            C[j][k] = dist_C(gen);
        }
    }

    double* eps_ptr = C[jmax].data();
    
    auto start = high_resolution_clock::now();

    for (size_t j = 0; j < jmax; ++j) {
        double* c_ptr = C[j].data();
        double bj = b[jmax - j];
        
        #pragma loop(ivdep)
        for (size_t k = 0; k < kmax; ++k) {
            eps_ptr[k] += bj * c_ptr[k];
        }
    }
    
    auto end = high_resolution_clock::now();
    auto ms = duration_cast<milliseconds>(end - start);

    cout << ms.count() << endl;

    return 0;
}

The Python code is the following:

import numpy as np
import time

jmax = 700
kmax = 100000

C = np.random.uniform(-1.0, 1.0, size=(jmax, kmax))
b = np.random.uniform(5.0, 10.0, size=jmax)

start = time.time()

bb = b[::-1]

eps = np.dot(bb, C)
end = time.time()
print(int((end - start) * 1000))

Do you have any suggestions? (The C++ code is compiled with -O2.) To be clear, I'm only interested in improving the performance of the second part of the code, where the dot product is performed.

  • what is OV? Don't describe the code but show it. Please post a minimal reproducible example. Commented Oct 24 at 10:17
  • Please provide a minimal reproducible example and full compilation details (switches, version etc.) so we can check this out for ourselves. Commented Oct 24 at 10:19
  • Did you try Eigen? eigen.tuxfamily.org (a sketch of this idea follows after these comments) Commented Oct 24 at 10:21
  • Where is a dot product in this C++ code example? I do not see any multiplication. Commented Oct 24 at 11:15
  • You are actually comparing your C++ code against an open source math library written in C. Python is simply printing your timing result. You can look at this C code to see how it performs faster. Or you can use a fast C++ math library, just like you are doing in Python. Commented Oct 24 at 13:19
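
One of the comments above suggests Eigen. As a rough, hedged sketch (not from the original post; the helper name accumulate_eigen is made up here), the per-row update can be written with Eigen::Map over the existing row buffers, so the storage layout (one separate allocation per row) stays unchanged and Eigen vectorises each row update:

#include <Eigen/Dense>
#include <vector>
#include <cstddef>

// Hypothetical helper: accumulates the dot of b with each row of C into eps_ptr,
// mapping the existing buffers instead of copying them.
void accumulate_eigen(const std::vector<std::vector<double>>& C,
                      const std::vector<double>& b,
                      double* eps_ptr, std::size_t jmax, std::size_t kmax) {
    Eigen::Map<Eigen::VectorXd> eps(eps_ptr, kmax);
    for (std::size_t j = 0; j < jmax; ++j) {
        Eigen::Map<const Eigen::VectorXd> row(C[j].data(), kmax);
        eps += b[jmax - j] * row;   // vectorised AXPY on each row
    }
}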

1 Answer


The C++ code is slower mainly because it is less optimised than Numpy (which is written in C and quite optimised, though not always optimal). More specifically, a tile-based optimisation is missing to make the code faster. Moreover, Numpy internally calls a BLAS function which uses multiple threads, while your implementation is sequential.

Indeed, you need one load, one store and one FMA (fused multiply-add) operation per iteration of the inner loop. This is not optimal because most CPUs can execute more FMA instructions per cycle than stores. As a result, your current loop is bound by stores. Tiling significantly reduces the number of stores needed to perform the same work. It also reduces the number of loads and improves the use of the CPU caches. Multithreading mainly helps to better saturate the memory hierarchy, since a single core often cannot saturate it on its own.
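
To make the store argument concrete, here is a minimal sketch of a tile of size 2 (an illustration only, not part of the measured code; the full 8-way version follows below). Two rows are combined per pass over eps, so each store to eps_ptr now amortises two FMAs instead of one:

    // Sketch: tile of size 2; the remainder loop stays the same as in the question.
    for (size_t j = 0; j + 1 < jmax; j += 2) {
        const double* c0 = C[j + 0].data();
        const double* c1 = C[j + 1].data();
        double b0 = b[jmax - j - 0];
        double b1 = b[jmax - j - 1];

        for (size_t k = 0; k < kmax; ++k) {
            eps_ptr[k] += b0 * c0[k] + b1 * c1[k];   // one store, two FMAs
        }
    }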


Optimised implementation

Here is a modified version using tiling, where 8 iterations of j are unrolled into the k-based loop, combined with OpenMP for multithreading:

    size_t j = 0;

    for (j = 0; j+7 < jmax; j+=8) {
        double* c_ptr1 = C[j+0].data();
        double* c_ptr2 = C[j+1].data();
        double* c_ptr3 = C[j+2].data();
        double* c_ptr4 = C[j+3].data();
        double* c_ptr5 = C[j+4].data();
        double* c_ptr6 = C[j+5].data();
        double* c_ptr7 = C[j+6].data();
        double* c_ptr8 = C[j+7].data();
        double bj1 = b[jmax - j - 0];
        double bj2 = b[jmax - j - 1];
        double bj3 = b[jmax - j - 2];
        double bj4 = b[jmax - j - 3];
        double bj5 = b[jmax - j - 4];
        double bj6 = b[jmax - j - 5];
        double bj7 = b[jmax - j - 6];
        double bj8 = b[jmax - j - 7];
        
        // No need to use too many threads, it will likely not scale anyway.
        // More threads means more time to perform the fork-join, especially on large-scale servers.
        #pragma omp parallel for num_threads(4)
        for (size_t k = 0; k < kmax; ++k) {
            double res = eps_ptr[k];
            res += bj1 * c_ptr1[k];
            res += bj2 * c_ptr2[k];
            res += bj3 * c_ptr3[k];
            res += bj4 * c_ptr4[k];
            res += bj5 * c_ptr5[k];
            res += bj6 * c_ptr6[k];
            res += bj7 * c_ptr7[k];
            res += bj8 * c_ptr8[k];
            eps_ptr[k] = res;
        }
    }

    for (; j < jmax; ++j) {
        double* c_ptr = C[j].data();
        double bj = b[jmax - j];
        
        //#pragma loop(ivdep)
        for (size_t k = 0; k < kmax; ++k) {
            eps_ptr[k] += bj * c_ptr[k];
        }
    }

On top of that, you can use -O3 instead of -O2 and leverage the AVX and FMA instruction sets for better performance. AVX is available on more than 97% of PCs, so using it is rather safe; all recent x86-64 CPUs should support it AFAIK. Be aware that if the machine does not support it, then the target executable will certainly crash. You can generate two binaries from the same code with different flags if this is a problem for you.
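
If shipping two binaries is not convenient, an alternative sketch (not part of the original answer; __builtin_cpu_supports is a GCC/Clang builtin, and the check itself should live in a translation unit compiled without AVX codegen) is to test the CPU features at startup and fail cleanly instead of crashing on an illegal instruction:

#include <cstdio>
#include <cstdlib>

int main() {
    // GCC/Clang builtin: returns non-zero if the running CPU supports the feature.
    if (!__builtin_cpu_supports("avx") || !__builtin_cpu_supports("fma")) {
        std::fprintf(stderr, "This build requires AVX and FMA support.\n");
        return EXIT_FAILURE;
    }
    // ... call the AVX/FMA-optimised kernel from here ...
    return 0;
}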


Performance results

Here are performance results on Windows with Clang-CL:

Python code (Numpy 2.1.3):                        14 ms

Initial C++ code (`-O2` flag):                    34 ms
C++ code with tiling (`-O2` flag):                24 ms
C++ code with tiling (`-O3 -mavx -mfma` flags):   22 ms
Same as before, but with a tile of size 8:        20 ms
Same as before, but with `-fopenmp` enabled:      14 ms    <----------

We can see that the best optimised C++ implementation with tiling is as fast as Numpy (installed from pip). Both seem memory-bound (due to the big matrix to load from the comparatively slow DRAM).

By the way, using a bigger tile size, like 16, is rather overkill. It does not give much performance benefit while it increases the risk of register spilling. It also makes memory accesses look more random to the CPU, so they become less efficient. With a tile of size 32, it actually makes things slower. It also makes the code bigger, less maintainable, and slower to compile. Actually, I used a tile size of 8 only to mitigate the overhead of fork-join operations on CPUs with many cores; on my machine, a tile size of 4 is already optimal.
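
If you want to experiment with the tile size without duplicating the unrolled body by hand, a possible sketch (not from the original answer; it assumes the same includes and using-declarations as the question's code, and the helper name accumulate_tile is made up) is to make the tile size a template parameter so the compiler unrolls the inner accumulation:

template <size_t TILE>
void accumulate_tile(const vector<vector<double>>& C, const vector<double>& b,
                     double* eps_ptr, size_t j, size_t jmax, size_t kmax) {
    const double* rows[TILE];
    double coeffs[TILE];
    for (size_t t = 0; t < TILE; ++t) {
        rows[t] = C[j + t].data();
        coeffs[t] = b[jmax - j - t];
    }
    for (size_t k = 0; k < kmax; ++k) {
        double res = 0.0;
        for (size_t t = 0; t < TILE; ++t)   // fully unrolled by the compiler
            res += coeffs[t] * rows[t][k];
        eps_ptr[k] += res;
    }
}

The main loop then becomes a call such as accumulate_tile<8>(C, b, eps_ptr, j, jmax, kmax) with j advancing by 8, and switching to a tile of 4 or 16 is a one-character change rather than a copy-paste exercise.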
