I'm writing code that must compute a dot product between a vector b and a matrix C, where the dot product is taken between b and each column of C (that is, the vector-matrix product of b and C).
I made two implementations, one in C++ and one in Python. The latter performs better, and I'd like to obtain the same speed in C++.
There is one constraint, though: the way the data is stored cannot be changed, only the way the computation is performed.
The C++ code is the following:
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
using namespace std;
using namespace std::chrono;
int main() {
    size_t jmax = 700;
    size_t kmax = 100000;
    vector<vector<double>> C(jmax + 1, vector<double>(kmax));
    vector<double> b(jmax + 1);
    random_device rd;
    mt19937 gen(rd());
    uniform_real_distribution<double> dist_b(5.0, 10.0);
    uniform_real_distribution<double> dist_C(-1.0, 1.0);
    // fake b and C
    for (size_t j = 0; j <= jmax; ++j) {
        b[j] = dist_b(gen);
        for (size_t k = 0; k < kmax; ++k) {
            C[j][k] = dist_C(gen);
        }
    }
    double* eps_ptr = C[jmax].data();
    auto start = high_resolution_clock::now();
    for (size_t j = 0; j < jmax; ++j) {
        double* c_ptr = C[j].data();
        double bj = b[jmax - j];
        #pragma loop(ivdep)
        for (size_t k = 0; k < kmax; ++k) {
            eps_ptr[k] += bj * c_ptr[k];
        }
    }
    auto end = high_resolution_clock::now();
    auto ms = duration_cast<milliseconds>(end - start);
    cout << ms.count() << endl;
    return 0;
}
The Python code is the following:
import numpy as np
import time
jmax = 700
kmax = 100000
C = np.random.uniform(-1.0, 1.0, size=(jmax, kmax))
b = np.random.uniform(5.0, 10.0, size=jmax)
start = time.time()
bb = b[::-1]
eps = np.dot(bb, C)
end = time.time()
print(int((end - start) * 1000))
Do you have any suggestions? (The C++ code is compiled with -O2.) To be clear, I'm only interested in improving the performance of the second part of the code, where the dot product is performed.
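One direction that leaves the storage untouched is to tile the k loop, so that a small slice of the accumulator row is reused across all j before moving to the next slice. The following is only a minimal sketch of that idea, not a measured result: the function name `accumulate_tiled` and the default tile size of 2048 are assumptions made for illustration, and whether it helps at all depends on the compiler, the flags, and the cache sizes of the machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original code).
// Performs eps[k] += b[jmax - j] * C[j][k] for j in [0, jmax),
// the same update as the timed loop, but walks k in tiles so that
// each slice of eps is reused for every j before moving on.
void accumulate_tiled(const std::vector<std::vector<double>>& C,
                      const std::vector<double>& b,
                      std::size_t jmax,
                      std::vector<double>& eps,
                      std::size_t tile = 2048) {  // tile size is a guess, tune it
    assert(tile > 0);
    assert(b.size() > jmax && C.size() >= jmax);
    const std::size_t kmax = eps.size();
    double* e = eps.data();
    for (std::size_t k0 = 0; k0 < kmax; k0 += tile) {
        const std::size_t k1 = std::min(k0 + tile, kmax);  // last tile may be shorter
        for (std::size_t j = 0; j < jmax; ++j) {
            const double* c = C[j].data();
            const double bj = b[jmax - j];
            for (std::size_t k = k0; k < k1; ++k) {
                e[k] += bj * c[k];
            }
        }
    }
}
```

With the variables from the program above, the timed loop would become the single call `accumulate_tiled(C, b, jmax, C[jmax]);`. For each k the additions happen in the same j order as in the original loop, so the result should match it, up to any differences in how the compiler contracts the multiply-add.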
OV? Don't describe the code but show it. Please post a minimal reproducible example