
Would it be possible to implement a class that receives a C-style pointer as a template argument and somehow resolves into a fixed-size Eigen matrix, but using the memory provided? A declaration would look something like:

EIGEN_ALIGN16 double array[9];
CMatrix<double,3,3,array> m;

I do know about maps, but the example code I provide below has shown them to be about 20% slower than static Eigen matrices.

These would be the premises:

  • I need to provide my own C pointer. This way I can efficiently reuse C code without incurring copies.
  • The resulting matrix should look static to Eigen, so that Eigen can optimize at compile time just as it does with a static array. See my example above, where at compile time I provide both the matrix size (static) and the C pointer.
  • CMatrix should fall back to Eigen::Matrix: when the additional template parameter for the C array is not provided, I would get the normal Eigen matrix.
  • I do not intend to make a full Eigen extension. By that I mean I do not care about all the checks needed to provide a neat extension for other users; I just want the most efficient solution to this problem.

Would it be possible to implement a solution by adding a new constructor? Say something like:

EIGEN_ALIGN16 double data[9];
Eigen::Matrix<double,3,3> m(data); // data is NOT copied, but used in place of the static allocation that would otherwise be used

Below is my code for benchmarking Map vs. Matrix efficiency. It is self-contained and you can compile it with:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I/path_to_my_Eigen benchmark1.cpp -o benchmark1 -lrt

Here is the code:

#include <Eigen/Eigen>
#include <bench/BenchTimer.h>

#include <iostream>

using namespace Eigen;
using namespace std;

//#define CLASSIC_METHOD
#define USE_MAPS

EIGEN_DONT_INLINE void classic(double VO[4], double AT[4][4], double VI[4])
{
  for (int ii=0; ii<4; ii++)
    {
      VO[ii] = AT[ii][0] * VI[0] +
               AT[ii][1] * VI[1] +
               AT[ii][2] * VI[2] +
               AT[ii][3] * VI[3];
    }
}

template <typename OutputType, typename MatrixType, typename VectorType>
EIGEN_DONT_INLINE void modern(MatrixBase<OutputType>& VOE, const MatrixBase<MatrixType>& A44, const MatrixBase<VectorType>& VIE)
{
  VOE.noalias() = A44.transpose()*VIE;
}

int main()
{
  EIGEN_ALIGN16 double AT[4][4] = {{0.1, 0.2, 0.3, 2.0}, {0.2, 0.3, 0.4, 3.0}, {0.3, 0.4, 0.5, 4.0}, {0.0, 0.0, 0.0, 1.0}};
  EIGEN_ALIGN16 double VI[4] = {1, 2, 3, 4};
  EIGEN_ALIGN16 double VO[4];

//Eigen matrices

#ifndef USE_MAPS
  Matrix4d A44 = Matrix4d::MapAligned(AT[0]);
  Vector4d VIE = Vector4d::MapAligned(VI);
  Vector4d VOE(0,0,0,0);
#else
  Map<Matrix4d,Aligned> A44(AT[0]);
  Map<Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);

  // Map<Matrix4d> A44(AT[0]);
  // Map<Vector4d> VIE(VI);
  // Map<Vector4d> VOE(VO);

#endif

#ifdef EIGEN_VECTORIZE
  cout << "EIGEN_VECTORIZE defined" << endl;
#else
    cout << "EIGEN_VECTORIZE NOT defined" << endl;
#endif

  cout << "VIE:" << endl;
  cout << VIE << endl;

  VI[0] = 3.14;
  cout << "VIE:" << endl;
  cout << VIE << endl;

  BenchTimer timer;

  const int num_tries = 5;
  const int num_repetitions = 200000000;

#ifdef CLASSIC_METHOD
  BENCH(timer, num_tries, num_repetitions, classic(VO, AT, VI));
  std::cout << Vector4d::MapAligned(VO) << std::endl;
#else
  BENCH(timer, num_tries, num_repetitions, modern(VOE, A44, VIE));
  std::cout << VOE << std::endl;
#endif

  double elapsed = timer.best();
  std::cout << "elapsed time: " << elapsed*1000.0 << " ms" << std::endl;

  return 0;
}
  • Why would you think it is not possible? Commented May 5, 2014 at 15:32
  • Hint: start with using std::size_t; template <typename T, size_t Rows, size_t Cols, T (&Var)[Rows][Cols]> struct MyEfficientVector { using matrix_type = decltype(Var); matrix_type m = Var; }; and run from there :D (see the sketch after these comments) Commented May 5, 2014 at 15:52
  • @Massa. Thank you. Would I then need to inherit from Matrix/MatrixBase? How can I make this struct end up being an Eigen::Matrix? (I mean, I still want all the efficient operations, etc. that Eigen offers.) Commented May 5, 2014 at 16:06
  • @gha.st. What would you suggest to implement this then? Commented May 5, 2014 at 16:06
  • @Alejandro that is a very good question. I don't know if it's even possible, given that Eigen::MatrixBase has its data in an Eigen::DenseStorage. Maybe you should start by taking a look at Eigen/DenseStorage.h and understanding what you would need to do to replace its default storage scheme with one of your own design. In any case, your plan to do something "quick and dirty" is botched, because if you want to extend Eigen's functionality, you'll certainly have to make a "proper" Eigen extension... Commented May 5, 2014 at 16:35
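
For what it is worth, here is a minimal sketch of the kind of wrapper hinted at in the comment above, assuming C++11 and that the matrix behaviour is simply borrowed from an aligned Eigen::Map over the user-provided storage. The names CMatrix and array9 are illustrative; it does not implement the fall-back to a plain Eigen::Matrix when no pointer is given, and it still goes through Map internally, so it is a starting point rather than an answer to the performance question:

#include <Eigen/Eigen>

// Hypothetical sketch: the external array is bound as a non-type template
// parameter (it must have linkage, e.g. live at namespace scope), and the
// matrix behaviour comes from an aligned Eigen::Map over that storage.
template <typename T, int Rows, int Cols, T (*Ptr)[Rows * Cols]>
struct CMatrix : Eigen::Map<Eigen::Matrix<T, Rows, Cols>, Eigen::Aligned>
{
  typedef Eigen::Map<Eigen::Matrix<T, Rows, Cols>, Eigen::Aligned> Base;
  CMatrix() : Base(*Ptr) {}   // *Ptr decays to T*; the data is not copied
  using Base::operator=;      // keep Eigen's assignment operators usable
};

EIGEN_ALIGN16 double array9[9];      // storage provided by the user
CMatrix<double, 3, 3, &array9> m;    // a 3x3 matrix viewing array9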

1 Answer


Rather off-topic but since you stressed performance:

The assembly Eigen generates isn't always optimal - there is some overhead due to poor register reuse and to writing results back to memory (this is not to blame Eigen by any means - generating optimal code from generic templates is an impossible task).

If your kernels are fairly simple (QCD?), I would write assembly by hand (using intrinsics).

Here is your classic kernel rewritten with intrinsics. It is faster than the Eigen version, and the timing is the same for Map and Matrix types (so you don't have to invent your own types).

// _mm_hadd_pd needs SSE3 (covered by -msse4 in the commands below); the
// intrinsics are normally pulled in via Eigen, otherwise include <pmmintrin.h>.
EIGEN_DONT_INLINE void classic(double * __restrict__ VO, const double * __restrict__ AT, const double * __restrict__ VI) {
  __m128d vi01 = _mm_load_pd(VI+0);
  __m128d vi23 = _mm_load_pd(VI+2);
  for (int i = 0; i < 4; i += 2) {
    __m128d v00, v11;
    // partial sums for row i
    {
      int ii = i*4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v00 = _mm_mul_pd(at01, vi01);
      v00 = _mm_add_pd(v00, _mm_mul_pd(at23, vi23));
    }
    // partial sums for row i+1
    {
      int ii = i*4 + 4;
      __m128d at01 = _mm_load_pd(&AT[ii + 0]);
      __m128d at23 = _mm_load_pd(&AT[ii + 2]);
      v11 = _mm_mul_pd(at01, vi01);
      v11 = _mm_add_pd(v11, _mm_mul_pd(at23, vi23));
    }

    __m128d v = _mm_hadd_pd(v00, v11);
    // v is now [v00[0] + v00[1], v11[0] + v11[1]]
    _mm_store_pd(VO+i, v);
    // scalar equivalent: VO[i]   = sum over j of AT[i*4 + j]     * VI[j]
    //                    VO[i+1] = sum over j of AT[(i+1)*4 + j] * VI[j]
  }
}
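
Note that with this pointer signature the benchmark's existing call classic(VO, AT, VI) no longer compiles (a double[4][4] does not implicitly convert to const double*); the call has to become something like classic(VO, &AT[0][0], VI).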

You may gain some additional improvement by interleaving loads and mul/adds - I tried to keep it simple.

These are the results:

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 611.397 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DCLASSIC_METHOD -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 615.541 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -DUSE_MAPS -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1
elapsed time: 981.941 ms

g++ -Ofast -DNDEBUG -DEIGEN_NO_MALLOC -I /usr/local/eigen benchmark1.cpp -o benchmark1 -lrt -msse4; ./benchmark1 
elapsed time: 838.852 ms

On a further note, you could possibly write a better SIMD kernel if your matrix were transposed - horizontal adds (_mm_hadd_pd) are expensive.
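
Purely as an illustration of that point, here is a minimal sketch of such a kernel. It assumes the caller stores the matrix as the transpose of AT (column j of AT in the four doubles starting at A + 4*j, 16-byte aligned); the name classic_transposed and that layout are assumptions, not part of the benchmark above:

// Sketch only: A is assumed to hold the transpose of AT, so each column of AT
// is contiguous and no horizontal add is needed (SSE2 suffices).
EIGEN_DONT_INLINE void classic_transposed(double * __restrict__ VO,
                                          const double * __restrict__ A,
                                          const double * __restrict__ VI)
{
  __m128d acc01 = _mm_setzero_pd();   // accumulates VO[0], VO[1]
  __m128d acc23 = _mm_setzero_pd();   // accumulates VO[2], VO[3]
  for (int j = 0; j < 4; ++j) {
    __m128d vij = _mm_set1_pd(VI[j]);                                    // broadcast VI[j]
    acc01 = _mm_add_pd(acc01, _mm_mul_pd(_mm_load_pd(A + 4*j + 0), vij));
    acc23 = _mm_add_pd(acc23, _mm_mul_pd(_mm_load_pd(A + 4*j + 2), vij));
  }
  _mm_store_pd(VO + 0, acc01);
  _mm_store_pd(VO + 2, acc23);
}

Whether this actually beats the _mm_hadd_pd version would of course have to be measured.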

To add to the discussion in the comments: moving the Eigen maps inside the function removes the timing difference between Map and Matrix arguments.

EIGEN_DONT_INLINE void mapped(double (&VO)[4], const double (&AT)[4][4], const double (&VI)[4]) {
  Map<const Matrix4d,Aligned> A44(&AT[0][0]);
  Map<const Vector4d,Aligned> VIE(VI);
  Map<Vector4d,Aligned> VOE(VO);
  VOE.noalias() = A44.transpose()*VIE;
}

This is the top of the assembly when a Map is passed to the function (function not inlined):

    movq    (%rsi), %rcx
    movq    (%rdx), %rax
    movq    (%rdi), %rdx
    movapd  (%rcx), %xmm0
    movapd  16(%rcx), %xmm1
    mulpd   (%rax), %xmm0
    mulpd   16(%rax), %xmm1

compared to passing an array reference (with the Map constructed inside) or a Matrix:

    movapd  (%rsi), %xmm0
    movapd  16(%rsi), %xmm1
    mulpd   (%rdx), %xmm0
    mulpd   16(%rdx), %xmm1

7 Comments

Wow, this is amazing work, thank you!! However, writing assembly for each of my operations is way beyond what I could do with my finite time. Do you know if this is something Eigen does in the background? If not, do you know of any other library that does it? What do you think about the idea I proposed of passing a C array as an additional template argument? Would you know if it is feasible? The idea is to cheat Eigen at compile time into believing it is creating a static array, but actually with memory passed by the user.
Yes, basically Eigen generates SSE instructions (provided the data is aligned, etc.). However, it is not as efficient as it could be. You can inspect the assembly by passing "-S" to the compiler to generate assembly source (omit "-o"). You can certainly hack Eigen to do that, but it may be a lot of work - the internals are rather complicated. The Map should be essentially the same as Matrix. I realize that the runtimes are very different - let me see if I can figure out what happens.
My guess is the difference between Map and Matrix is caused by the no-inline statement - a Map requires a few extra manipulations to extract the addresses of the arrays from the function arguments. Other than that, they do exactly the same thing. Move the maps inside the function and pass the arrays to it - the results will be the same. I am adding a code sample to demonstrate what I mean.
Wow, yes you are right!! I tried your code and I did see the difference. So it looks like passing a Map as a function argument is a bad idea unless the function gets inlined? I still don't understand what you mean by your statement "Map requires a few extra manipulations to extract the addresses of the arrays from the function arguments". Could you expand on this? Thank you!
Look at the very last two listings in the answer. See how the first three lines are movq? Those are the instructions generated, when the Map is not inlined, to extract the data addresses. Note that the instructions are completely redundant - however, for whatever reason, the compiler put them there. Passing a Map is not a bad idea, but not inlining a function that does a mat-vec on a 4x4 matrix is a bad idea.