
One very common operation in my program is taking a vector and producing a vector in the same direction whose magnitude is the inverse of the original vector's squared magnitude (v -> v̂/|v|², equivalently v/|v|³). I'm using the Eigen library for handling vectors, since it's supposed to be quite fast.

So far I've computed this vector like so:

Vector2d directedInverseSquare(const Vector2d &diff) {
    return diff / (diff.norm() * diff.squaredNorm());
}

This has worked fine, but it's not very efficient, since (to my knowledge) it computes both norm() and squaredNorm() separately.

So I've been thinking for a while about modifying the Eigen source code to make a new type of norm, which fixes this double computation.

Important note: caching squaredNorm() in a variable sNorm and computing std::sqrt(sNorm) * sNorm always comes up short compared to the above implementation.
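For reference, that variant looks roughly like this (a sketch of the idea, not necessarily the exact code I benchmarked; the function name is just for illustration):

#include <cmath>
#include <Eigen/Dense>

using Eigen::Vector2d;

// Reuse the squared norm instead of calling norm() and squaredNorm() separately.
Vector2d directedInverseSquareCached(const Vector2d &diff) {
    const double sNorm = diff.squaredNorm();      // |diff|^2, computed once
    return diff / (std::sqrt(sNorm) * sNorm);     // divide by |diff|^3
}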

I'm nowhere near good enough at C++ to understand how the Eigen source code works in any capacity, but after looking around, I've found what I'm fairly confident is the right place to implement my norm.

In the Dot.h file of Eigen/src/Core, I found this definition:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const typename MatrixBase<Derived>::PlainObject
MatrixBase<Derived>::normalized() const
{
  typedef typename internal::nested_eval<Derived,2>::type _Nested;
  _Nested n(derived());
  RealScalar z = n.squaredNorm();
  // NOTE: after extensive benchmarking, this conditional does not impact performance, at least on recent x86 CPU
  if(z>RealScalar(0))
    return n / numext::sqrt(z);
  else
    return n;
}

I copied it, and modified the line return n / numext::sqrt(z) to n / (z*numext::sqrt(z)). I also changed the name to newtonianNormalized (owing to its origins) and added a declaration in an earlier part of the file. I did the same with the norm's definition:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE typename NumTraits<typename internal::traits<Derived>::Scalar>::Real MatrixBase<Derived>::norm() const
{
  return numext::sqrt(squaredNorm());
}

For which I added the line RealScalar squared = squaredNorm();, and changed the original line to return numext::sqrt(squared)*squared;.

In both instances, the change needed for a Newtonian norm only requires one extra multiplication, which is a light cost. No other new computations are performed.

After benchmarking, the results are very strange. My benchmarking consists of running large loops of the same operation over and over again and summing up all the results, to prevent the compiler from optimizing the loop away entirely. I use v.setRandom() to get random vectors, and I ran the loops and measured over ~20 runs of the program.
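A minimal sketch of the kind of loop I mean (simplified; the real benchmark runs several variants, and details such as where setRandom() is called may differ):

#include <chrono>
#include <iostream>
#include <Eigen/Dense>

using Eigen::Vector2d;

// Same function as above.
Vector2d directedInverseSquare(const Vector2d &diff) {
    return diff / (diff.norm() * diff.squaredNorm());
}

int main() {
    constexpr long kIterations = 100'000'000;
    Vector2d v;
    Vector2d sum = Vector2d::Zero();

    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIterations; ++i) {
        v.setRandom();                       // random input vector
        sum += directedInverseSquare(v);     // accumulate so the loop isn't optimized away
    }
    const auto end = std::chrono::steady_clock::now();

    std::cout << "sum = " << sum.transpose() << '\n';
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
    return 0;
}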

The results, presented as very basic bar charts:

[Chart: "Newtonian normed vector calculations"] This chart compares the different methods of calculating the "normed" vectors. The results are in milliseconds, using a loop of 100,000,000 iterations.

[Chart: norm calculations] This chart compares the different methods of calculating the "norm". The results are in milliseconds, also using a loop of 100,000,000 iterations.

Note that the charts start nowhere close to 0, so it's hard to see at first glance that the results barely differ at all.

I really expected a major performance boost: I cut the number of calculations basically in half and reuse the computed quantities instead. How could it possibly be basically the same?

Edit: I compile with MSYS2 MinGW64 (GCC 13.2.0) and CLion + CMake, with the following command line:

C:\msys64\mingw64\bin\c++.exe -IC:/Users/.../MINREP/include/eigen-3.4.0 -isystem C:/Users/.../MINREP/include/SFML-2.6.1/include -O3 -DNDEBUG -march=native -std=gnu++23 -fdiagnostics-color=always -MD -MT CMakeFiles/MINREP.dir/main.cpp.obj -MF CMakeFiles\MINREP.dir\main.cpp.obj.d -o CMakeFiles/MINREP.dir/main.cpp.obj -c C:/Users/.../MINREP/main.cpp

I take the time before and after each loop with the <chrono> library. Please tell me if any other info is needed.

Edit: Here are both functions after modification:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE typename NumTraits<typename internal::traits<Derived>::Scalar>::Real MatrixBase<Derived>::newtonianNorm() const
{
    RealScalar squared = squaredNorm();
    return numext::sqrt(squared)*squared;
}
template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const typename MatrixBase<Derived>::PlainObject
MatrixBase<Derived>::newtonianNormalized() const
{
    typedef typename internal::nested_eval<Derived,2>::type _Nested;
    _Nested n(derived());
    RealScalar z = n.squaredNorm();
    // NOTE: after extensive benchmarking, this conditional does not impact performance, at least on recent x86 CPU
    if(z>RealScalar(0))
        return n / (z*numext::sqrt(z));
    else
        return n;
}
  • Rather than describing what you did to the code, you should show the code. Please read about how to create a minimal reproducible example. In addition, questions about performance need more information than usual: how did you compile? How did you measure? Commented Feb 9, 2024 at 10:12
  • Eigen is based on expression templates. I am not familiar with the details, but in a nutshell, results are only computed when needed. Rather than the result, the function returns a proxy that computes the result only when it is needed. Again, I am not into the details, but expression templates could manage to merge the two loops you are worried about into one. Commented Feb 9, 2024 at 10:14
  • @463035818_is_not_an_ai Well, the source code for the Eigen library is massive. Even a single file would be way too large to show, so I don't see any other option. I performed very minor changes, so I didn't think it'd be helpful to show almost identical code twice. As for the info, I added an edit to the end of the post. Please tell me if you need anything else. Commented Feb 9, 2024 at 10:18
  • I was referring to the code you posted. As I understand it, you posted the original code and only explained what you modified. But the question is about the modified code. Commented Feb 9, 2024 at 10:19
  • "so I didn't think it'd be helpful to show almost identical code twice" isn't the question about exactly that difference? Why do you think it would not be useful to show it? I mean, you are asking "How could it possibly be basically the same?"... and now you argue the code is almost identical ;) Commented Feb 9, 2024 at 10:20

1 Answer


This answer explains why the initial code (directedInverseSquare) is already well optimized, and therefore why the alternative implementations do not have much impact on performance in practice on your target machine. It also provides a way to improve performance.


The function directedInverseSquare alone results in the following assembly code (when it is not inlined):

        vmovapd xmm2, XMMWORD PTR [rsi]
        mov     rax, rdi
        vmulpd  xmm0, xmm2, xmm2
        vunpckhpd       xmm1, xmm0, xmm0
        vaddsd  xmm1, xmm0, xmm1
        vmovq   xmm0, xmm1
        vsqrtsd xmm0, xmm0, xmm0
        vmulsd  xmm0, xmm0, xmm1
        vmovddup        xmm0, xmm0
        vdivpd  xmm2, xmm2, xmm0
        vmovapd XMMWORD PTR [rdi], xmm2
        ret

This code basically loads the input vector from memory into a register, squares the two components (using only one SIMD instruction), shuffles them so as to add the two resulting squared values, then computes the scalar square root, multiplies the result by the squared norm so as to compute diff.norm() * diff.squaredNorm() (i.e. the norm cubed), and finally performs the division before storing the result back to memory. All of this is done in a scalar way (except the initial squaring and the final division). One can also see that the squared norm is already computed only once. This is because the compiler can inline the functions and then detect common sub-expressions (see common sub-expression elimination).
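To see the common sub-expression at the source level: conceptually, once norm() and squaredNorm() are inlined, the function reduces to something like the following (a sketch with a hypothetical name, not Eigen's actual expansion):

#include <cmath>
#include <Eigen/Dense>

// Conceptual view after inlining norm() and squaredNorm(): the dot product
// is a common sub-expression, so the compiler computes it only once.
Eigen::Vector2d directedInverseSquareExpanded(const Eigen::Vector2d &diff) {
    const double s = diff.x() * diff.x() + diff.y() * diff.y(); // shared squared norm
    const double n = std::sqrt(s);                              // norm reuses s
    return diff / (n * s);                                      // divide by |diff|^3
}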

If this function is called in a loop and inlined, the situation is not much better (see on godbolt). It is hard for the compiler to generate faster code since the input layout is not SIMD-friendly. A more aggressively optimizing compiler could certainly shuffle the input so as to compute multiple square roots and perform multiple divisions at once thanks to SIMD instructions. In practice, GCC does not seem to be able to do that.

On your Intel Tiger Lake CPU, the loop is clearly limited by the execution port responsible for computing both the scalar division and the scalar square root, and apparently only by that (assuming the loop is sufficiently big and not memory-bound). This means that all other instructions have (nearly) no overhead (thanks to super-scalar execution). Since all the alternatives still need to compute the scalar division and scalar square root, this explains why you do not see much of a difference (what remains is certainly just execution noise, cache effects, or possibly a slight variation of the tiny overhead of the other instructions).

The main way to make this computation faster is to compute the square root and division using SIMD instructions. To do that, you can either do it manually (often cumbersome and not portable, or it requires external libraries), or help the compiler auto-vectorize the loop by storing the components more contiguously. This means the first component of all the vectors should be stored in one array and the second in another array. The compiler can then read the components using packed SIMD instructions and also use SIMD instructions for the rest of the computation; a sketch of such a layout is shown below. Your CPU can compute twice as many divisions and twice as many square roots per cycle with SIMD instructions. Thus, the resulting code can be up to twice as fast (and it should be close to that in practice).
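For example, a structure-of-arrays version of the hot loop could look like the following (a sketch with hypothetical function and array names, not benchmarked here):

#include <cmath>
#include <cstddef>

// Structure-of-arrays layout: x[i], y[i] are the components of the i-th vector.
// With -O3 -march=native (and possibly -fno-math-errno), the compiler can
// auto-vectorize this loop using packed loads, square roots and divisions.
void directedInverseSquareSoA(const double *x, const double *y,
                              double *outX, double *outY, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const double s = x[i] * x[i] + y[i] * y[i];  // squared norm
        const double inv = 1.0 / (s * std::sqrt(s)); // 1 / |v|^3
        outX[i] = x[i] * inv;
        outY[i] = y[i] * inv;
    }
}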

If your input data is huge, then please note that you can also use multiple threads (see the sketch below).
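For example, with OpenMP the structure-of-arrays loop above can be distributed across cores with a single pragma (a sketch; it assumes the iterations are independent and that OpenMP is enabled, e.g. with -fopenmp):

#include <cmath>
#include <cstddef>

// Multi-threaded variant of the structure-of-arrays loop (compile with -fopenmp).
void directedInverseSquareSoAParallel(const double *x, const double *y,
                                      double *outX, double *outY, std::ptrdiff_t n) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const double s = x[i] * x[i] + y[i] * y[i];
        const double inv = 1.0 / (s * std::sqrt(s));
        outX[i] = x[i] * inv;
        outY[i] = y[i] * inv;
    }
}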

PS: the version of Eigen used for the test is 3.4.0.


Comments

This is amazing, and insanely insightful. Thank you so so much. I have a couple of questions about the faster computation method you presented: Firstly, I can't find much info about what "packed" SIMD means. Do you mind elaborating? Secondly, I don't fully understand why the CPU would be able to compute more divisions/sqrts with SIMD instructions. Finally, if I'd like to run my program on many/all CPU cores at the same time, would that interfere with SIMD? I heard SIMD uses other cores to parallelize computations. Thank you very much again.
By "read components using packed SIMD instructions", I mean more specifically "packed load SIMD instructions", which consist in loading a chunk of several contiguous items from memory (the same applies to stores). Note that there are other SIMD instructions to load data from non-contiguous memory locations, but such instructions (e.g. gather loads) are generally significantly slower and have only become available fairly recently.
SIMD instructions operate on wide SIMD registers containing several items (e.g. 2 double-precision floating-point numbers in SSE and 4 in AVX). SIMD instructions can operate on all items of a register simultaneously. This strategy is efficient (lower energy consumption for the same work, or higher throughput compared to scalar code). One downside of SIMD is that it only speeds up code using SIMD instructions, and not all algorithms are SIMD-friendly.
Scalar floating-point instructions can only use the first lane of the SIMD computing units on current CPUs. Thus, using SIMD instructions enables your code to access additional SIMD computing lanes that are impossible to reach with scalar code (there are low-level reasons for that, but it is complex to explain here -- put shortly, it would be too expensive for the CPU).
"would that interfere with SIMD" Yes, it can, due to energy consumption and power throttling. This depends on the exact target CPU though. Scalar code generally consumes less power than SIMD code (it is a bit complicated in practice). Thus, using SIMD instructions tends to force the CPU to reduce its frequency so as to lower its power draw, while still being faster overall because SIMD instructions can make code several times faster. For example, Intel CPUs tend to use the turbo frequency for scalar code, but not when running (FP-)intensive SIMD code using AVX. AVX-512 is also a good example.
