
I've run some benchmarks on Math.Log, System.Numerics.Vector.Log, System.Runtime.Intrinsics.Vector128.Log, Vector256.Log and Vector512.Log and the results were pretty surprising to me. I was expecting the generic Vector.Log to perform similarly to its non-generic counterparts, however it was not only significantly slower than them, but even slower than Math.Log. From reading this https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intrinsics/ I had assumed that Vector<T> would internally make calls to one of the non-generic implementations. Can someone explain to me why Vector<T> is so slow?

I've included my benchmarks below for reference:

Intel Xeon Silver 4214R CPU 2.40GHz, 2 CPU, 48 logical and 24 physical cores  
Benchmark Process Environment Information:  
BenchmarkDotNet v0.13.10  
Runtime=.NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2  
GC=Concurrent Workstation  
HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT  VectorSize=256
| Method           | Mean       | Error    | StdDev    | Median     |
|----------------- |-----------:|---------:|----------:|-----------:|
| MathLog          |   777.2 us | 14.34 us |  13.42 us |   773.1 us |
| GenericVectorLog | 2,537.3 us | 50.44 us | 139.77 us | 2,496.8 us |
| Vector128Log     |   407.8 us |  7.16 us |   6.70 us |   403.8 us |
| Vector256Log     |   243.3 us |  4.54 us |   4.46 us |   243.0 us |
| Vector512Log     |   429.4 us |  8.43 us |  10.66 us |   427.7 us |

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

public class LogBenchmarks
{
    double[] inputArray = new double[100000];
    double[] outputArray = new double[100000];
    public LogBenchmarks()
    {
        var random = new Random(0);
        for (int i = 0; i < inputArray.Length; i++)
        {
            inputArray[i] = random.NextDouble();
        }
    }
    [Benchmark]
    public void MathLog()
    {
        for (int i = 0; i < inputArray.Length; i++)
        {
            outputArray[i] = Math.Log(inputArray[i]);
        }
    }

    [Benchmark]
    public void GenericVectorLog()
    {
        var input = MemoryMarshal.Cast<double, Vector<double>>(new ReadOnlySpan<double>(inputArray));
        var output = MemoryMarshal.Cast<double, Vector<double>>(new Span<double>(outputArray));
        var i = 0;
        while (i < input.Length)
        {
            output[i] = Vector.Log(input[i]);
            i++;
        }
    }

    [Benchmark]
    public void Vector128Log()
    {
        var input = MemoryMarshal.Cast<double, Vector128<double>>(new ReadOnlySpan<double>(inputArray));
        var output = MemoryMarshal.Cast<double, Vector128<double>>(new Span<double>(outputArray));
        var i = 0;
        while (i < input.Length)
        {
            output[i] = Vector128.Log(input[i]);
            i++;
        }
    }

    [Benchmark]
    public void Vector256Log()
    {
        var input = MemoryMarshal.Cast<double, Vector256<double>>(new ReadOnlySpan<double>(inputArray));
        var output = MemoryMarshal.Cast<double, Vector256<double>>(new Span<double>(outputArray));
        var i = 0;
        while (i < input.Length)
        {
            output[i] = Vector256.Log(input[i]);
            i++;
        }
    }

    [Benchmark]
    public void Vector512Log()
    {
        var input = MemoryMarshal.Cast<double, Vector512<double>>(new ReadOnlySpan<double>(inputArray));
        var output = MemoryMarshal.Cast<double, Vector512<double>>(new Span<double>(outputArray));
        var i = 0;
        while (i < input.Length)
        {
            output[i] = Vector512.Log(input[i]);
            i++;
        }
    }
}
  • Because the arbitrary-size generic class is less optimized than the fixed-size classes, which match the size of SIMD registers exactly. "Internally make calls to one of the non-generic implementations" is itself a cost. The operations you use are so fast that the difference between RAM and cache access, or the need to load registers from RAM, has a serious impact. The benchmarks measure allocations too, not just the Vector.Log performance.
  • Using a seq etc. to generate the inputs within the timed code is going to have an impact. Better to pre-generate arrays to avoid this (see the sketch after these comments).
  • Inputs isn't an array, it's essentially a LINQ query. All benchmarks loop over that seq at the end, so what's actually being measured here? I'd copy the code to clean it up and test, but it's unclear what's going on right now.
  • FYI for everyone commenting on float/double: these are the same thing in F#. Also, the array is only instantiated once, and even if BenchmarkDotNet did this inside the timed code (I would be surprised to learn that it did), it would do so during one of the non-timed warmup runs.
  • @PeterCordes: as for Inputs, the code contains both Inputs, which produces the test data, and Input, which is the actual array. I don't guess about seq and the seq {} block; I've used F# in production for ETL. Never used float though, always decimal. BenchmarkDotNet wouldn't include the data generation time in the timings.
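
A minimal sketch of that pre-generation suggestion using BenchmarkDotNet's [GlobalSetup] (the class and field names here are illustrative, not from the original code):

using System;
using BenchmarkDotNet.Attributes;

public class LogBenchmarksSetupSketch
{
    private double[] inputArray = Array.Empty<double>();

    // [GlobalSetup] runs once, outside the timed region, so data generation
    // never leaks into the measurements.
    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(0);
        inputArray = new double[100_000];
        for (int i = 0; i < inputArray.Length; i++)
        {
            inputArray[i] = random.NextDouble();
        }
    }
}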

2 Answers


I posted a new answer because I finally managed to replicate the question's results. Something weird is going on, and it doesn't look to have anything to do with SIMD. I could confirm the odd Log behavior, but only on .NET 9. It looks as if .NET 9 calls the un-accelerated generic Log<T>(Vector<T>) instead of the accelerated Log(Vector<double> vector).

The source code for Log<T>(Vector<T> vector) and Log(Vector<double> vector) doesn't seem to have changed between .NET 9.0.1 and .NET 10, and only Log(Vector<double> vector) is accelerated:

    internal static Vector<T> Log<T>(Vector<T> vector)
        where T : ILogarithmicFunctions<T>
    {
        Unsafe.SkipInit(out Vector<T> result);

        for (int index = 0; index < Vector<T>.Count; index++)
        {
            T value = T.Log(vector.GetElementUnsafe(index));
            result.SetElementUnsafe(index, value);
        }

        return result;
    }

    /// <inheritdoc cref="Vector128.Log(Vector128{double})" />
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector<double> Log(Vector<double> vector)
    {
        if (IsHardwareAccelerated)
        {
            return VectorMath.LogDouble<Vector<double>, Vector<long>, Vector<ulong>>(vector);
        }
        else
        {
            return Log<double>(vector);
        }
    }
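
On .NET 9, a possible workaround (just a sketch, assuming Vector<double> is 256 bits wide on the machine as it is in the AVX2 runs above; not verified against the original benchmark) is to reinterpret the generic vector as Vector256<double> and call Vector256.Log directly:

using System.Numerics;
using System.Runtime.Intrinsics;

static class VectorLogWorkaround
{
    // Hypothetical helper: bypass Vector.Log(Vector<double>) by reinterpreting the
    // generic vector as Vector256<double>. AsVector256/AsVector are bit
    // reinterpretations, not copies; this only makes sense when the element counts match.
    public static Vector<double> Log(Vector<double> v)
    {
        if (Vector<double>.Count == Vector256<double>.Count)
        {
            return Vector256.Log(v.AsVector256()).AsVector();
        }

        return Vector.Log(v); // otherwise use the regular Vector.Log
    }
}

Whether this actually dodges the slow path on .NET 9 would need to be confirmed with the benchmark above.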

I ran this simplified benchmark with a custom configuration to improve the chances of JIT optimization, but .NET 9 always behaves in the same odd way:

[Config(typeof(Config))]
public class VectorLogBench
{
    double[] data = new double[4096];

    public VectorLogBench()
    {
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i + 1;
        }
    }

    [Benchmark]
    public double[] LogMath()
    {
        var results = new double[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            results[i] = Math.Log(data[i]);
        }

        return results;
    }

    [Benchmark]
    public Vector<double>[] LogV()
    {
        var doubleVectors = MemoryMarshal.Cast<double, Vector<double>>(data);
        var results = new Vector<double>[doubleVectors.Length];
        for (int i = 0; i < doubleVectors.Length; i++)
        {
            results[i] = Vector.Log(doubleVectors[i]);
        }

        return results;
    }

    [Benchmark]
    public Vector256<double>[] Log256()
    {
        var doubleVectors256 = MemoryMarshal.Cast<double, Vector256<double>>(data);
        var results = new Vector256<double>[doubleVectors256.Length];
        for (int i = 0; i < doubleVectors256.Length; i++)
        {
            results[i] = Vector256.Log(doubleVectors256[i]);
        }
        }

        return results;
    }

}

The configuration runs with DOTNET_TieredPGO both enabled and disabled, and with a very long warmup, just in case that triggers dynamic PGO.

class Config : ManualConfig
{
    public Config()
    {
        AddJob(Job.Default.WithId("Non DPGO 9 ")
            .WithBaseline(true)
            .WithRuntime(CoreRuntime.Core90)
            .WithWarmupCount(100)
            .WithEnvironmentVariables(
                new EnvironmentVariable("DOTNET_TieredPGO", "0")));

        AddJob(Job.Default.WithId("DPGO 9 ")
            .WithRuntime(CoreRuntime.Core90)
            .WithWarmupCount(100)
            .WithEnvironmentVariables(
                new EnvironmentVariable("DOTNET_TieredPGO", "1")));

        AddJob(Job.Default.WithId("Non DPGO 10 ")
            .WithRuntime(CoreRuntime.Core10_0)
            .WithWarmupCount(100)
            .WithEnvironmentVariables(
                new EnvironmentVariable("DOTNET_TieredPGO", "0")));

        AddJob(Job.Default.WithId("DPGO 10 ")
            .WithRuntime(CoreRuntime.Core10_0)
            .WithWarmupCount(100)
            .WithEnvironmentVariables(
                new EnvironmentVariable("DOTNET_TieredPGO", "1")));

    }
}

The results show an unexpected, huge delay for Vector.Log on .NET 9:


// * Summary *

BenchmarkDotNet v0.15.2, Windows 11 (10.0.22631.5699/23H2/2023Update/SunValley3)
Intel Core Ultra 9 185H 2.30GHz, 1 CPU, 22 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
  [Host]       : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
  DPGO 10      : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
  DPGO 9       : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2
  Non DPGO 10  : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
  Non DPGO 9   : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2

WarmupCount=100

| Method  | Job          | EnvironmentVariables | Runtime   | Mean      | Error    | StdDev    | Median    |
|-------- |------------- |--------------------- |---------- |----------:|---------:|----------:|----------:|
| LogMath | DPGO 10      | DOTNET_TieredPGO=1   | .NET 10.0 |  23.99 us | 0.769 us |  2.245 us |  23.83 us |
| LogV    | DPGO 10      | DOTNET_TieredPGO=1   | .NET 10.0 |  13.94 us | 0.691 us |  2.005 us |  13.90 us |
| Log256  | DPGO 10      | DOTNET_TieredPGO=1   | .NET 10.0 |  14.24 us | 0.800 us |  2.360 us |  13.98 us |
|         |              |                      |           |           |          |           |           |
| LogMath | Non DPGO 10  | DOTNET_TieredPGO=0   | .NET 10.0 |  21.83 us | 0.949 us |  2.708 us |  21.60 us |
| LogV    | Non DPGO 10  | DOTNET_TieredPGO=0   | .NET 10.0 |  13.37 us | 0.578 us |  1.631 us |  13.31 us |
| Log256  | Non DPGO 10  | DOTNET_TieredPGO=0   | .NET 10.0 |  12.42 us | 0.418 us |  1.212 us |  12.63 us |
|         |              |                      |           |           |          |           |           |
| LogMath | DPGO 9       | DOTNET_TieredPGO=1   | .NET 9.0  |  27.11 us | 1.156 us |  3.299 us |  27.05 us |
| LogV    | DPGO 9       | DOTNET_TieredPGO=1   | .NET 9.0  | 104.75 us | 3.794 us | 10.824 us | 106.54 us |
| Log256  | DPGO 9       | DOTNET_TieredPGO=1   | .NET 9.0  |  32.27 us | 2.485 us |  7.326 us |  28.94 us |
|         |              |                      |           |           |          |           |           |
| LogMath | Non DPGO 9   | DOTNET_TieredPGO=0   | .NET 9.0  |  20.35 us | 0.402 us |  0.523 us |  20.34 us |
| LogV    | Non DPGO 9   | DOTNET_TieredPGO=0   | .NET 9.0  | 116.76 us | 3.714 us | 10.717 us | 116.71 us |
| Log256  | Non DPGO 9   | DOTNET_TieredPGO=0   | .NET 9.0  |  29.23 us | 1.533 us |  4.447 us |  28.64 us |

The IL for the LogV method, as shown by dotPeek, calls the non-generic Vector.Log overload, and yet the results are odd:

  IL_0027: call         instance !0/*valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>*/& valuetype [System.Runtime]System.Span`1<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>>::get_Item(int32)
  IL_002c: ldobj        valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>
  IL_0031: call         valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64> [System.Numerics.Vectors]System.Numerics.Vector::Log(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>)
  IL_0036: stelem       valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>

The entire method's IL is

  .method public hidebysig instance valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>[]
    LogV() cil managed
  {
    .custom instance void [BenchmarkDotNet.Annotations]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor(int32, string)
      = (
      ...
      )
      // int32(34) // 0x00000022

    .param [0]
      .custom instance void [System.Runtime]System.Runtime.CompilerServices.NullableAttribute::.ctor(unsigned int8[])
        = (01 00 02 00 00 00 01 00 00 00 ) // ..........
        // unsigned int8[2]
          /*( unsigned int8(1) // 0x01
          unsigned int8(0) // 0x00
           )*/
    .maxstack 4
    .locals init (
      [0] valuetype [System.Runtime]System.Span`1<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>> doubleVectors,
      [1] valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>[] results,
      [2] int32 i
    )

    // [37 9 - 37 78]
    IL_0000: ldarg.0      // this
    IL_0001: ldfld        float64[] VectorLogBench::data
    IL_0006: call         valuetype [System.Runtime]System.Span`1<!0/*float64*/> valuetype [System.Runtime]System.Span`1<float64>::op_Implicit(!0/*float64*/[])
    IL_000b: call         valuetype [System.Runtime]System.Span`1<!!1/*valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>*/> [System.Runtime]System.Runtime.InteropServices.MemoryMarshal::Cast<float64, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>>(valuetype [System.Runtime]System.Span`1<!!0/*float64*/>)
    IL_0010: stloc.0      // doubleVectors

    // [38 9 - 38 64]
    IL_0011: ldloca.s     doubleVectors
    IL_0013: call         instance int32 valuetype [System.Runtime]System.Span`1<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>>::get_Length()
    IL_0018: newarr       valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>
    IL_001d: stloc.1      // results

    // [39 14 - 39 23]
    IL_001e: ldc.i4.0
    IL_001f: stloc.2      // i

    IL_0020: br.s         IL_003f
    // start of loop, entry point: IL_003f

      // [41 13 - 41 55]
      IL_0022: ldloc.1      // results
      IL_0023: ldloc.2      // i
      IL_0024: ldloca.s     doubleVectors
      IL_0026: ldloc.2      // i
      IL_0027: call         instance !0/*valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>*/& valuetype [System.Runtime]System.Span`1<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>>::get_Item(int32)
      IL_002c: ldobj        valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>
      IL_0031: call         valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64> [System.Numerics.Vectors]System.Numerics.Vector::Log(valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>)
      IL_0036: stelem       valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>

      // [39 51 - 39 54]
      IL_003b: ldloc.2      // i
      IL_003c: ldc.i4.1
      IL_003d: add
      IL_003e: stloc.2      // i

      // [39 25 - 39 49]
      IL_003f: ldloc.2      // i
      IL_0040: ldloca.s     doubleVectors
      IL_0042: call         instance int32 valuetype [System.Runtime]System.Span`1<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<float64>>::get_Length()
      IL_0047: blt.s        IL_0022
    // end of loop

    // [44 9 - 44 24]
    IL_0049: ldloc.1      // results
    IL_004a: ret

  } // end of method VectorLogBench::LogV


The question is both right and wrong. With a cleaned-up benchmark I can't reproduce Math.Log being faster, either in C# or F#, whether .NET 9 or .NET 10 is targeted (there's no Vector.Log in .NET 8). The vector operations are always faster, and when negative numbers are used, Math.Log is significantly slower.

Only in .NET 9 is Vector.Log slower than Vector256.Log, but Math.Log is slower still. The source code for Log<T>(Vector<T> vector) and Log(Vector<double> vector) doesn't seem to have changed between .NET 10 and .NET 9.0.1 though.

The data is created once instead of inside the benchmarks. Math.Log is the slowest by far, Vector runs only slightly slower than Vector256 on my AVX2 CPU (Core Ultra 9 185H - Meteor Lake, so the P-cores are Redwood Cove. E-cores are Crestmont but hopefully weren't used; they'd probably only show half the SIMD speedup for 256-bit vectors.)

.NET handles Vector512 as two 256-bit halves on machines without AVX-512, so we can still run the same benchmark, but it's only using AVX2+FMA.
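
To sanity-check what the runtime reports on a given machine, a quick diagnostic like the following (just a sketch using standard runtime properties, not part of the benchmark) prints the effective vector width and acceleration flags:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// What does the runtime report on this machine?
Console.WriteLine($"Vector<double>.Count            = {Vector<double>.Count}"); // 4 means a 256-bit Vector<T>
Console.WriteLine($"Vector.IsHardwareAccelerated    = {Vector.IsHardwareAccelerated}");
Console.WriteLine($"Vector256.IsHardwareAccelerated = {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512.IsHardwareAccelerated = {Vector512.IsHardwareAccelerated}");
Console.WriteLine($"Avx512F.IsSupported             = {Avx512F.IsSupported}");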

This is the code

using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

var summary = BenchmarkRunner.Run<VectorLogBench>();


public class VectorLogBench
{
    float[] data = new float[4096];

    public VectorLogBench()
    {
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i + 1;
        }
    }

    [Benchmark]
    public double[] LogSingle()
    {
        var results = new double[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            results[i] = Math.Log((double)data[i]);
        }

        return results;
    }

    [Benchmark]
    public Vector<float>[] LogV()
    {
        var floatVectors = MemoryMarshal.Cast<float, Vector<float>>(data);
        var results = new Vector<float>[floatVectors.Length];
        for (int i = 0; i < floatVectors.Length; i++)
        {
            results[i] = Vector.Log(floatVectors[i]);
        }

        return results;
    }

    [Benchmark]
    public Vector128<float>[] Log128()
    {
        var floatVectors128= MemoryMarshal.Cast<float, Vector128<float>>(data);
        var results = new Vector128<float>[floatVectors128.Length];
        for (int i = 0; i < floatVectors128.Length; i++)
        {
            results[i] = Vector128.Log(floatVectors128[i]);
        }

        return results;
    }

    [Benchmark]
    public Vector256<float>[] Log256()
    {
        var floatVectors256 = MemoryMarshal.Cast<float, Vector256<float>>(data);
        var results = new Vector256<float>[floatVectors256.Length];
        for (int i = 0; i < floatVectors256.Length; i++)
        {
            results[i] = Vector256.Log(floatVectors256[i]);
        }
        }

        return results;
    }

    [Benchmark]
    public Vector512<float>[] Log512()
    {
        var floatVectors512 = MemoryMarshal.Cast<float, Vector512<float>>(data);
        var results = new Vector512<float>[floatVectors512.Length];
        for (int i = 0; i < floatVectors512.Length; i++)
        {
            results[i] = Vector512.Log(floatVectors512[i]);
        }

        return results;
    }

}

And these are the results:

// Benchmark Process Environment Information:
// BenchmarkDotNet v0.15.2
// Runtime=.NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni,SERIALIZE VectorSize=256
// Job: DefaultJob


// * Summary *

BenchmarkDotNet v0.15.2, Windows 11 (10.0.22631.5699/23H2/2023Update/SunValley3)
Intel Core Ultra 9 185H 2.30GHz, 1 CPU, 22 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
  [Host]     : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2


| Method    | Mean      | Error     | StdDev    | Median    |
|---------- |----------:|----------:|----------:|----------:|
| LogSingle | 57.927 us | 2.9520 us | 8.7039 us | 55.213 us |
| LogV      |  4.305 us | 0.1266 us | 0.3652 us |  4.290 us |
| Log128    |  6.418 us | 0.3337 us | 0.9191 us |  6.295 us |
| Log256    |  4.109 us | 0.1700 us | 0.4877 us |  4.122 us |
| Log512    |  4.124 us | 0.1462 us | 0.4310 us |  4.102 us |

One could say this is unfair because Math.Log uses double, but the original code did that too; there's no Math.Log(Single) overload. The claim was that Math.Log was faster than the other options, which it definitely is not.
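
For a scalar float baseline without the cast, MathF.Log could be used; a sketch of such an extra benchmark method for the class above (not one of the measured methods) would be:

    // Hypothetical extra baseline: scalar float logarithm via MathF.Log,
    // avoiding the float -> double conversion done in LogSingle.
    [Benchmark]
    public float[] LogMathF()
    {
        var results = new float[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            results[i] = MathF.Log(data[i]);
        }

        return results;
    }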

Yes, it's a laptop; yes, thermal throttling; yes, a heatwave. But the SIMD benchmarks run after Math.Log and should be affected more by throttling.


Replacing float with double produces these less extreme results:

| Method    | Mean     | Error    | StdDev   | Median   |
|---------- |---------:|---------:|---------:|---------:|
| LogSingle | 24.00 us | 1.071 us | 2.967 us | 23.51 us |
| LogV      | 13.60 us | 0.800 us | 2.358 us | 12.99 us |
| Log128    | 20.80 us | 1.108 us | 3.233 us | 21.27 us |
| Log256    | 11.99 us | 0.447 us | 1.216 us | 11.86 us |
| Log512    | 12.44 us | 0.511 us | 1.449 us | 12.84 us |

The cast was clearly painful but the SIMD operations are still faster.

double being 8 bytes means half as many elements per SIMD vector, so we already expect the speedup ratio vs. scalar machine code to be half as large as with float.

Also, getting 53 bits of precision for double takes a larger polynomial than getting 24 bits for float. This would be a wash for scalar vs. SIMD if both were computing at the same precision, but that wasn't the case: we used Math.Log((double)data[i]) for scalar, while SIMD used a native float implementation that does less work. (Scalar actually gets cheaper with a double array since no conversion is needed, but it shouldn't be twice as fast unless it really was using a 32-bit float log despite the cast.)

In the dotnet runtime source code, LogSingle(TVectorSingle x) uses a 10th-order polynomial. The previous function in VectorMath.cs is LogDouble, which uses a 20th-order polynomial, so approximately twice as much work, on top of the fixed overhead of extracting the exponent and mantissa and putting things back together.

These size-agnostic functions are used by Vector128, Vector256, and Vector512, and by the generic Vector.

Anyway, twice as many elements per SIMD vector, plus less work to get full precision for fewer bits per element, pretty much explains the roughly 4x speedup of float over double for vectorized Log.


I repeated the same benchmark in F# and there are no surprises. This code:

open BenchmarkDotNet.Running
open System.Numerics
open BenchmarkDotNet.Attributes
open System.Runtime.InteropServices
open System
open System.Runtime.Intrinsics

type VectorLogBench() =
    let data: float array =  [| for i in 1 .. 4096 -> i |]

    [<Benchmark>]
    member this.MathLog () =
        let results:float array = Array.zeroCreate 4096
        for i=0 to data.Length-1 do
            results[i] <- Math.Log(data[i])
        results

    [<Benchmark>]
    member this.LogV () =
        let doubleVectors = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(data))
        let results:Vector<float> array = Array.zeroCreate doubleVectors.Length

        for i=0 to doubleVectors.Length-1 do
            results[i] <- Vector.Log(doubleVectors[i])
        results

    [<Benchmark>]
    member this.Log256 () =
        let doubleVectors = MemoryMarshal.Cast<float, Vector256<float>>(ReadOnlySpan(data))
        let results:Vector256<float> array = Array.zeroCreate doubleVectors.Length

        for i=0 to doubleVectors.Length-1 do
            results[i] <- Vector256.Log(doubleVectors[i])
        results

let summary = BenchmarkRunner.Run<VectorLogBench>()

Produces

| Method  | Mean     | Error    | StdDev   |
|-------- |---------:|---------:|---------:|
| MathLog | 22.28 us | 1.069 us | 3.068 us |
| LogV    | 11.29 us | 0.359 us | 1.031 us |
| Log256  | 11.38 us | 0.537 us | 1.524 us |

Using only negative numbers produced a surprise: Math.Log becomes significantly worse. So bad that I ran the benchmark twice to be sure.

This change

let data: float array =  [| for i in 1 .. 4096 -> -i |]

Produces:


| Method  | Mean      | Error    | StdDev    | Median    |
|-------- |----------:|---------:|----------:|----------:|
| MathLog | 195.41 us | 7.185 us | 20.030 us | 190.00 us |
| LogV    |  16.52 us | 0.623 us |  1.798 us |  16.25 us |
| Log256  |  15.89 us | 0.635 us |  1.863 us |  15.26 us |

Ensuring the data has a decimal part doesn't change the picture:

let data: float array =  [| for i in 1 .. 4096 -> -(float i)/0.33 |]

Gives

| Method  | Mean      | Error    | StdDev    | Median    |
|-------- |----------:|---------:|----------:|----------:|
| MathLog | 196.89 us | 5.029 us | 14.511 us | 194.73 us |
| LogV    |  16.80 us | 0.898 us |  2.548 us |  16.05 us |
| Log256  |  16.82 us | 0.864 us |  2.351 us |  16.06 us |

Targeting .NET 9

Finally, I changed the target to .NET 9.0 and got a slower Vector.Log than Vector256.Log (the Vector.Log method was added in .NET 9). Math.Log is still slower than the vector methods.

BenchmarkDotNet v0.15.2, Windows 11 (10.0.22631.5699/23H2/2023Update/SunValley3)
Intel Core Ultra 9 185H 2.30GHz, 1 CPU, 22 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
  [Host]     : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2 DEBUG
  DefaultJob : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2

| Method  | Mean      | Error    | StdDev    |
|-------- |----------:|---------:|----------:|
| MathLog | 204.74 us | 6.825 us | 19.691 us |
| LogV    | 127.99 us | 3.360 us |  9.800 us |
| Log256  |  26.25 us | 0.918 us |  2.620 us |

Comments

  • In F#, float is an alias for double. Also, the data is not created inside the benchmarks. I think you are maybe confusing this.Inputs with this.Input in my code?
  • Then improve the code, but no repro is still no repro, whether double or float. You claimed Vector is slower; it's not. I've used F# in production a lot and I have trouble understanding what your code does, especially the pointless Math.Log calls in the vector benchmarks.
  • The Math.Log calls are for the elements that don't fit inside the vectors, though I'll admit that they're redundant for the purpose of the benchmark. Given that Vector is architecture specific, perhaps it is something to do with the processor that I'm using.
  • @user31260114: Have you tried simplifying your benchmark code to use an actual array, not yield, as input? Just in case it doesn't optimize the way you think it does, with the array generated once. Also, does r.NextBytes generate fully random bit patterns for your doubles? Then half your inputs will be negative; if the scalar or SIMD Log branches on that to handle the invalid input, that could be making one slower than it would normally be. Also some of your inputs will be NaN. (A tiny fraction will be +-0 or +-Infinity, only 4 bit patterns in 2^64, so not significant.)
  • The benchmark code does use an actual array; the type of this.Input is double[]. Lesson learnt that I should have just pasted a benchmark with one case and not used the word float on Stack Overflow, haha. As for the NaNs/0, I didn't check the generated numbers, but it seems pretty unlikely, and not really relevant for the comparison of Vector<T>/Vector256 anyway.
  • @PanagiotisKanavos: The Math.Log calls in the vector functions are scalar cleanup for the length % 4 or however many elements are left over after a whole number of vectors. (For non-overlapping input and output, an alternate strategy would be an unaligned load that ends at the last element of the input, redoing 0 to 3 elements, but that only works if the total size is greater than the vector width. See the sketch after these comments.)
  • That's not cleanup, that's measurable work that invalidates the results. As for "whatever elements are left over": don't use leftovers. You're measuring Vector.Log performance, not RAM access, overlaps or whatever else, using extremely fast operations. Any side effects will be visible, so make sure there are none, like loading a big array from RAM into the CPU.
  • It doesn't invalidate it if I'm comparing to Math.Log on a number of elements that doesn't divide evenly by the vector width.
  • Yes it does. You claimed that Vector.Log is slower than Math.Log. It's not. And even if you wanted to benchmark leftovers you shouldn't use Math.Log, but fill the vector with dummy values.
  • Math.Log isn't even called for 1000/100000 elements, as both are evenly divisible by 8. Yes, it could be clearer, but it's not affecting the results, so it's a moot point.
  • @user31260114: Agreed that 1 in 2^52 of the inputs being NaN isn't relevant, but half your inputs are negative. That's a special case for Log: the result has to be NaN instead of being computed from the mantissa and exponent of the input. It could branch on vtestps to check the sign bits and do some extra work if any of them are negative, to optimize for the fast path of all inputs being finite and positive. Or it could branchlessly always detect that and blend, in which case some elements of a SIMD vector being negative would make no difference.
  • That. The test data must be identical, otherwise the timings can't be compared.
  • I rewrote the tests in F#, and used negative numbers too. Negative values make Math.Log a lot worse.
  • I GOT A REPRO AFTER ALL! And it's not about SIMD. In C#/F#, with positive data and .NET 9, Vector.Log behaves as if the unaccelerated Vector<T> Log were called, with extra overhead, even though C# and dotPeek say that Vector.Log is called in the Release binary. This smells like JIT optimization and inlining changes.
  • You're right! I ran using the .NET 10 preview, and the issue disappeared.
  • This is just crazy; even the IL says the accelerated overload is called. I'm trying to run tests with and without PGO right now, but it still looks like the wrong Log is called.
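
The overlapping-final-vector strategy mentioned in the comments above, as a sketch (assumptions: non-overlapping input and output, .NET 9+ for Vector256.Log, and a hypothetical helper name; not from the original code):

using System;
using System.Runtime.Intrinsics;

static class LogRemainderSketch
{
    // Vectorize the bulk, then handle the tail with one unaligned vector that ends at the
    // last element, redoing up to Count-1 elements. Only valid when input and output do
    // not overlap; inputs shorter than one vector fall back to scalar Math.Log.
    public static void LogAll(ReadOnlySpan<double> input, Span<double> output)
    {
        int width = Vector256<double>.Count;

        if (input.Length < width)
        {
            for (int j = 0; j < input.Length; j++)
                output[j] = Math.Log(input[j]);
            return;
        }

        int i = 0;
        for (; i <= input.Length - width; i += width)
        {
            Vector256.Log(Vector256.Create(input.Slice(i, width))).CopyTo(output.Slice(i, width));
        }

        if (i < input.Length)
        {
            // Final overlapping vector: starts width elements before the end of the data.
            int last = input.Length - width;
            Vector256.Log(Vector256.Create(input.Slice(last, width))).CopyTo(output.Slice(last, width));
        }
    }
}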
