The question is both right and wrong. With a cleaned-up benchmark on the .NET 10 SDK I can't reproduce a faster Math.Log, in either C# or F#, whether .NET 9 or .NET 10 is targeted. There's no Vector.Log in .NET 8. In my runs the vector operations are always faster, and when negative numbers are used Math.Log is significantly slower.
Only in .NET 9 is Vector.Log slower than Vector256.Log, but Math.Log is slower still. The source code for Log<T>(Vector<T> vector) and Log(Vector<double> vector) doesn't seem to have changed between .NET 10 and .NET 9.0.1 though.
The data is created once instead of inside the benchmarks. Math.Log is the slowest by far, and Vector runs only slightly slower than Vector256 on my AVX2 CPU (Core Ultra 9 185H, Meteor Lake, so the P-cores are Redwood Cove. The E-cores are Crestmont but hopefully weren't used; they'd probably show only half the SIMD speedup for 256-bit vectors).
.NET handles Vector512 as two 256-bit halves on machines without AVX-512, so we can still run the same benchmark, but it's only using AVX2+FMA.
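Which vector widths are actually backed by hardware can be checked at runtime. This is a minimal sketch using only the IsHardwareAccelerated and Count properties from the BCL; the variable name vectorBits is mine. On a machine like the one described here, I would expect Vector512 to report as not accelerated and Vector<T> to be 256 bits wide, matching the VectorSize=256 line in the benchmark output, but I haven't added that as a claimed output.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Report which fixed-size vector types the JIT maps directly to hardware.
Console.WriteLine($"Vector128 accelerated: {Vector128.IsHardwareAccelerated}");
Console.WriteLine($"Vector256 accelerated: {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");

// Vector<T> has a machine-dependent width; Vector<byte>.Count is its size in bytes.
int vectorBits = Vector<byte>.Count * 8;
Console.WriteLine($"Vector<T> width: {vectorBits} bits, {Vector<float>.Count} floats per vector");
```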
This is the code:
using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

var summary = BenchmarkRunner.Run<VectorLogBench>();

public class VectorLogBench
{
    // Input is created once, outside the timed methods.
    float[] data = new float[4096];

    public VectorLogBench()
    {
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i + 1;
        }
    }

    // Scalar baseline: Math.Log only takes double, hence the cast.
    [Benchmark]
    public double[] LogSingle()
    {
        var results = new double[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            results[i] = Math.Log((double)data[i]);
        }
        return results;
    }

    // Generic Vector<T>, whose width depends on the machine.
    [Benchmark]
    public Vector<float>[] LogV()
    {
        var floatVectors = MemoryMarshal.Cast<float, Vector<float>>(data);
        var results = new Vector<float>[floatVectors.Length];
        for (int i = 0; i < floatVectors.Length; i++)
        {
            results[i] = Vector.Log(floatVectors[i]);
        }
        return results;
    }

    [Benchmark]
    public Vector128<float>[] Log128()
    {
        var floatVectors128 = MemoryMarshal.Cast<float, Vector128<float>>(data);
        var results = new Vector128<float>[floatVectors128.Length];
        for (int i = 0; i < floatVectors128.Length; i++)
        {
            results[i] = Vector128.Log(floatVectors128[i]);
        }
        return results;
    }

    [Benchmark]
    public Vector256<float>[] Log256()
    {
        var floatVectors256 = MemoryMarshal.Cast<float, Vector256<float>>(data);
        var results = new Vector256<float>[floatVectors256.Length];
        for (int i = 0; i < floatVectors256.Length; i++)
        {
            results[i] = Vector256.Log(floatVectors256[i]);
        }
        return results;
    }

    [Benchmark]
    public Vector512<float>[] Log512()
    {
        var floatVectors512 = MemoryMarshal.Cast<float, Vector512<float>>(data);
        var results = new Vector512<float>[floatVectors512.Length];
        for (int i = 0; i < floatVectors512.Length; i++)
        {
            results[i] = Vector512.Log(floatVectors512[i]);
        }
        return results;
    }
}
And these are the results:
// Benchmark Process Environment Information:
// BenchmarkDotNet v0.15.2
// Runtime=.NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni,SERIALIZE VectorSize=256
// Job: DefaultJob
// * Summary *
BenchmarkDotNet v0.15.2, Windows 11 (10.0.22631.5699/23H2/2023Update/SunValley3)
Intel Core Ultra 9 185H 2.30GHz, 1 CPU, 22 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
[Host] : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
DefaultJob : .NET 10.0.0 (10.0.25.35903), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Median |
|---------- |----------:|----------:|----------:|----------:|
| LogSingle | 57.927 us | 2.9520 us | 8.7039 us | 55.213 us |
| LogV | 4.305 us | 0.1266 us | 0.3652 us | 4.290 us |
| Log128 | 6.418 us | 0.3337 us | 0.9191 us | 6.295 us |
| Log256 | 4.109 us | 0.1700 us | 0.4877 us | 4.122 us |
| Log512 | 4.124 us | 0.1462 us | 0.4310 us | 4.102 us |
One could say this is unfair because Math.Log uses double, but it did so in the original code too. There's no Math.Log(Single) overload. The claim was that Math.Log was faster than the other options, which it definitely is not.
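For anyone who wants a single-precision scalar baseline anyway, MathF.Log takes a float directly. The sketch below is not part of the benchmark and I haven't timed it; the helper ScalarLogSingle and the variable names are mine. It only checks that the MathF.Log results agree with Math.Log((double)x) to within float rounding on the same 1..4096 inputs.

```csharp
using System;

// Hypothetical helper (not in the benchmark): scalar log in single precision.
static float[] ScalarLogSingle(float[] input)
{
    var results = new float[input.Length];
    for (int i = 0; i < input.Length; i++)
    {
        results[i] = MathF.Log(input[i]);
    }
    return results;
}

// Same inputs as the benchmark: 1, 2, ..., 4096.
float[] data = new float[4096];
for (int i = 0; i < data.Length; i++)
{
    data[i] = i + 1;
}

float[] viaMathF = ScalarLogSingle(data);

// Largest absolute difference between the float and double scalar paths.
double maxDiff = 0;
for (int i = 0; i < data.Length; i++)
{
    maxDiff = Math.Max(maxDiff, Math.Abs(viaMathF[i] - Math.Log((double)data[i])));
}
Console.WriteLine($"max |MathF.Log - Math.Log| = {maxDiff:E2}");
```

Adding a method like this to the benchmark class would show how much of the scalar cost comes from working in double rather than from being scalar.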
Yes, it's a laptop, yes, thermal throttling, yes, heatwave, but the SIMD benchmarks run after Math.Log and, if anything, should be affected more by throttling.
Replacing float with double produces these less extreme results:
| Method | Mean | Error | StdDev | Median |
|---------- |---------:|---------:|---------:|---------:|
| LogSingle | 24.00 us | 1.071 us | 2.967 us | 23.51 us |
| LogV | 13.60 us | 0.800 us | 2.358 us | 12.99 us |
| Log128 | 20.80 us | 1.108 us | 3.233 us | 21.27 us |
| Log256 | 11.99 us | 0.447 us | 1.216 us | 11.86 us |
| Log512 | 12.44 us | 0.511 us | 1.449 us | 12.84 us |
The cast was clearly painful but the SIMD operations are still faster.
Being 8 bytes, double has half as many elements per SIMD vector, so we already expect the speedup ratio vs. scalar machine code to be half as large as with float.
Also, getting 53 bits of precision for double takes a larger polynomial than getting 24 bits for float. This would be a wash for scalar vs. SIMD if both were computing at the same precision, but that wasn't the case: scalar used Math.Log((double)data[i]), while SIMD used a native float implementation that does less work. (Scalar actually gets cheaper with a double array since no conversion is needed, but it shouldn't be twice as fast unless it actually was using a 32-bit float Log despite the cast.)
(In the dotnet runtime source code, LogSingle(TVectorSingle x) uses a 10th-order polynomial. The previous function in VectorMath.cs is LogDouble, which uses a 20th-order polynomial, so approximately twice as much work, on top of the fixed overhead of getting the exponent and mantissa and putting things back together.
These size-agnostic functions are used by Vector128, Vector256, and Vector512, and by the generic Vector.)
Anyway, 2x as many elements per SIMD vector, plus less work to reach full precision with fewer bits per element, pretty much explains the roughly 3x to 4x speedup of float over double for vectorized Log (about 4.1 us vs. 12.0 us for Log256 in the tables above).
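The lane arithmetic behind that estimate can be checked directly. This sketch uses only the Count properties of Vector256<T> and the Log256 means from the float and double tables above; the variable names are mine.

```csharp
using System;
using System.Runtime.Intrinsics;

// Elements per 256-bit vector: 256 / 32 for float, 256 / 64 for double.
int floatLanes = Vector256<float>.Count;
int doubleLanes = Vector256<double>.Count;
Console.WriteLine($"float lanes: {floatLanes}, double lanes: {doubleLanes}");

// Log256 means from the tables above (float run, then double run), in microseconds.
double log256FloatUs = 4.109;
double log256DoubleUs = 11.99;
double measuredRatio = log256DoubleUs / log256FloatUs;
Console.WriteLine($"lane ratio: {floatLanes / doubleLanes}x, measured float vs. double: {measuredRatio:F2}x");
```

The lane count alone accounts for a factor of 2; the rest of the measured gap is what the smaller polynomial would have to explain.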
I repeated the same benchmark in F# and there are no surprises. This code:
open BenchmarkDotNet.Running
open System.Numerics
open BenchmarkDotNet.Attributes
open System.Runtime.InteropServices
open System
open System.Runtime.Intrinsics

type VectorLogBench() =
    // F# float is a 64-bit double, so this is the double version of the benchmark.
    let data: float array = [| for i in 1 .. 4096 -> i |]

    [<Benchmark>]
    member this.MathLog () =
        let results: float array = Array.zeroCreate 4096
        for i = 0 to data.Length - 1 do
            results[i] <- Math.Log(data[i])
        results

    [<Benchmark>]
    member this.LogV () =
        let doubleVectors = MemoryMarshal.Cast<float, Vector<float>>(ReadOnlySpan(data))
        let results: Vector<float> array = Array.zeroCreate doubleVectors.Length
        for i = 0 to doubleVectors.Length - 1 do
            results[i] <- Vector.Log(doubleVectors[i])
        results

    [<Benchmark>]
    member this.Log256 () =
        let doubleVectors = MemoryMarshal.Cast<float, Vector256<float>>(ReadOnlySpan(data))
        let results: Vector256<float> array = Array.zeroCreate doubleVectors.Length
        for i = 0 to doubleVectors.Length - 1 do
            results[i] <- Vector256.Log(doubleVectors[i])
        results

let summary = BenchmarkRunner.Run<VectorLogBench>()
produces:
| Method | Mean | Error | StdDev |
|-------- |---------:|---------:|---------:|
| MathLog | 22.28 us | 1.069 us | 3.068 us |
| LogV | 11.29 us | 0.359 us | 1.031 us |
| Log256 | 11.38 us | 0.537 us | 1.524 us |
Using only negative numbers is a surprise, because Math.Log gets significantly worse. So bad that I ran the benchmark twice to be sure.
This change:
let data: float array = [| for i in 1 .. 4096 -> -i |]
produces:
| Method | Mean | Error | StdDev | Median |
|-------- |----------:|---------:|----------:|----------:|
| MathLog | 195.41 us | 7.185 us | 20.030 us | 190.00 us |
| LogV | 16.52 us | 0.623 us | 1.798 us | 16.25 us |
| Log256 | 15.89 us | 0.635 us | 1.863 us | 15.26 us |
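One thing to keep in mind when reading these numbers: the logarithm of a negative number is undefined, and Math.Log returns NaN for it, so the negative-input runs time the invalid-input path rather than the usual computation. A minimal check, not taken from the benchmark itself (the variable names are mine, and I print the vector result rather than asserting what it contains):

```csharp
using System;
using System.Runtime.Intrinsics;

// Scalar path: Math.Log returns NaN for negative input.
double scalar = Math.Log(-1.0);
Console.WriteLine($"Math.Log(-1.0) is NaN: {double.IsNaN(scalar)}");

// Vector path (Vector256.Log needs .NET 9 or later): print the lanes for comparison.
Vector256<double> vec = Vector256.Log(Vector256.Create(-1.0));
Console.WriteLine($"Vector256.Log(-1.0) lanes: {vec}");
```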
Ensuring the data has a fractional part doesn't change the picture:
let data: float array = [| for i in 1 .. 4096 -> -(float i)/0.33 |]
gives:
| Method | Mean | Error | StdDev | Median |
|-------- |----------:|---------:|----------:|----------:|
| MathLog | 196.89 us | 5.029 us | 14.511 us | 194.73 us |
| LogV | 16.80 us | 0.898 us | 2.548 us | 16.05 us |
| Log256 | 16.82 us | 0.864 us | 2.351 us | 16.06 us |
Targeting .NET 9
Finally, I changed the target to .NET 9.0 and got a Vector.Log that is slower than Vector256.Log. The Vector.Log method was added in .NET 9. Math.Log is still slower than the vector methods:
BenchmarkDotNet v0.15.2, Windows 11 (10.0.22631.5699/23H2/2023Update/SunValley3)
Intel Core Ultra 9 185H 2.30GHz, 1 CPU, 22 logical and 16 physical cores
.NET SDK 10.0.100-preview.6.25358.103
[Host] : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2 DEBUG
DefaultJob : .NET 9.0.8 (9.0.825.36511), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|-------- |----------:|---------:|----------:|
| MathLog | 204.74 us | 6.825 us | 19.691 us |
| LogV | 127.99 us | 3.360 us | 9.800 us |
| Log256 | 26.25 us | 0.918 us | 2.620 us |
... internally make calls to one of the non-generic implementations
... that's a cost. The operations you use are so fast that the difference between RAM and cache access or the need to load registers from RAM has a serious impact. The benchmarks measure allocations too, not just the Vector.Log performance
... seq etc. to generate the inputs within the timed code is going to have an impact. Better to pre-generate arrays to avoid this.
Inputs isn't an array, it's essentially a LINQ query. All benchmarks loop over that seq at the end, so what's actually being measured here? I'd copy the code to clean up and test, but it's unclear what's going on right now
... Inputs, the code contains both Inputs which produces the test data and Input which is the actual array. I don't guess about seq and the seq {} block, I've used F# in production for ETL. Never used float though, always decimal. BenchmarkDotNet wouldn't include the data generation time in the timings.
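The point about allocations being measured could be addressed by writing into a buffer that is allocated once, outside the timed method. This is only a sketch under that assumption: LogInto is a hypothetical helper of mine, not something from the benchmarks above, and I haven't benchmarked it. It needs .NET 9 or later for Vector.Log.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical helper: writes log(x) for every input element into a
// caller-supplied buffer, so the hot loop itself allocates nothing.
static void LogInto(ReadOnlySpan<float> source, Span<float> destination)
{
    if (destination.Length < source.Length)
        throw new ArgumentException("destination is too small", nameof(destination));

    // Reinterpret both buffers as whole Vector<float> blocks. Trailing elements
    // that do not fill a complete vector are not covered by the cast.
    ReadOnlySpan<Vector<float>> src = MemoryMarshal.Cast<float, Vector<float>>(source);
    Span<Vector<float>> dst = MemoryMarshal.Cast<float, Vector<float>>(destination);

    for (int i = 0; i < src.Length; i++)
    {
        dst[i] = Vector.Log(src[i]);
    }

    // Scalar tail for any leftover elements.
    for (int i = src.Length * Vector<float>.Count; i < source.Length; i++)
    {
        destination[i] = MathF.Log(source[i]);
    }
}

// Usage: both arrays are created once, up front.
float[] data = new float[4096];
for (int i = 0; i < data.Length; i++)
{
    data[i] = i + 1;
}
float[] results = new float[data.Length];
LogInto(data, results);
Console.WriteLine($"log({data[9]}) = {results[9]}");
```

In a BenchmarkDotNet class, the two arrays would be fields and the benchmark method would call only LogInto, so the reported time excludes allocating the results array.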