I have a buffer of bytes and want to multiply each byte by another byte, such as 0x20. One way is to iterate over the buffer and multiply each byte individually. That seems obviously suboptimal, since SIMD should be able to do this much faster. In my Swift test, however, the SIMD version is much slower.
On a MacBook Pro M1 Max:
SIMD: 180ms for 100k iterations (operating on 64 bytes at a time)
Loop: 35ms for 6.4M iterations (operating on a single byte at a time)
Here is the code:
import Foundation

var iteration = 0 // Outer-loop counter
let inBytes = Data(repeating: 0x20, count: 6400000).withUnsafeBytes { bufferPointer in
// 100K iterations of the outer loop
// Empty while loop takes about 2ms
while(iteration < 6_400_000 / SIMD64<UInt8>.scalarCount) {
let assumed = bufferPointer.assumingMemoryBound(to: SIMD64<UInt8>.self)
let batch = assumed[0] // Will use the same batch all the time for testing purposes
// This takes 180ms for 100k iterations (6_400_000 bytes / 64 bytes size of the simd)
let spaceMask = batch &* 0x20
/*
Looking to do all these operations much faster, they are all slow
let spaceMask = batch .== 0x20
let result = batch &* 0x20
let tabMask = batch .== 0x09
let combinedMask = (spaceMask .| tabMask)._storage
*/
// Using this loop, it takes 35ms total, running 6.4 million iterations in total
var i = 0
while(i < 64) {
let batchNumber = batch[i] &* 0x20
i += 1
}
iteration += 1
}
}
I would expect the SIMD version to be at least 10x faster than the while loop; instead it is about 5 times slower.
The while loop is being auto-vectorized. Could you compare the generated assembly for both cases, e.g. on godbolt.org?
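Besides comparing the assembly, it may help to make the two measurements do comparable work. The sketch below is one possible way to do that, not a definitive benchmark: both versions walk the whole buffer and fold their results into a checksum, so the optimizer cannot simply drop a computation whose result is never used. The names `scalarChecksum` and `simdChecksum` are hypothetical helpers, not from the question. It assumes Swift 5.7 or later (for `loadUnaligned`) and an optimized build (`-O`); without optimization, the SIMD operators are likely to be far slower than they are in a release build.

```swift
import Foundation

// Hypothetical helper: multiply every byte by `factor` (wrapping) and
// fold the products into a wrapping 8-bit checksum.
func scalarChecksum(_ buffer: UnsafeRawBufferPointer, factor: UInt8) -> UInt8 {
    var sum: UInt8 = 0
    for byte in buffer {
        sum = sum &+ (byte &* factor)
    }
    return sum
}

// Hypothetical helper: same computation, 64 bytes per step.
func simdChecksum(_ buffer: UnsafeRawBufferPointer, factor: UInt8) -> UInt8 {
    let width = SIMD64<UInt8>.scalarCount
    var acc = SIMD64<UInt8>(repeating: 0)
    var offset = 0
    while offset + width <= buffer.count {
        // Unaligned load: the buffer is not guaranteed to satisfy the
        // alignment of SIMD64<UInt8>.
        let batch = buffer.loadUnaligned(fromByteOffset: offset, as: SIMD64<UInt8>.self)
        acc = acc &+ (batch &* factor)
        offset += width
    }
    // Horizontal wrapping sum of the 64 lanes, then the scalar tail.
    var sum = acc.wrappedSum()
    while offset < buffer.count {
        sum = sum &+ (buffer[offset] &* factor)
        offset += 1
    }
    return sum
}

let data = Data(repeating: 0x20, count: 6_400_000)
data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
    var start = Date()
    let scalar = scalarChecksum(buffer, factor: 0x20)
    print("scalar:", scalar, Date().timeIntervalSince(start), "s")

    start = Date()
    let simd = simdChecksum(buffer, factor: 0x20)
    print("simd:  ", simd, Date().timeIntervalSince(start), "s")
}
```

Printing the checksums keeps both results live. If the scalar version is still about as fast as the SIMD one in an optimized build, that would be consistent with the compiler auto-vectorizing the plain loop.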