I have a buffer of bytes and want to multiply each byte by another byte, such as 0x20. One way is to simply iterate over the buffer and multiply each byte. That seems suboptimal, since SIMD should be able to do this much faster. But in my measurements, using SIMD in Swift is much slower.

On a MacBook Pro M1 Max:
SIMD: 180ms for 100k iterations (operating on 64 bytes at a time)
Loop: 35ms for 6.4M iterations (operating on a single byte at a time)

Here is the code:

import Foundation

var iteration = 0 // outer-loop counter
Data(repeating: 0x20, count: 6_400_000).withUnsafeBytes { bufferPointer in
    // 100K iterations of the outer loop
    // Empty while loop takes about 2ms
    while(iteration < 6_400_000 / SIMD64<UInt8>.scalarCount) {
        let assumed = bufferPointer.assumingMemoryBound(to: SIMD64<UInt8>.self)
        let batch = assumed[0] // Will use the same batch all the time for testing purposes

        // This takes 180ms for 100k iterations (6_400_000 bytes / 64 bytes size of the simd)
        let spaceMask = batch &* 0x20
        /*
         Looking to do all these operations much faster, they are all slow
           let spaceMask = batch .== 0x20
           let result = batch &* 0x20
           let tabMask = batch .== 0x09
           let combinedMask = (spaceMask .| tabMask)._storage
       */
        
        // Using this loop, it takes 35ms total, running 6.4 million iterations in total
        var i = 0
        while(i < 64) {
            let batchNumber = batch[i] &* 0x20
            i += 1
        }

        iteration += 1

    }
}

I would expect the SIMD version to be at least 10x faster than the while loop; instead it is about 5 times slower.

  • These measurements are in release mode, right? Commented Dec 1, 2023 at 3:59
  • @Alexander yes, tried multiple optimisations, same results. Commented Dec 1, 2023 at 11:44
  • So interestingly, the SIMD types in Swift don't actually have special semantics to force them to lower into SIMD operations. Instead, they're implemented with loops in a particular way that causes LLVM to recognize them and auto-vectorize them (in a way that's suitable for the target platform). Something here might be preventing that optimization. In fact, it's quite likely that your while loop is being auto-vectorized. Could you compare the output assembly in both cases? E.g. on godbolt.org Commented Dec 1, 2023 at 14:54
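
To follow up on the suggestion of comparing the two code paths, here is a minimal sketch of a scalar loop and an explicit SIMD loop that both write their results back into the buffer, so the optimizer cannot simply discard the multiplication. The function names `multiplyScalar` and `multiplySIMD` are hypothetical and not from the question, the 16-byte vector width is an arbitrary choice for illustration, and the code assumes Swift 5.7 or later for `loadUnaligned`. It makes no claim about which version is faster; that still has to be measured in a release build or checked in the generated assembly.

```swift
// Hypothetical helpers for illustration; names are not from the original post.

/// Scalar version: multiplies every byte by `factor` with wrapping arithmetic.
/// A simple loop like this is a candidate for auto-vectorization in release builds.
func multiplyScalar(_ bytes: inout [UInt8], by factor: UInt8) {
    for i in bytes.indices {
        bytes[i] = bytes[i] &* factor
    }
}

/// Explicit SIMD version: processes 16 bytes per step, then a scalar tail.
/// Assumes Swift 5.7+ (unaligned loads and stores on raw buffers).
func multiplySIMD(_ bytes: inout [UInt8], by factor: UInt8) {
    let width = SIMD16<UInt8>.scalarCount
    let factorVector = SIMD16<UInt8>(repeating: factor)
    bytes.withUnsafeMutableBytes { raw in
        var offset = 0
        // Full 16-byte chunks.
        while offset + width <= raw.count {
            let chunk = raw.loadUnaligned(fromByteOffset: offset, as: SIMD16<UInt8>.self)
            raw.storeBytes(of: chunk &* factorVector, toByteOffset: offset, as: SIMD16<UInt8>.self)
            offset += width
        }
        // Remaining bytes that do not fill a whole vector.
        while offset < raw.count {
            raw[offset] = raw[offset] &* factor
            offset += 1
        }
    }
}

var a = [UInt8](repeating: 0x03, count: 100)
var b = a
multiplyScalar(&a, by: 0x20)
multiplySIMD(&b, by: 0x20)
// Both paths should produce identical bytes (0x03 &* 0x20 == 0x60).
precondition(a == b)
```

Writing the results back (and checking them) keeps both loops observable, which makes a timing or assembly comparison between the two more meaningful than one where the computed values are never used.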
