SSE loop over array gets the wrong value (dot product of two arrays of doubles)

Asked 2 years, 8 months ago

Modified 2 years, 8 months ago

Viewed 71 times

I have a problem with my assembly code: I need to multiply two arrays element by element, add up the products, and take the square root of the sum. I've written the code and it looks like it works fine, but I expect to get 9.16 and I'm getting 9.0 instead.

I guess the problem is somewhere in the loop or in addpd, but I don't know how to fix it.

include /masm64/include64/masm64rt.inc
INCLUDELIB MSVCRT
option casemap:none 

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
result dq  0.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7

.code
entry_point proc

    ; Load the two arrays into SSE registers
    movupd xmm1, [array1]
    movupd xmm2, [array2]
    mov rcx, loop_count ; Number of function iterations
    loop1:
    mulpd xmm1, xmm2
    addpd xmm3, xmm1
    movupd xmm1, [array1 + 8]
    movupd xmm2, [array2 + 8]
    loop loop1

    ; Add the result and store to xmm1
    addpd xmm1, xmm3

    ; Compute the square root of the sum of squares in xmm1
    sqrtpd xmm1, xmm1

    ; Move the result into a general-purpose register for output
    movsd res, xmm1

    invoke fptoa, res, bufp
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end

I've also tried multiplying the two arrays without the loop, using just mulpd, but I guess that's not the best approach.

edited Mar 13, 2023 at 10:02

Peter Cordes

377k reputation, 50 gold badges, 741 silver badges, 1k bronze badges

asked Mar 11, 2023 at 17:53

hyen

1 reputation, 1 bronze badge

2

movupd xmm1, [array1 + 8] loads from the same place every iteration. You need a pointer or index in a register. (e.g. in RCX if you count up towards loop_count instead of using the slow loop instruction). Also, why are you loading 2 elements at once with pd (packed double) instead of sd (scalar double) instructions? At the end you use movsd to store just the low double element, so the upper halves were useless. If you wanted to use SSE for SIMD instead of scalar, you'd advance a pointer by 16 bytes (2 elements), but you'd need scalar cleanup if the array length is odd.

– Peter Cordes
Commented Mar 11, 2023 at 22:29
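
To illustrate the SIMD variant described in the comment above, here is a minimal sketch in C with SSE2 intrinsics rather than MASM, so that it can be compiled and checked easily. The function name sqrt_dot_packed is hypothetical and not from the question; the intrinsics correspond to movupd, mulpd, addpd, addsd and sqrtsd. The index advances by two elements (16 bytes) per pass, and a scalar cleanup loop handles the leftover element when the length is odd.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: square root of the dot product of a[0..n) and b[0..n). */
double sqrt_dot_packed(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    __m128d acc = _mm_setzero_pd();              /* zero both lanes of the accumulator */
    size_t i = 0;

    /* Packed part: i advances by 2 elements (16 bytes) each pass. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);        /* movupd from a moving address */
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));   /* mulpd, then addpd */
    }

    /* Horizontal sum: add the high lane onto the low lane. */
    __m128d high = _mm_unpackhi_pd(acc, acc);
    __m128d sum = _mm_add_sd(acc, high);

    /* Scalar cleanup for the last element when n is odd. */
    for (; i < n; ++i) {
        __m128d p = _mm_mul_sd(_mm_load_sd(a + i), _mm_load_sd(b + i));
        sum = _mm_add_sd(sum, p);
    }

    return _mm_cvtsd_f64(_mm_sqrt_sd(sum, sum)); /* sqrtsd on the low lane */
}
```

With the two 7-element arrays from the question, the dot product is 84, so the result is sqrt(84), about 9.165, which matches the expected 9.16.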
2

"looks like it works fine": look more closely with a debugger at the values getting loaded into the XMM registers; they're the same every iteration. Also, your "software pipelining" forgets to multiply the last vector: the final load from array1 is just added to the sum without being multiplied.

– Peter Cordes
Commented Mar 12, 2023 at 4:02
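
The two bugs pointed out so far are enough to account for the 9.0. The sketch below is plain C that mirrors only the low lane of the question's loop, since movsd stores only the low lane at the end. The function name buggy_sum_low_lane is hypothetical, and it assumes xmm3 happened to be zero on entry, because the posted code never initializes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the low lane of the question's loop.
   Returns the value that ends up under the square root. */
double buggy_sum_low_lane(const double *a, const double *b, size_t iterations)
{
    assert(a != NULL && b != NULL && iterations > 0);

    double x1 = a[0];   /* movupd xmm1, [array1], low lane */
    double x2 = b[0];   /* movupd xmm2, [array2], low lane */
    double x3 = 0.0;    /* assumption: xmm3 was zero on entry */

    for (size_t k = 0; k < iterations; ++k) {
        x1 *= x2;       /* mulpd xmm1, xmm2 */
        x3 += x1;       /* addpd xmm3, xmm1 */
        x1 = a[1];      /* movupd xmm1, [array1 + 8]: same address every time */
        x2 = b[1];      /* movupd xmm2, [array2 + 8]: same address every time */
    }

    x1 += x3;           /* addpd xmm1, xmm3: the last load is added unmultiplied */
    return x1;
}
```

For the arrays in the question and 7 iterations this gives 1*7 + 6*(2*6) + 2 = 81, and sqrt(81) is exactly 9.0, whereas the intended dot product is 84 with sqrt(84) of about 9.165.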
1

@PaulR : One reason I didn't tag [simd] on this question is that the loop iteration count matches the element count, and they're only using the low element of the result. Like they intended to use SSE scalar operations, but accidentally used pd instead for everything except the final movsd which only saves the low element. Since scalar SSE is the simplest and standard way to do FP math on x86-64, I don't think we should assume they intended SIMD, especially when the bugs are with even more basic things.

– Peter Cordes
Commented Mar 13, 2023 at 10:02
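
If scalar SSE was what the code was meant to use, a loop that does one element per iteration could look roughly like the following sketch. It is again C with SSE2 intrinsics standing in for movsd, mulsd, addsd and sqrtsd; the function name sqrt_dot_scalar is hypothetical. The loop variable plays the role of the index register that the posted loop is missing, and the accumulator is zeroed explicitly before the loop.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: scalar-SSE square root of the dot product. */
double sqrt_dot_scalar(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    __m128d sum = _mm_setzero_pd();              /* explicit zero, like xorpd */

    for (size_t i = 0; i < n; ++i) {             /* i acts as the index register */
        __m128d x = _mm_load_sd(a + i);          /* movsd from array1 + i*8 */
        __m128d y = _mm_load_sd(b + i);          /* movsd from array2 + i*8 */
        sum = _mm_add_sd(sum, _mm_mul_sd(x, y)); /* mulsd, then addsd */
    }

    return _mm_cvtsd_f64(_mm_sqrt_sd(sum, sum)); /* sqrtsd */
}
```

Here the iteration count equals the element count, as in the question, and every product is accumulated before the single square root at the end.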
