I have a problem with my assembly code: I need to multiply two arrays element by element, sum the products, and take the square root of that sum. I've written the code and it looks like it works, but I expect to get 9.16 and I'm getting 9.0 instead.

I guess the problem is somewhere in the loop or in addpd, but I don't know how to fix it.

include /masm64/include64/masm64rt.inc
INCLUDELIB MSVCRT
option casemap:none 

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
result dq  0.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7

.code
entry_point proc

    ; Load the two arrays into SSE registers
    movupd xmm1, [array1]
    movupd xmm2, [array2]
    mov rcx, loop_count ; Number of function iterations
    loop1:
    mulpd xmm1, xmm2
    addpd xmm3, xmm1
    movupd xmm1, [array1 + 8]
    movupd xmm2, [array2 + 8]
    loop loop1

    ; Add the result and store to xmm1
    addpd xmm1, xmm3

    ; Compute the square root of the sum of squares in xmm1
    sqrtpd xmm1, xmm1

    ; Store the low double of the result to memory (res) for output
    movsd res, xmm1

    invoke fptoa, res, bufp
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end

I've also tried multiplying the two arrays without a loop, using just mulpd, but I guess that is not the best approach.
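
For reference, the expected value can be checked by hand from the data in array1 and array2. This is only a worked version of the arithmetic the question describes (the symbols a_i and b_i are my own notation for the elements of array1 and array2):

```latex
\sum_{i=1}^{7} a_i b_i
  = 1\cdot 7 + 2\cdot 6 + 3\cdot 5 + 4\cdot 4 + 5\cdot 3 + 6\cdot 2 + 7\cdot 1
  = 84,
\qquad
\sqrt{84} \approx 9.165
```

So the 9.16 mentioned above is the square root of the dot product of the two arrays.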

  • movupd xmm1, [array1 + 8] loads from the same place every iteration. You need a pointer or index in a register (e.g. in RCX, if you count up towards loop_count instead of using the slow loop instruction). Also, why are you loading 2 elements at once with pd (packed double) instead of sd (scalar double) instructions? At the end you use movsd to store just the low double element, so the upper halves were useless. If you wanted to use SSE for SIMD instead of scalar, you'd advance a pointer by 16 bytes (2 elements), but you'd need scalar cleanup if the array length is odd. Commented Mar 11, 2023 at 22:29
  • "looks like it works fine": look more closely with a debugger at the values getting loaded into the XMM registers; they're the same every iteration. Also, your "software pipelining" forgets to multiply the last vector, and instead just adds those vectors. Commented Mar 12, 2023 at 4:02
  • @PaulR: One reason I didn't tag [simd] on this question is that the loop iteration count matches the element count, and they're only using the low element of the result. Like they intended to use SSE scalar operations, but accidentally used pd instead for everything except the final movsd which only saves the low element. Since scalar SSE is the simplest and standard way to do FP math on x86-64, I don't think we should assume they intended SIMD, especially when the bugs are with even more basic things. Commented Mar 13, 2023 at 10:02
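
The points raised in the comments can be put together as a scalar version of the loop. This is a minimal, untested sketch rather than a definitive fix. It assumes the same masm64 SDK setup, data, and fptoa/MessageBox calls as the question, and the use of RDX and R8 as base pointers and RCX as the element index is my own choice. Tracing the original code by hand (assuming xmm3 happens to be zero on entry, since it is never initialized) also explains the 9.0: the low lane gets 1*7 = 7 in the first iteration, then 2*6 = 12 in each of the remaining six iterations, for 79; the final addpd adds the reloaded, unmultiplied 2.0, giving 81, and sqrt(81) = 9.0.

```asm
include /masm64/include64/masm64rt.inc
INCLUDELIB MSVCRT
option casemap:none

.data
array1 dq  1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0
array2 dq  7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0
res dq 0.0
tit1 db "Result of using the SSE", 0
buf BYTE 260 dup(?)
bufp QWORD buf, 0
loop_count dq 7

.code
entry_point proc

    lea rdx, array1          ; rdx = base address of array1 (assumed register choice)
    lea r8, array2           ; r8  = base address of array2 (assumed register choice)
    xor ecx, ecx             ; rcx = element index i, starting at 0
    xorpd xmm3, xmm3         ; xmm3 = running sum, explicitly zeroed

    loop1:
    movsd xmm1, qword ptr [rdx + rcx*8]   ; xmm1 = array1[i], one double
    mulsd xmm1, qword ptr [r8 + rcx*8]    ; xmm1 = array1[i] * array2[i]
    addsd xmm3, xmm1                      ; sum += product
    inc rcx                               ; advance to the next element
    cmp rcx, loop_count                   ; stop once i reaches the element count
    jb loop1

    sqrtsd xmm1, xmm3        ; xmm1 = sqrt(sum), scalar double
    movsd res, xmm1          ; store the result to memory for output

    invoke fptoa, res, bufp
    invoke MessageBox, 0, bufp, addr tit1, MB_OK
    invoke ExitProcess, 0
entry_point endp
end
```

With the data above the sum of products is 84, so the message box should show roughly 9.165. Every element is multiplied inside the loop before it is added, so nothing is left over to add after the loop, and the index in RCX makes each iteration load a different pair of elements. A packed (mulpd/addpd) version would advance 16 bytes per iteration and would still need a scalar step for the seventh element, as noted in the first comment.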
