
I'm new to assembly and I'm trying to swap the contents of two arrays. I have this code so far and, after testing it, I've verified that it works. However, I'm wondering whether this is the most efficient way to get the desired result, or whether there is a more efficient solution?

arrW    WORD   100h, 200h, 300h
arrSW   SWORD  -140, 200, -300

mov  ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov  ax, [arrW + 2]
xchg ax, [arrSW + 2]
xchg ax, [arrW + 2]
mov  ax, [arrW + 4]
xchg ax, [arrSW + 4]
xchg ax, [arrW + 4]
  • xchg with memory is never the most efficient way to do anything. It has an implicit lock, which makes it atomic, but also very expensive. Four mov instructions will be cheaper than mov/xchg/xchg; try mov ax, [arrW] ; mov bx, [arrSW] ; mov [arrSW], ax ; mov [arrW], bx. Commented Oct 19, 2022 at 4:22
  • Might be faster to move all array elements at the same time; like "mov eax,[arrW]; mov dx,[arrW+4]; mov ebx,[arrSW]; mov cx,[arrSW+4]; mov [arrW],ebx; mov [arrW+4],cx; mov [arrSW],eax; mov [arrSW+4],dx". Note: There's nothing wrong with using 32-bit registers in 16-bit code if it's being executed on a "recent" CPU. Commented Oct 19, 2022 at 5:23
  • @NateEldredge: Fun fact: it was apparently only 386 that introduced the implicit lock semantics. xchg with memory wasn't slow on old CPUs: www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html says on 8088 it cost 25+EA cycles (not including code-fetch), or 17 cycles on 186. But only 5 cycles on 286/386/486. Those high costs are still less than the sum of two MOV instructions on 8088 or 186 (12+EA and 13+EA), i.e. the same except for paying for the effective-address calculation twice (and the extra code-fetch). The atomic RMW is a perf disaster on modern CPUs. Commented Oct 19, 2022 at 6:47
  • @ICS: What kind of CPU are you optimizing for? An actual cycle-accurate simulation of 8086 or 286, or a modern CPU in 16-bit mode? If the latter, and if SSE is enabled in the control regs, movq xmm0, qword ptr [arrW] / movq xmm1, qword ptr [arrSW - 2] would do overlapping loads that get all the data. Then maybe a shuffle like psrldq xmm1, 2 to get the 6 bytes of arrSW at the bottom of xmm1, then a movq store to arrW which would overlap some garbage into arrSW, which we overwrite with a movd dword ptr [arrSW], xmm0 / pextrw [arrSW+4], xmm0, 2 to store the right 6 bytes. Commented Oct 19, 2022 at 6:53
  • If there's padding past the end of these and it's safe to load and store back the same bytes, SSSE3 pshufb would let you do an arbitrary shuffle within a 16-byte XMM register, keeping the last 4 bytes in place and swapping the two 6-byte chunks. movdqu xmm0, xmmword ptr [arrW] / pshufb xmm0, [shufcontrol] / movdqu xmmword ptr [arrW], xmm0. felixcloutier.com/x86/pshufb. Otherwise you might mess around with pshufd and pshuflw. Commented Oct 19, 2022 at 6:56
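
To make that pshufb suggestion concrete, here is a minimal sketch. It assumes SSSE3 is usable in this mode (as the comments discuss), that the two arrays sit back to back exactly as declared in the question, and that the 4 bytes past arrSW are safe to load and store back unchanged; the swapmask label is introduced here only for this example:

align 16
swapmask BYTE 6,7,8,9,10,11, 0,1,2,3,4,5, 12,13,14,15  ; result byte i comes from source byte swapmask[i]

 movdqu xmm0, XMMWORD PTR [arrW]      ; 16 bytes: arrW (6), arrSW (6), 4 bytes of padding
 pshufb xmm0, XMMWORD PTR [swapmask]  ; swap the two 6-byte chunks, keep the last 4 bytes in place
 movdqu XMMWORD PTR [arrW], xmm0      ; write everything back, padding unchanged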

1 Answer

mov  ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov  ax, [arrW + 2]

The first thing that struck me is that second xchg. There's no point in loading the AX register there, since the very next instruction loads AX again; a plain store is all that's needed. The first rewrite, which also gives a 20% speed increase on 8086, therefore is:

mov  ax, [arrW]
xchg ax, [arrSW]
mov  [arrW], ax
mov  ax, [arrW + 2]
xchg ax, [arrSW + 2]
mov  [arrW + 2], ax
mov  ax, [arrW + 4]
xchg ax, [arrSW + 4]
mov  [arrW + 4], ax

A solution that avoids the xchg instruction doesn't pay off on 8086, but it is the way to go on x86 in general. E.g. the next snippet is 10% slower on 8086:

mov  ax, [arrW]
mov  bx, [arrSW]
mov  [arrW], bx
mov  [arrSW], ax
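
For completeness, the same idea unrolled over all three elements could look like this (a sketch; the choice of BX as the second scratch register is arbitrary):

mov  ax, [arrW]
mov  bx, [arrSW]
mov  [arrW], bx
mov  [arrSW], ax
mov  ax, [arrW + 2]
mov  bx, [arrSW + 2]
mov  [arrW + 2], bx
mov  [arrSW + 2], ax
mov  ax, [arrW + 4]
mov  bx, [arrSW + 4]
mov  [arrW + 4], bx
mov  [arrSW + 4], ax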

A loop can't beat your current unrolled code, but if the arrays were to become larger, this is what it could look like:

 mov  cx, 3
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
More:
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
 add  si, 2
 add  di, 2
 dec  cx
 jnz  More

If the arrays arrW and arrSW follow each other in memory, the same loop is better written as:

 mov  bx, OFFSET arrW
More:
 mov  ax, [bx]
 mov  dx, [bx + 6]
 mov  [bx], dx
 mov  [bx + 6], ax
 add  bx, 2
 cmp  bx, OFFSET arrSW
 jb   More

If the CPU supports 32-bit registers, then working with dwords can halve the number of iterations required. If the element count is odd, we peel off one word-sized swap:

 mov  cx, 39
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
 shr  cx, 1
 jnc  More             ; Count was even
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
 add  si, 2
 add  di, 2
More:
 mov  eax, [si]
 mov  edx, [di]
 mov  [si], edx
 mov  [di], eax
 add  si, 4
 add  di, 4
 dec  cx
 jnz  More

The above code peels off one word-sized swap at the start of the loop. As @PeterCordes wrote in his comment below this answer, it is generally better to put the peeled-off swap at the end (for reasons of data alignment). Next is that version:

 mov  cx, 39
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
 shr  cx, 1            ; -> CF is set if count is odd
 jz   Next             \
More:                   |
 mov  eax, [si]         |
 mov  edx, [di]         |
 mov  [si], edx         |
 mov  [di], eax         | Nothing changes the CF
 lea  si, [si + 4]      |
 lea  di, [di + 4]      |
 dec  cx                |
 jnz  More              |
Next:                  /
 jnc  Done             ; (*) Count was even
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
Done:

3 Comments

Your version using 32-bit registers does a word at the start if the count is odd. It's usually more common for array starts to be aligned but sizes to be odd, where this strategy would make all the dword accesses misaligned. (On modern x86 CPUs, usually only a penalty when crossing a cache-line boundary.) So if you're not going to actually check alignment, it's often good to do the odd cleanup at the end, not start.
Instead of add si, 2 \ add di, 2 you can use cmpsw (assuming the Direction Flag is cleared). This instruction modifies flags and happens to add 2 to si and di.
@ecm 1 byte versus 6 bytes is indeed a good deal if we had to optimize for code size. I have used string primitives more than once for this purpose, but it sucks speed-wise: on 8086 cmpsw clocks at 22, while add si, 2 / add di, 2 clocks at 4 + 4 (on 386 it's 10 vs 2 + 2).
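
For illustration, this is what the word-at-a-time loop from the answer could look like with that cmpsw trick. A sketch only: it assumes ES = DS (cmpsw reads DS:SI and ES:DI even though the comparison result is ignored) and that the Direction Flag is clear, and as noted above it trades speed for code size:

 mov  cx, 3
 mov  si, OFFSET arrW
 mov  di, OFFSET arrSW
 cld                   ; make sure SI and DI move forward
More:
 mov  ax, [si]
 mov  dx, [di]
 mov  [si], dx
 mov  [di], ax
 cmpsw                 ; compare result is ignored; side effect: SI += 2, DI += 2
 dec  cx               ; dec sets ZF for the branch after cmpsw clobbered the flags
 jnz  More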
