mov ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov ax, [arrW +2]
The first thing that struck me is that second xchg. There's no point in loading the AX register when the very next instruction loads AX again. The first rewrite, which also gives a 20% speed increase on the 8086, therefore is:
mov ax, [arrW]
xchg ax, [arrSW]
mov [arrW], ax
mov ax, [arrW + 2]
xchg ax, [arrSW + 2]
mov [arrW + 2], ax
mov ax, [arrW + 4]
xchg ax, [arrSW + 4]
mov [arrW + 4], ax
A solution that avoids the xchg instruction doesn't pay off on the 8086, but is the way to go on x86 in general. E.g. the next snippet is 10% slower on the 8086:
mov ax, [arrW]
mov bx, [arrSW]
mov [arrW], bx
mov [arrSW], ax
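For completeness, a sketch of that mov-only pattern unrolled for all three words; it just repeats the pair above at offsets +2 and +4, nothing new is assumed:

mov ax, [arrW]
mov bx, [arrSW]
mov [arrW], bx
mov [arrSW], ax
mov ax, [arrW + 2]
mov bx, [arrSW + 2]
mov [arrW + 2], bx
mov [arrSW + 2], ax
mov ax, [arrW + 4]
mov bx, [arrSW + 4]
mov [arrW + 4], bx
mov [arrSW + 4], ax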
A loop can't beat your current unrolled code, but if the arrays were to become larger, this is what it could look like:
mov cx, 3
mov si, OFFSET arrW
mov di, OFFSET arrSW
More:
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
add si, 2
add di, 2
dec cx
jnz More
If the arrays arrW and arrSW follow each other in memory, the same loop is better written as:
mov bx, OFFSET arrW
More:
mov ax, [bx]
mov dx, [bx + 6]
mov [bx], dx
mov [bx + 6], ax
add bx, 2
cmp bx, OFFSET arrSW
jb More
If the CPU supports 32-bit registers, then working with dwords can halve the number of required iterations. If the count of elements is odd (39 in the example below), we peel off one word-sized swap:
mov cx, 39
mov si, OFFSET arrW
mov di, OFFSET arrSW
shr cx, 1
jnc More ; Count was even
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
add si, 2
add di, 2
More:
mov eax, [si]
mov edx, [di]
mov [si], edx
mov [di], eax
add si, 4
add di, 4
dec cx
jnz More
The above code peels off one word-sized swap at the start of the loop. As @PeterCordes wrote in his comment below this answer, it is generally better to put the peeled-off swap at the end (for reasons of data alignment). This is that version:
mov cx, 39
mov si, OFFSET arrW
mov di, OFFSET arrSW
shr cx, 1 ; -> CF is set if count is odd
jz Next \
More: |
mov eax, [si] |
mov edx, [di] |
mov [si], edx |
mov [di], eax | Nothing changes the CF
lea si, [si + 4] |
lea di, [di + 4] |
dec cx |
jnz More |
Next: /
jnc Done ; (*) Count was even
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
Done:
From the comments below this answer:

xchg with memory is never the most efficient way to do anything. It has an implicit lock, which makes it atomic, but also very expensive. Four mov instructions will be cheaper than mov/xchg/xchg; try mov ax, [arrW] ; mov bx, [arrSW] ; mov [arrSW], ax ; mov [arrW], bx.

The same without xchg, using 32-bit registers: mov eax, [arrW] ; mov dx, [arrW+4] ; mov ebx, [arrSW] ; mov cx, [arrSW+4] ; mov [arrW], ebx ; mov [arrW+4], cx ; mov [arrSW], eax ; mov [arrSW+4], dx. Note: there's nothing wrong with using 32-bit registers in 16-bit code if it's being executed on a "recent" CPU.

… lock semantics. xchg with memory wasn't slow on old CPUs: www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html says on 8088 it cost 25+EA cycles (not including code-fetch), or 17 cycles on the 186, but only 5 cycles on 286/386/486. Those high costs are still less than the sum of two MOV instructions on the 8088 or 186 (12+EA and 13+EA), i.e. the same except for paying for the effective-address calculation twice (and the extra code-fetch). The atomic RMW is a perf disaster on modern CPUs.

movq xmm0, qword ptr [arrW] / movq xmm1, qword ptr [arrSW - 2] would do overlapping loads that get all the data. Then maybe a shuffle like psrldq xmm1, 2 to get the 6 bytes of arrSW at the bottom of xmm1, then a movq store to arrW which would overlap some garbage into arrSW, which we overwrite with a movd dword ptr [arrSW], xmm0 / pextrw [arrSW+4], xmm0, 2 to store the right 6 bytes.

pshufb would let you do an arbitrary shuffle within a 16-byte XMM register, keeping the last 4 bytes in place and swapping the two 6-byte chunks: movdqu xmm0, xmmword ptr [arrW] / pshufb xmm0, [shufcontrol] / movdqu xmmword ptr [arrW], xmm0 (felixcloutier.com/x86/pshufb). Otherwise you might mess around with pshufd and pshuflw.
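Building on that last comment, here is a minimal sketch of the pshufb variant (my addition, not from the comments). It assumes an SSSE3-capable CPU, that arrW and arrSW are adjacent 3-word arrays as above, that the 4 bytes following arrSW may be read and written back unchanged, and that the mask label shufcontrol is introduced purely for illustration:

shufcontrol db 6, 7, 8, 9, 10, 11 ; the 6 bytes of arrSW move to the front
            db 0, 1, 2, 3, 4, 5   ; the 6 bytes of arrW move in behind them
            db 12, 13, 14, 15     ; the trailing 4 bytes stay in place

movdqu xmm1, xmmword ptr [shufcontrol] ; load the shuffle mask (no alignment assumed)
movdqu xmm0, xmmword ptr [arrW]        ; arrW[0..5], arrSW[0..5], 4 trailing bytes
pshufb xmm0, xmm1                      ; swap the two 6-byte chunks, keep the tail
movdqu xmmword ptr [arrW], xmm0        ; one unaligned store writes both arrays back

Each byte in the mask selects the source byte that ends up at that position, so the two 6-byte chunks trade places while bytes 12..15 are copied unchanged.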