mov ax, arrW
xchg ax, arrSW
xchg ax, arrW
mov ax, [arrW +2]
The first thing that struck me is that second xchg. There's no point in loading the AX register when the very next instruction loads AX again. The first rewrite, which also gives a 20% speed increase on the 8086, therefore is:
mov ax, [arrW]
xchg ax, [arrSW]
mov [arrW], ax
mov ax, [arrW + 2]
xchg ax, [arrSW + 2]
mov [arrW + 2], ax
mov ax, [arrW + 4]
xchg ax, [arrSW + 4]
mov [arrW + 4], ax
A solution that avoids the xchg instruction doesn't pay off on the 8086, but is the way to go on x86 in general. E.g. the next snippet is 10% slower on the 8086:
mov ax, [arrW]
mov bx, [arrSW]
mov [arrW], bx
mov [arrSW], ax
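For completeness, a sketch of that mov-only pattern unrolled for all three words; it just repeats the pair above at offsets +2 and +4, nothing new is assumed:

mov ax, [arrW]
mov bx, [arrSW]
mov [arrW], bx
mov [arrSW], ax
mov ax, [arrW + 2]
mov bx, [arrSW + 2]
mov [arrW + 2], bx
mov [arrSW + 2], ax
mov ax, [arrW + 4]
mov bx, [arrSW + 4]
mov [arrW + 4], bx
mov [arrSW + 4], ax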
A loop can't beat your current unrolled code, but if the arrays were to become larger, this is what it could look like:
mov cx, 3
mov si, OFFSET arrW
mov di, OFFSET arrSW
More:
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
add si, 2
add di, 2
dec cx
jnz More
If the arrays arrW and arrSW follow each other in memory, the same loop is better written as:
mov bx, OFFSET arrW
More:
mov ax, [bx]
mov dx, [bx + 6]
mov [bx], dx
mov [bx + 6], ax
add bx, 2
cmp bx, OFFSET arrSW
jb More
If the CPU supports 32-bit registers, then working with dwords can halve the number of required iterations. If the count of elements is odd (39 in the example below), we peel off one word-sized swap:
mov cx, 39
mov si, OFFSET arrW
mov di, OFFSET arrSW
shr cx, 1
jnc More ; Count was even
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
add si, 2
add di, 2
More:
mov eax, [si]
mov edx, [di]
mov [si], edx
mov [di], eax
add si, 4
add di, 4
dec cx
jnz More
The above code peels off one word-sized swap at the start of the loop. As @PeterCordes wrote in his comment below this answer, it is generally better to put the peeled-off swap at the end (for reasons of data alignment). This is that version:
mov cx, 39
mov si, OFFSET arrW
mov di, OFFSET arrSW
shr cx, 1 ; -> CF is set if count is odd
jz Next \
More: |
mov eax, [si] |
mov edx, [di] |
mov [si], edx |
mov [di], eax | Nothing changes the CF
lea si, [si + 4] |
lea di, [di + 4] |
dec cx |
jnz More |
Next: /
jnc Done ; (*) Count was even
mov ax, [si]
mov dx, [di]
mov [si], dx
mov [di], ax
Done:
From the comments below this answer:

xchg with memory is never the most efficient way to do anything. It has an implicit lock, which makes it atomic, but also very expensive. Four mov instructions will be cheaper than mov/xchg/xchg; try mov ax, [arrW] ; mov bx, [arrSW] ; mov [arrSW], ax ; mov [arrW], bx.

The same without xchg, using 32-bit registers: mov eax, [arrW] ; mov dx, [arrW+4] ; mov ebx, [arrSW] ; mov cx, [arrSW+4] ; mov [arrW], ebx ; mov [arrW+4], cx ; mov [arrSW], eax ; mov [arrSW+4], dx. Note: there's nothing wrong with using 32-bit registers in 16-bit code if it's being executed on a "recent" CPU.

… lock semantics. xchg with memory wasn't slow on old CPUs: www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html says on 8088 it cost 25+EA cycles (not including code-fetch), or 17 cycles on the 186, but only 5 cycles on 286/386/486. Those high costs are still less than the sum of two MOV instructions on the 8088 or 186 (12+EA and 13+EA), i.e. the same except for paying for the effective-address calculation twice (and the extra code-fetch). The atomic RMW is a perf disaster on modern CPUs.

movq xmm0, qword ptr [arrW] / movq xmm1, qword ptr [arrSW - 2] would do overlapping loads that get all the data. Then maybe a shuffle like psrldq xmm1, 2 to get the 6 bytes of arrSW at the bottom of xmm1, then a movq store to arrW which would overlap some garbage into arrSW, which we overwrite with a movd dword ptr [arrSW], xmm0 / pextrw [arrSW+4], xmm0, 2 to store the right 6 bytes.

pshufb would let you do an arbitrary shuffle within a 16-byte XMM register, keeping the last 4 bytes in place and swapping the two 6-byte chunks: movdqu xmm0, xmmword ptr [arrW] / pshufb xmm0, [shufcontrol] / movdqu xmmword ptr [arrW], xmm0 (felixcloutier.com/x86/pshufb). Otherwise you might mess around with pshufd and pshuflw.
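Building on that last comment, here is a minimal sketch of the pshufb variant (my addition, not from the comments). It assumes an SSSE3-capable CPU, that arrW and arrSW are adjacent 3-word arrays as above, that the 4 bytes following arrSW may be read and written back unchanged, and that the mask label shufcontrol is introduced purely for illustration:

shufcontrol db 6, 7, 8, 9, 10, 11 ; the 6 bytes of arrSW move to the front
            db 0, 1, 2, 3, 4, 5   ; the 6 bytes of arrW move in behind them
            db 12, 13, 14, 15     ; the trailing 4 bytes stay in place

movdqu xmm1, xmmword ptr [shufcontrol] ; load the shuffle mask (no alignment assumed)
movdqu xmm0, xmmword ptr [arrW]        ; arrW[0..5], arrSW[0..5], 4 trailing bytes
pshufb xmm0, xmm1                      ; swap the two 6-byte chunks, keep the tail
movdqu xmmword ptr [arrW], xmm0        ; one unaligned store writes both arrays back

Each byte in the mask selects the source byte that ends up at that position, so the two 6-byte chunks trade places while bytes 12..15 are copied unchanged.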