Add two huge base-10 numbers

Question

This is some assembly code I've written to add two base-10 numbers. The input and output are represented as null-terminated ASCII strings, with digits in big-endian order.

I have also provided comments for each line explaining what it does. Is there anywhere I can use SIMD or something which will be faster here?

Since the function's called add_whole, it only adds whole numbers, a.k.a. non-negative integers. I also use my own strlen which only modifies rcx.

extern strlen

section .text
global add_whole
add_whole:
  ; Input:
  ;   - char *a -> rdi
  ;   - char *b -> rsi
  ;   - char *res -> rdx

  ; Registers used:
  ;   - rax
  ;   - rcx
  ;   - r8
  ;   - r9
  ;   - r10
  ;   - r11
  ;   - r12
  ;   - r13
  ;   - r14
  
  push   r12                  ; Save callee-saved registers used.
  push   r13                  ; In this case,
  push   r14                  ; r12, r13, and r14.
  call   strlen               ; Calculate the length of char *a and store it in rax.
  push   rax                  ; Save rax,
  push   rdi                  ; and rdi to make another call to strlen.
  mov    rdi, rsi             ; Move char *b into rdi.
  call   strlen               ; Calculate the length of char *b,
  mov    r8, rax              ; and store it in r8.
  pop    rdi                  ; Now, rax = strlen(a)
  pop    rax                  ;      r8  = strlen(b)
  mov    r9, r8               ; The next two lines of code
  cmp    rax, r8              ; put std::max(rax, r8) into
  cmovnc r9, rax              ; the r9 register.
  xor    r10b, r10b           ; r10b = false := carry_flag
  xor    rcx, rcx             ; rcx = 0 := loop_counter
.loop_1:
  lea    r13, [rcx + 1]       ; r13 = loop_counter + 1
  mov    r11b, 48             ; r11b = '0', if there are no more digits in the number/s, 
  mov    r12b, r11b           ; r12b = '0'  then these default values will be used.
  cmp    rax, r13             ; Compare strlen(a) with loop_counter + 1.
  js     .after_if_1          ; if (loop_counter < strlen(a)):
  lea    r14, [rdi+rax]       ;   r11b = a[strlen(a) - loop_counter - 1]
  sub    r14, rcx             ; The next two lines 
  dec    r14                  ; execute that one 
  mov    r11b, byte [r14]     ; line if statement.
.after_if_1:
  cmp    r8, r13              ; Compare strlen(b) with loop_counter + 1.
  js     .after_if_2          ; if (loop_counter < strlen(b)):
  lea    r14, [rsi+r8]        ;   r12b = b[strlen(b) - loop_counter - 1]
  sub    r14, rcx             ; The next two lines 
  dec    r14                  ; execute that one
  mov    r12b, byte [r14]     ; line if statement.
.after_if_2:
  mov    r13b, r11b           ; These next two lines add individual digits 
  add    r13b, r12b           ; of the numbers.
  sub    r13b, 48
  test   r10b, r10b           ; Check for carry.
  jz     .after_if_3          ; if (carry_flag):
  xor    r10b, r10b           ;   carry_flag = false
  inc    r13b                 ;   r13b++, add the carry
.after_if_3:
  cmp    r13b, 57             ; Compare the current addition result with '9'
  jle    .after_if_4          ; if (r13b > '9'):
  sub    r13b, 10             ;   r13b -= 10
  mov    r10b, 1              ;   carry_flag = true
.after_if_4:
  mov    byte[rdx+rcx], r13b  ; res[loop_counter] = r13b
  inc    rcx                  ; Increment the loop counter, which now points to the end of res.
  cmp    rcx, r9              ; Keep looping,
  js     .loop_1              ; while (loop_counter < std::max(strlen(a), strlen(b))).
  test   r10b, r10b           ; Check for a final carry.
  jz     .after_if_5          ; if (carry_flag):
  mov    byte[rdx+rcx], 49    ;   res[loop_counter] = '1'
  inc    r9                   ;   Increment r9, which stores strlen(res).
.after_if_5:
  xor    rcx, rcx             ; rcx = 0 := loop_counter
  mov    r11, r9              ; Note that r9 = strlen(res).
  shr    r11, 1               ; r11 = strlen(res) / 2
.loop_2:
  lea    r12, [rdx+r9]        ; r12 = &res[strlen(res) - loop_counter - 1]
  sub    r12, rcx
  dec    r12
  mov    r8b, byte [rdx+rcx]  ; r8b = res[loop_counter]
  mov    r10b, byte [r12]     ; r10b = *r12 = res[strlen(res) - loop_counter - 1]
  mov    byte [rdx+rcx], r10b ; res[loop_counter] = r10b
  mov    byte [r12], r8b      ; res[strlen(res) - loop_counter - 1] = r8b
  inc    rcx
  cmp    rcx, r11             ; Keep looping while (loop_counter < strlen(res) / 2)
  js     .loop_2
  mov    byte [rdx+r9], 0     ; res should be null terminated.
  pop    r14                  ; Pop used callee-saved registers.
  pop    r13
  pop    r12
  ret

Intel assembly, x86-64 nasm like assembly. And no, it only adds non-negative numbers. — avighnac
– avighnac, Commented Sep 22, 2022 at 6:08
You seem to have sketched out the procedure "in C": Did you try to have that compiled to assembly with code improvement /optimisation for reference? — greybeard
– greybeard, Commented Sep 22, 2022 at 10:45
You tagged performance, mention SIMD & faster: There used to be a decimal adjust instruction for packed decimals. Can you code something similar for "ASCII decimals"? — greybeard
– greybeard, Commented Sep 22, 2022 at 10:45

Sep Roland · Accepted Answer · 2022-09-27 20:28:17Z

Following the ABI (you are preserving R12, R13, and R14), the RDX and RSI registers are not amongst the callee-saved registers. Therefore neither might have survived those strlen calls!

Optimizations to the existing code

xor    rcx, rcx             ; rcx = 0 := loop_counter

Write this as xor ecx, ecx. It'll shave off the REX prefix and will zero the whole RCX register just as well.

lea    r14, [rdi+rax] 
sub    r14, rcx
dec    r14

Better avoid that separate dec r14 and add the 'minus 1' that it does to the lea instruction: lea r14, [rdi+rax-1].

lea    r13, [rcx + 1] ; r13 = loop_counter + 1
cmp    rax, r13       ; cmp strlen(a) with loop_counter + 1
js     .after_if_1    ; if (loop_counter < strlen(a)):

Why don't you code it the way your C code expresses it? You wouldn't need the detour via the extra R13 register. Since the meaning of '<' for unsigned numbers is 'Below' and you want to by-pass, the conditional to use is 'JumpIfNotBelow':

cmp     rcx, rax     ; cmp loop_counter, strlen(a)
jnb     .after_if_1

mov    r13b, r11b ; These next two lines add individual digits 
add    r13b, r12b ; of the numbers.
sub    r13b, 48

It is unnecessary to first copy R11B to another register. Just process the sum in the R11B register:

add    r11b, r12b      ; (*) Using R11B instead of R13B
sub    r11b, 48

  test   r10b, r10b           ; Check for carry.
  jz     .after_if_3          ; if (carry_flag):
  xor    r10b, r10b           ;   carry_flag = false
  inc    r13b                 ;   r13b++, add the carry
.after_if_3:

The carry variable in the R10B register is limited to the values of 0 and 1. If it holds 0, you don't want to do anything, and if it is 1, you want to increment the sum and reset the carry variable. Next code is much simpler and does just that:

add     r11b, r10b     ; (*) Using R11B instead of R13B
xor     r10b, r10b

mov    byte[rdx+rcx], r13b  ; res[loop_counter] = r13b

Because you store the result of the addition reversed, you have added a string reversal routine to the code. This is an inefficient approach. The final result will have the same number of digits as the longest of the two inputs, or maybe just one extra digit in case of a final carry.

lea    rdx, [rdx+1+r9] ; R9 is longest input
mov    byte [rdx], 0   ; Zero-terminated result

Writing an optimized version

Don't do the whole calculation in a single loop. First process the positions that both inputs have in common, then process the positions that remain in the longer of the two inputs. You'll have much less conditional branches to execute.
Carefully choosing your registers can shave off a lot of bytes via not requiring a REX prefix, being able to use the short accumulator encodings, and not having to preserve callee-saved registers.

Input:        RDI                RSI       RDX
              v                  v         v
a, b, res ->  1158791457863133@  2678139@  ..................


Setup:        RDI             R8 RSI    R9                  RDX
              v               v  v      v                   v
a, b, res ->  1158791457863133@  2678139@  .................@


Loops:          LoopB   LoopA
              /-------\/-----\
a         ->  1158791457863133@
b         ->           2678139@
res       -> .1158791460541272@


Output:       R8                 R9         RDX
              v                  v          v
a, b, res ->  1158791457863133@  2678139@  .1158791460541272@
                                            ^
                                            RAX

; IN (rdx,rsi,rdi) OUT (rax) MOD (rcx,rdx,rsi,rdi,r8,r9)
  push   rbx              ; The only callee-saved register to preserve

  push   rdx              ; (1)
  push   rsi              ; (2)
  push   rdi              ; (3)
  call   strlen           ; -> RAX is strlen(a)
  mov    ebx, eax         ; For sure less than 4GB
  mov    rdi, [rsp+8]
  call   strlen           ; -> RAX is strlen(b)
  pop    rdi              ; (3)
  pop    rsi              ; (2)
  pop    rdx              ; (1)

  ; At RDI are RBX bytes, at RSI are RAX bytes

  cmp    ebx, eax
  jae    .NoSwap
  xchg   rdi, rsi         ; Make (RDI:RBX) refer to the longer input
  xchg   ebx, eax
.NoSwap:
  mov    ecx, eax         ; RAX is length of the shorter input
  lea    r8, [rdi+rbx]    ; R8 points at the terminating zero
  lea    r9, [rsi+rax]    ; R9 points at the terminating zero
  lea    rdx, [rdx+1+rbx] ; RBX is length of the longer input
  xor    ebx, ebx         ; BL is carry [0,1]
  mov    [rdx], bl        ; Zero-terminating the result
  jecxz  .NoLoopA

.LoopA:                   ; Deals with the common positions
  movzx  eax, byte [r8-1]
  add    al, [r9-1]
  sub    al, '0'
  add    al, bl           ; Plus carry
  xor    ebx, ebx         ; Clear carry
  cmp    al, '9'
  jbe    .CharOK
  sub    al, 10
  inc    ebx              ; Set carry
.CharOK:
  mov    [rdx-1], al
  dec    r9
  dec    r8
  dec    rdx
  dec    rcx
  jnz    .LoopA

.NoLoopA:
  cmp    r8, rdi
  jna    .NoLoopB

.LoopB:                   ; Deals with remainder of the longer input
  movzx  eax, byte [r8-1]
  add    al, bl           ; Plus carry
  xor    ebx, ebx         ; Clear carry
  cmp    al, '9'
  jbe    .CharOK_
  sub    al, 10
  inc    ebx              ; Set carry
.CharOK_:
  mov    [rdx-1], al
  dec    r8
  dec    rdx
  cmp    r8, rdi
  ja     .LoopB

.NoLoopB:
  test   bl, bl           ; Check for a final carry.
  jz     .Done
  dec    rdx
  mov    byte [rdx], '1'

.Done:
  mov    rax, rdx         ; RAX is ptr to zero-terminated result
  pop    rbx
  ret

Tip: The code can be optimized with a further loop, because once the addition in LoopB does not produce another carry, we can exit from that loop and copy the remainder of the string unmodified. If your bigints have very different lengths, this can speed up a lot...

Refinement

Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it without it affecting the speed?

The little detail that sits in the way for what you desire, is the possibility of having to prepend a "1" digit in case of a final carry. The probability that a final carry occurs IMO would be extremely small when working with these extremely long numbers. Therefore I suggest we rewrite the algorithm so as to ignore a potential final carry at first, but if in the end we do discover one, we just copy the result one byte up in memory and prepend the "1" digit. Sure, this will cost extra when it runs, but it will almost never have to run.

; Version that leaves RDX unchanged
; IN (rdx,rsi,rdi) OUT () MOD (rax,rcx,rsi,rdi,r8,r9)
  push   rbx              ; The only callee-saved register to preserve

  push   rdx              ; (1)
  push   rsi              ; (2)
  push   rdi              ; (3)
  call   strlen           ; -> RAX is strlen(a)
  mov    ebx, eax         ; For sure less than 4GB
  mov    rdi, [rsp+8]
  call   strlen           ; -> RAX is strlen(b)
  pop    rdi              ; (3)
  pop    rsi              ; (2)
  pop    rdx              ; (1)

  ; At RDI are RBX bytes, at RSI are RAX bytes

  cmp    ebx, eax
  jae    .NoSwap
  xchg   rdi, rsi         ; Make (RDI:RBX) refer to the longer input
  xchg   ebx, eax
.NoSwap:
  mov    ecx, eax         ; RAX is length of the shorter input
  lea    r8, [rdi+rbx]    ; R8 points at the terminating zero
  lea    r9, [rsi+rax]    ; R9 points at the terminating zero
  add    rdx, rbx         ; RBX is length of the longer input
  push   rdx              ; (4) "in case there's a final carry"
  xor    ebx, ebx         ; BL is carry [0,1]
  mov    [rdx], bl        ; Zero-terminating the result
  jecxz  .NoLoopA

.LoopA:                   ; Deals with the common positions
  movzx  eax, byte [r8-1]
  add    al, [r9-1]
  sub    al, '0'
  add    al, bl           ; Plus carry
  xor    ebx, ebx         ; Clear carry
  cmp    al, '9'
  jbe    .CharOK
  sub    al, 10
  inc    ebx              ; Set carry
.CharOK:
  mov    [rdx-1], al
  dec    r9
  dec    r8
  dec    rdx
  dec    rcx
  jnz    .LoopA

.NoLoopA:
  cmp    r8, rdi
  jna    .NoLoopB

.LoopB:                   ; Deals with remainder of the longer input
  movzx  eax, byte [r8-1]
  add    al, bl           ; Plus carry
  xor    ebx, ebx         ; Clear carry
  cmp    al, '9'
  jbe    .CharOK_
  sub    al, 10
  inc    ebx              ; Set carry
.CharOK_:
  mov    [rdx-1], al
  dec    r8
  dec    rdx
  cmp    r8, rdi
  ja     .LoopB

.NoLoopB:
  pop    rcx              ; (4)
  test   bl, bl           ; Check for a final carry.
  jz     .Done
.CopyUp:
  mov    al, [rcx]
  mov    [rcx+rbx], al    ; RBX=1
  dec    rcx
  cmp    rcx, rdx
  jae    .CopyUp
  mov    byte [rdx], '1'

.Done:
  pop    rbx              ; Unchanged RDX is ptr to zero-terminated result
  ret

Not only does the extra .CopyUp code very seldom have to run, since it's a simple memcpy, you can replace it with any specialized version of it (or just invoke one) should you deem this useful.

Wow, thank you. I compared my code with the modifications and your optimized version with strings of lengths of 10M. Yours was faster by 0.3 seconds (2.9 and 2.6). Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it without it affecting the speed? — avighnac
– avighnac, Commented Sep 26, 2022 at 8:12
@avighnac I have amended my answer with a refinement that addresses the issue from your comment. — Sep Roland
– Sep Roland, Commented Oct 2, 2022 at 12:28
Yes, hello, I noticed. Sorry, I haven't been active on this website over the last week and will not be this week either. My exams are on right now, but I'll definitely check this out after they're over! Again, sorry for the late reply! — avighnac
– avighnac, Commented Oct 4, 2022 at 13:38
Hey, could you explain why your code's slightly faster? What did you do that optimized it? — avighnac
– avighnac, Commented Oct 6, 2022 at 13:45
@avighnac I take it that you are referring to the 0.3 s speed increase between your code with optimizations applied (those that I wrote about in the first part of my answer), and then the optimized version that I proposed in the middle part of my answer. The key points are (1) branch reduction, and (2) short and simple instructions. — Sep Roland
– Sep Roland, Commented Oct 6, 2022 at 16:04

Stack Exchange Network

Add two huge base-10 numbers

1 Answer 1

Optimizations to the existing code

Writing an optimized version

Refinement

You must log in to answer this question.

Linked

Hot Network Questions

Add two huge base-10 numbers

1 Answer 1

Optimizations to the existing code

Writing an optimized version

Refinement

You must log in to answer this question.

Linked

Related

Hot Network Questions