UTF-8 string has too many bytes using SBCL and babel on Windows 64 bits

Question

The UTF-8 string in my example seems to be encoded with too many bytes.

The input string: "👉TEST📍TEST"

“👉” (U+1F449): A hand pointing right
“T”, “E”, “S”, “T”: Basic Latin letters
“📍” (U+1F4CD): A round pushpin
“T”, “E”, “S”, “T”: Basic Latin letters

This string is stored in a UTF-8 encoded file. When I open the file in a hexadecimal editor, I see the 16 bytes below, as expected. When I paste the string into online tools, I get the same 16 bytes.

f0 9f 91 89 54 45 53 54 f0 9f 93 8d 54 45 53 54
 \_______/   \_______/   \_______/   \_______/
  U+1F449    T  E  S  T   U+1F4CD    T  E  S  T
   “👉”                    “📍”

However, babel:string-to-octets gives a different result, 20 bytes instead of 16:

(defun print-hex (octets)
  (dotimes (offset (length octets))
    (let ((byte (aref octets offset)))
      (format t "~2,'0x " byte)))
  (format t "(~A bytes)~%" (length octets)))

(let ((string "👉TEST📍TEST"))
  (format t "TEST STRING [~A]~%" string)
  (print-hex (babel:string-to-octets string))
  (print-hex (babel:string-to-octets string :encoding :UTF-8)))
TEST STRING [👉TEST📍TEST]
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)

If we analyze this further:

ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54
 \_____________/   \_______/   \_____________/   \_______/
       ???         T  E  S  T       ???          T  E  S  T 
       ^^^                          ^^^
UTF-16 surrogate pair?       UTF-16 surrogate pair?

How do I get the 16 bytes from the input string?

Another behavior highlights the same issue: converting the string to octets and then back to a string signals an encoding error on the first character.

(let ((string "👉TEST📍TEST"))
  (babel:octets-to-string (babel:string-to-octets string)))

debugger invoked on a BABEL-ENCODINGS:CHARACTER-OUT-OF-RANGE in thread
#<THREAD "main thread" RUNNING {100F080003}>:
  Illegal :UTF-8 character starting at position 0.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

Edit: the issue seems to be specific to SBCL on Windows; the program runs correctly on Debian Linux.

Thank you for your interest, but I don't really understand your question. I wrote the given string in a UTF-8 Lisp source file and read it back with a hexadecimal editor, where I found 16 bytes. I then double-checked with the online tool mentioned above, so I am quite sure my input is UTF-8. The output seems wrong, but I have no clue why yet.
– Robert, Commented Mar 29, 2024 at 7:39
I see a pattern: both emojis are represented by the sequence ED xx xx ED xx xx. Could you expand those (both the reference and the "suspect" encoding) to binary, MSB first? Looking at UTF-8 and other encodings, can you see a pattern?
– Christoph Rackwitz, Commented Mar 29, 2024 at 7:40
Clearly a double-encoding bug. The original codepoints U+1F449 and U+1F4CD are being encoded to UTF-16 first, as D83D DC49 and D83D DCCD respectively. Then the code units are being misinterpreted as codepoints and encoded individually to UTF-8: D83D -> ED A0 BD, DC49 -> ED B1 89, D83D -> ED A0 BD, DCCD -> ED B3 8D
– Remy Lebeau, Commented Mar 29, 2024 at 18:59
SBCL is UTF-8-encoding the UTF-16 surrogates directly, instead of decoding each surrogate pair to its codepoint and re-encoding that as UTF-8. I'm not familiar with SBCL, but I can reverse the process with Python: bytes.fromhex('ED A0 BD ED B1 89').decode(errors='surrogatepass').encode('utf-16le', 'surrogatepass').decode('utf-16le') -> 👉
– Mark Tolonen, Commented Mar 29, 2024 at 22:30
I commented that I couldn't reproduce this at first, but I have now been able to reproduce it. The code worked fine when pasted into a Slime repl, but it fails as described in the post when pasted directly into SBCL repls in CMD, PowerShell, or MSYS2 terminal windows. I had pasted the string into a file and saved it in a global variable, using that variable for testing in the other repls; in that scenario the problem did not occur.
– ad absurdum, Commented Mar 29, 2024 at 23:40
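
The arithmetic described in the comments above can be checked with plain integer operations. The following is only an illustrative sketch: the helper names surrogate-pair and utf-8-3-bytes are made up for this example and are not part of babel or SBCL. It splits a codepoint above U+FFFF into its UTF-16 surrogates and then applies the generic three-byte UTF-8 pattern to each surrogate, which is what the double encoding appears to do.

```lisp
;; Hypothetical helpers for illustration only; not part of babel or SBCL.

(defun surrogate-pair (code-point)
  "Return the UTF-16 high and low surrogates for a CODE-POINT above U+FFFF."
  (let ((v (- code-point #x10000)))
    (values (+ #xD800 (ldb (byte 10 10) v))   ; top 10 bits of the 20-bit value
            (+ #xDC00 (ldb (byte 10 0) v))))) ; bottom 10 bits

(defun utf-8-3-bytes (code)
  "Apply the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx to CODE."
  (list (logior #xE0 (ldb (byte 4 12) code))
        (logior #x80 (ldb (byte 6 6) code))
        (logior #x80 (ldb (byte 6 0) code))))

(multiple-value-bind (high low) (surrogate-pair #x1F449)
  ;; Surrogates for U+1F449
  (format t "~X ~X~%" high low)
  ;; Each surrogate encoded on its own, as in the 20-byte output
  (format t "~{~2,'0X~^ ~}~%"
          (append (utf-8-3-bytes high) (utf-8-3-bytes low))))
```

For U+1F449 this yields the surrogates D83D DC49 and the six bytes ED A0 BD ED B1 89, the same bytes that open the 20-byte output in the question.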

ad absurdum · Accepted Answer · 2024-03-30 07:33:04Z

4

I'm pretty sure that this is a problem with the SBCL repl itself, and possibly a problem with the way that you are introducing strings into your code.

As far as the repl is concerned, the SBCL repl is not really actively developed; most lispers are probably using Slime or something similar for repl development. This is a much better experience than working with the SBCL repl. I couldn't get the posted code to misbehave in a Slime repl.

I was able to reproduce the problem with an SBCL repl. On my Windows machine, it seems that pasting the posted string literal into an SBCL repl window resulted in a string which is UTF-16 encoded. This is where I suspect there is some issue with the SBCL repl. Calling babel:string-to-octets on the pasted string yields the wrong result, as OP noted. SBCL has its own sb-ext:string-to-octets procedure, and calling that on the pasted string drops into the debugger with an SB-IMPL::OCTETS-ENCODING-ERROR error. This makes me think that the problem is somewhere on the SBCL side.
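
One way to check whether a given string is affected is to look at its character codes directly. This is a minimal sketch, not something I ran on the Windows setup: the helper names below are hypothetical, and the code assumes an implementation such as SBCL, where surrogate code points can exist as characters inside a string.

```lisp
;; Hypothetical diagnostic helpers; the names are made up for illustration.

(defun surrogate-code-p (code)
  "True if CODE lies in the UTF-16 surrogate range U+D800..U+DFFF."
  (<= #xD800 code #xDFFF))

(defun contains-surrogates-p (string)
  "True if any character of STRING is a surrogate code point."
  (some (lambda (ch) (surrogate-code-p (char-code ch))) string))

(defun print-char-codes (string)
  "Print the length of STRING and the code of each character in hex."
  (format t "~D characters:~{ ~X~}~%"
          (length string)
          (map 'list #'char-code string)))
```

A healthy copy of the test string would be expected to show 10 characters, starting with 1F449. A string built from UTF-16 code units would instead show 12 characters, starting with D83D DC49, and contains-surrogates-p would return true for it.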

As a workaround, I was able to round-trip the pasted string through a UTF-16 encoding using babel:

;; Calling on a pasted string literal:
* (print-hex (babel:string-to-octets "👉TEST📍TEST"))
ED A0 BD ED B1 89 54 45 53 54 ED A0 BD ED B3 8D 54 45 53 54 (20 bytes)
NIL

;; Round-tripping the pasted string literal:
* (print-hex (babel:string-to-octets
              (babel:octets-to-string
               (babel:string-to-octets "👉TEST📍TEST" :encoding :utf-16)
               :encoding :utf-16)))
F0 9F 91 89 54 45 53 54 F0 9F 93 8D 54 45 53 54 (16 bytes)
NIL

* (let* ((s "👉TEST📍TEST")
         (s-reencoded (babel:octets-to-string
                       (babel:string-to-octets s :encoding :utf-16)
                       :encoding :utf-16)))
    (format t "TEST STRING [~A]~%" s)
    (print-hex (babel:string-to-octets s-reencoded)))
TEST STRING [👉TEST📍TEST]
F0 9F 91 89 54 45 53 54 F0 9F 93 8D 54 45 53 54 (16 bytes)
NIL
*

Note that I was unable to make the same round-tripping work by using SBCL's sb-ext:string-to-octets and sb-ext:octets-to-string procedures.
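
A possible alternative to the round trip is to merge the surrogate pairs by hand, without any encoding library. The following is a sketch under assumptions rather than a tested fix: merge-surrogates is a hypothetical helper, and it relies on the implementation (SBCL here) allowing surrogate code points as characters in a string.

```lisp
;; Hypothetical helper; assumes surrogate code points can appear as
;; characters in a string, as they seem to in the pasted literal on SBCL.

(defun merge-surrogates (string)
  "Return a copy of STRING with each high/low surrogate pair replaced
by the single character it encodes. Unpaired surrogates are copied as-is."
  (with-output-to-string (out)
    (loop with n = (length string)
          with i = 0
          while (< i n)
          do (let* ((hi (char-code (char string i)))
                    ;; Code of the next character, or 0 at the end of the string
                    (lo (if (< (1+ i) n)
                            (char-code (char string (1+ i)))
                            0)))
               (cond ((and (<= #xD800 hi #xDBFF)
                           (<= #xDC00 lo #xDFFF))
                      ;; Combine the two 10-bit halves into one code point
                      (write-char (code-char (+ #x10000
                                                (ash (- hi #xD800) 10)
                                                (- lo #xDC00)))
                                  out)
                      (incf i 2))
                     (t
                      (write-char (char string i) out)
                      (incf i)))))))
```

If the pasted string really holds surrogate characters, then (print-hex (babel:string-to-octets (merge-surrogates s))) should give the same 16 bytes as the round trip above, though I have only reasoned about this and not verified it in a Windows terminal.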

The OP has said: "This string is stored in a UTF-8 encoded file." The significance of this is unclear. Was the posted code saved in a file and loaded into a repl? I saved the posted code in a file using Emacs and Slime, using Windows Notepad with UTF-8 encoding, and using Windows Notepad with UTF-16 encoding. Every time I loaded this code from any of these files into either the SBCL repl or the Slime repl it worked as expected. This leads me to believe that the problem may be an inconvenience for playing in the repl, but not an issue for real programs.

edited Mar 30, 2024 at 7:33

answered Mar 30, 2024 at 6:15

ad absurdum


4 Comments

Robert Over a year ago

Thank you for your support. It helped me figure out something new: if the code is copy/pasted into the REPL, the behavior is not the same. When the code is loaded directly from a Lisp file, the number of bytes is correct, but the characters printed are incorrect (no icon). When the code is copy/pasted, the number of bytes is incorrect but the proper icons are printed.

Robert Over a year ago

So for the time being, my problem is not really solved. Maybe uiop:run-program uses some kind of console behind the scenes, which leads to this encoding issue. I will try to skip uiop:run-program and go with a Common Lisp cURL binding. I understand that there is an encoding issue, but I still have no clue how to work around it.

Robert Over a year ago

What led me to look for support is a real program; I tried to write the smallest program that highlights my problem. My original issue is the following: I download a webpage with cURL through uiop:run-program and insert it into an SQLite database. It works 99% of the time, except on special webpages containing these strange emoticons. The insertion always works, but when the problem occurs the bytes inserted into SQLite are incorrect. Later, an encoding error is triggered when reading the database, when cffi tries to decode a UTF-8 string from an alien string (using babel).

ad absurdum Over a year ago

@Robert -- maybe you should try to reproduce your issue using Clisp or Clozure Common Lisp to see if the problem is with SBCL or some other part of your workflow.
