I cannot get non-ASCII characters to display properly in the DuckDB console, even though the console application supports UTF-8. I have a sample UTF-8-encoded CSV file containing a few test strings:

Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते

After starting duckdb.exe from the Windows console (chcp reports code page 852), I run the command

SELECT * FROM read_csv('hello_in_languages.csv');

and the response (in duckbox output mode) shows the non-Latin characters replaced by question marks, as expected for this code page:

┌───────────────┬────────────┐
│ Language Code │  Greeting  │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ pl            │ Cześć      │
│ de            │ Grüß dich  │
│ el            │ ???? ???   │
│ ru            │ ??????     │
│ ar            │ ?????      │
│ he            │ ????       │
│ ja            │ ????? │
│ ko            │ ????? │
│ hi            │ ??????       │
├───────────────┴────────────┤
│ 9 rows           2 columns │
└────────────────────────────┘
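
The pattern above (Polish and German intact, everything else turned into one "?" per character) is what a conversion to code page 852 with replacement of unmappable characters would produce. The following Python sketch only illustrates that idea; the function name is hypothetical and it makes no claim about how DuckDB performs the conversion internally.

```python
def render_in_code_page(text: str, code_page: str = "cp852") -> str:
    """Round-trip text through a legacy code page.

    Characters the code page cannot represent are replaced by '?',
    which mimics what the console output above looks like.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


# Cyrillic and Japanese are not part of code page 852, so every character is lost.
print(render_in_code_page("Привет"))      # ??????
print(render_in_code_page("こんにちは"))  # ?????

# Polish and German letters exist in code page 852, so they survive unchanged.
print(render_in_code_page("Cześć") == "Cześć")          # True
print(render_in_code_page("Grüß dich") == "Grüß dich")  # True
```

The number of question marks matches the number of code points in each greeting, which is consistent with the table above.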

Then I switch the shell's code page to UTF-8 using the Windows command chcp 65001:

.shell chcp 65001

and I see a different issue:

����������������������������Ŀ
� Language Code �  Greeting  �
�    varchar    �  varchar   �
����������������������������Ĵ
� pl            � Cze��      �
� de            � Gr�� dich  �
� el            � ???? ???   �
� ru            � ??????     �
� ar            � ?????      �
� he            � ????       �
� ja            � ????? �
� ko            � ????? �
� hi            � ??????       �
����������������������������Ĵ
� 9 rows           2 columns �
������������������������������
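
A possible explanation for this second output, offered as an assumption rather than a confirmed description of DuckDB's behavior: the program still writes bytes encoded for code page 852, while the console now interprets them as UTF-8. The Python sketch below (hypothetical function name) reproduces that mismatch and prints code points so the result is easy to inspect.

```python
def misread_as_utf8(text: str, written_as: str = "cp852") -> str:
    """Encode text for a legacy code page, then decode the bytes as UTF-8.

    Invalid UTF-8 sequences become U+FFFD (the replacement character),
    as a UTF-8 console would show them.
    """
    raw = text.encode(written_as, errors="replace")
    return raw.decode("utf-8", errors="replace")


# The box-drawing pair "─┐" is bytes C4 BF in code page 852, which happens
# to be a valid UTF-8 sequence for U+013F.
print([hex(ord(ch)) for ch in misread_as_utf8("─┐")])
# ['0x13f']

# The two Polish letters become single bytes that are invalid in UTF-8.
print([hex(ord(ch)) for ch in misread_as_utf8("Cześć")])
# ['0x43', '0x7a', '0x65', '0xfffd', '0xfffd']
```

U+013F is the "Ŀ" seen at the end of the top border, and the same reasoning gives "Ĵ" (U+0134) for the pair "─┤" at the end of the separator rows.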
  • This happens regardless of the input file format (UTF-8 with BOM or UTF-8 without BOM).
  • This happens regardless of the Windows console used (the old cmd console or the new Terminal app).
  • The command COPY ( SELECT * FROM read_csv('hello_in_languages.csv') ) TO 'greetings-out.csv' produces a file identical to the input file, so all characters are fully preserved during processing and what we see is only a display issue.
  • The command .shell type hello_in_languages.csv shows that the console itself supports UTF-8:
Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते
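
The claim that the COPY output is identical to the input can be checked byte for byte. A minimal Python sketch, assuming both files sit in the current directory under the names used above (the helper name is hypothetical):

```python
import filecmp


def files_identical(path_a: str, path_b: str) -> bool:
    """Return True if both files have exactly the same bytes.

    shallow=False forces a content comparison instead of relying
    only on file size and modification time.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)


# Example call (file names taken from the commands above):
# files_identical("hello_in_languages.csv", "greetings-out.csv")
```

A byte-level comparison is stricter than looking at the files in an editor, since it would also catch a changed BOM or different line endings.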

Does this mean that DuckDB cannot work properly with UTF-8 data through the console? Or is there a fix?

  • I flagged the question for migration to Super User SE. If anyone else also thinks it should be migrated, please use the Close feature to mark it for automatic migration. Commented Nov 18 at 2:00

1 Answer

.binary on

After further examining possibly related options, I found the dot command above.

With it enabled, DuckDB passes UTF-8 through to the console as expected:

.binary on
select * from read_csv('hello_in_languages.csv');
┌───────────────┬────────────┐
│ Language Code │  Greeting  │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ pl            │ Cześć      │
│ de            │ Grüß dich  │
│ el            │ Γειά σου   │
│ ru            │ Привет     │
│ ar            │ مرحبا      │
│ he            │ שלום       │
│ ja            │ こんにちは │
│ ko            │ 안녕하세요 │
│ hi            │ नमस्ते       │
├───────────────┴────────────┤
│ 9 rows           2 columns │
└────────────────────────────┘