I cannot get non-ASCII characters to display properly in the DuckDB console, even though the console application supports UTF-8. I have a sample CSV file encoded in UTF-8 containing a few test strings:
Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते
After starting duckdb.exe from the Windows console (chcp reports code page 852), I run the command
SELECT * FROM read_csv('hello_in_languages.csv');
and the response (in duckbox output mode) shows the non-Latin characters replaced by question marks, as expected:
┌───────────────┬────────────┐
│ Language Code │  Greeting  │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ pl            │ Cześć      │
│ de            │ Grüß dich  │
│ el            │ ???? ???   │
│ ru            │ ??????     │
│ ar            │ ?????      │
│ he            │ ????       │
│ ja            │ ?????      │
│ ko            │ ?????      │
│ hi            │ ??????     │
├───────────────┴────────────┤
│ 9 rows           2 columns │
└────────────────────────────┘
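The pattern of question marks looks consistent with a lossy conversion of the text to code page 852 somewhere on the way to the console. I have not confirmed that this is what DuckDB actually does; the Python sketch below only simulates such a conversion, and the helper name `render_in_code_page` is hypothetical:

```python
# Sketch only: simulates a lossy conversion to a legacy console code page.
# This is an assumption about the cause, not DuckDB's actual implementation.

def render_in_code_page(text: str, code_page: str = "cp852") -> str:
    """Encode text to the given code page, replacing unsupported characters with '?'."""
    return text.encode(code_page, errors="replace").decode(code_page)

greetings = ["Cześć", "Grüß dich", "Γειά σου", "Привет", "こんにちは"]

# Each entry is what a cp852-limited console could show for the greeting.
rendered = [render_in_code_page(g) for g in greetings]
# rendered == ["Cześć", "Grüß dich", "???? ???", "??????", "?????"]
```

Polish and German survive because code page 852 contains their letters, while every character outside that code page collapses to a single `?`, which matches the number of question marks in the output above.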
Then I switch the shell's code page to UTF-8 using the Windows command chcp 65001:
.shell chcp 65001
and I see a different issue:
����������������������������Ŀ
� Language Code � Greeting �
� varchar � varchar �
����������������������������Ĵ
� pl � Cze�� �
� de � Gr�� dich �
� el � ???? ??? �
� ru � ?????? �
� ar � ????? �
� he � ???? �
� ja � ????? �
� ko � ????? �
� hi � ?????? �
����������������������������Ĵ
� 9 rows 2 columns �
������������������������������
- This happens regardless of the input file format (UTF-8 with BOM / UTF-8 without BOM).
- This happens regardless of the Windows console (old cmd console / new Terminal app).
- The command
COPY ( SELECT * FROM read_csv('hello_in_languages.csv') ) TO 'greetings-out.csv'
produces a file identical to the input file ⇒ all characters are fully preserved during processing, and what we see is only a display issue.
- The command
.shell type hello_in_languages.csv
shows that the console normally supports UTF-8:
Language Code,Greeting
pl,Cześć
de,Grüß dich
el,Γειά σου
ru,Привет
ar,مرحبا
he,שלום
ja,こんにちは
ko,안녕하세요
hi,नमस्ते
Does this mean that DuckDB cannot work properly with UTF-8 data through the console? Or is there a fix?