I am rephrasing my question here. I am using the AWS DMS tool to migrate from Oracle to PostgreSQL. The source (Oracle) character set is AL32UTF8 and the target (PostgreSQL) character set is UTF8.
At the source I have a column of datatype VARCHAR2(4000) that stores text like this:
This will be my first time visiting Seattle. 😊
When I try to migrate this, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
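For context, the rejected bytes 0xed 0xa0 0xbd are what a UTF-16 high surrogate (U+D83D) looks like when it is encoded on its own as a three-byte UTF-8 sequence. Strict UTF-8 forbids encoded surrogates, which is why PostgreSQL refuses them. The Python sketch below illustrates this outside the database. It assumes the emoji in the sample row is U+1F60A, and the helper names `encode_surrogate` and `is_valid_utf8` are made up for this example.

```python
# Assumption: the emoji in the sample row is U+1F60A.
ch = "\U0001F60A"

# Proper UTF-8: one 4-byte sequence for the whole code point.
proper = ch.encode("utf-8")

# Split the character into its two UTF-16 surrogate code units.
units = ch.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")


def encode_surrogate(code_unit: int) -> bytes:
    """Encode a lone surrogate as 3 bytes, which strict UTF-8 does not allow."""
    return chr(code_unit).encode("utf-8", "surrogatepass")


# What ends up in the column when each surrogate is encoded separately.
broken = encode_surrogate(high) + encode_surrogate(low)


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(proper.hex(" "))        # f0 9f 98 8a
print(broken.hex(" "))        # ed a0 bd ed b8 8a
print(is_valid_utf8(proper))  # True
print(is_valid_utf8(broken))  # False
```

The first three bytes of `broken` are exactly the sequence named in the error message.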
There is a way in DMS to skip this, but the problem is that I have to run the DMS task every time, wait for it to fail with an invalid byte sequence error, and then skip that sequence. So far I have collected these:
0xed 0xa4 0x88
0xed 0xbd 0x95
0xed 0xa9 0x8e
0xed 0xbc 0xb8
0xed 0xaa 0xbe
0xed 0xba 0xb5
0xed 0xaf 0x83
0xed 0xb5 0xaa
0xed 0xa0 0xbc
0xed 0xbc 0x9f
0xed 0xa0 0xbd
0xed 0xb8 0xa0
0xed 0xbe 0x88
0xed 0xb1 0x8e
0xed 0xb1 0x8e
0xed 0xb1 0x8d
0xed 0xb3 0x99
0xed 0xb1 0x9f
0xed 0xbe 0xa7
0xed 0xb1 0x8c
0xed 0xa0 0xbe
0xed 0xb4 0x96
0xed 0xba 0x80
0xed 0xb4 0xb1
0xed 0xb0 0xa7
0xed 0xbe 0xb8
0xed 0xbe 0xb5
0xed 0xb7 0xbb
0xed 0xb1 0x86
0xed 0xbe 0xb6
0xed 0xbf 0x8a
0xed 0xb0 0xab
0xed 0xb0 0x95
0xed 0xb0 0x94
0xed 0xb0 0x99
0xed 0xb0 0xb1
0xed 0xbf 0x84
0xed 0xba 0x82
0xed 0xb4 0xa8
0xed 0xb0 0xaf
0xed 0xb0 0xb8
0xed 0xb3 0x9e
0xed 0xb4 0xa7
0xed 0xbe 0x81
0xed 0xb1 0x87
From one of the forum posts here, I got the following query:
select CASE
         INSTR (
           RAWTOHEX (
             utl_raw.cast_to_raw (
               utl_i18n.raw_to_char (
                 utl_raw.cast_to_raw ( <your_column> )
                 , 'utf8'
               )
             )
           )
           , 'EFBFBD'
         )
         WHEN 0 THEN 'OK'
         ELSE 'FAIL'
       END
from <your_table>
;
Is it possible to modify the above query so that it uses a regular expression to check for all of these illegal UTF-8 encodings?
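As a rough illustration of what such a check could look like, here is a Python sketch that works on the raw bytes outside Oracle. It relies on the observation that every sequence in the list above is 0xed followed by a byte in 0xa0..0xbf and one continuation byte, which is the shape of an encoded UTF-16 surrogate. The names `SURROGATE_BYTES`, `has_encoded_surrogate`, and `repair_surrogate_pairs` are hypothetical, and the repair step assumes the surrogates always come in valid high/low pairs.

```python
import re

# 0xED, then 0xA0..0xBF, then any continuation byte: an encoded surrogate
# (U+D800..U+DFFF). Valid UTF-8 only allows 0x80..0x9F after 0xED.
SURROGATE_BYTES = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def has_encoded_surrogate(raw: bytes) -> bool:
    """Return True if the bytes contain at least one encoded surrogate."""
    return SURROGATE_BYTES.search(raw) is not None


def repair_surrogate_pairs(raw: bytes) -> bytes:
    """Recombine encoded surrogate pairs into proper 4-byte UTF-8.

    Assumes every surrogate is part of a valid high/low pair; an unpaired
    surrogate makes the UTF-16 decode step raise UnicodeDecodeError.
    """
    # Let the lone surrogates through instead of failing on them.
    text = raw.decode("utf-8", "surrogatepass")
    # A UTF-16 round trip merges each high/low pair into one code point.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le").encode("utf-8")


sample = b"Seattle. \xed\xa0\xbd\xed\xb8\x8a"
print(has_encoded_surrogate(sample))
print(repair_surrogate_pairs(sample))
```

Matching on the general byte pattern avoids having to collect the offending sequences one DMS run at a time. Whether the same pattern can be expressed directly in an Oracle query depends on how the column bytes are exposed there, which this sketch does not cover.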
Additionally, I was able to complete the migration after changing client_encoding to LATIN1, but this is what I got on the PostgreSQL end:
This will be my first time visiting Seattle. э НэИ
Please review and comment
AL32UTF8 then all characters are UTF-8, otherwise Oracle automatically replaces them by ¿ UTF8 (or AL32UTF8).