Invalid UTF8 characters while migrating from Oracle to PostgreSQL

Question

I am rephrasing my question here. I am using the AWS DMS tool to migrate from Oracle to PostgreSQL. The source (Oracle) character set is AL32UTF8 and the target (PostgreSQL) character set is UTF8.

So at the source I have a column with datatype varchar2(4000), where I have something stored like this:

This will be my first time visiting Seattle. 😊

When I am trying to migrate this, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd

There is a way in DMS to skip this, but the problem is that I have to run DMS every time, wait for it to report an invalid byte sequence error, and then get past it. So far I have collected these:

0xed 0xa4 0x88
0xed 0xbd 0x95
0xed 0xa9 0x8e
0xed 0xbc 0xb8
0xed 0xaa 0xbe
0xed 0xba 0xb5
0xed 0xaf 0x83
0xed 0xb5 0xaa
0xed 0xa0 0xbc
0xed 0xbc 0x9f
0xed 0xa0 0xbd
0xed 0xb8 0xa0
0xed 0xbe 0x88
0xed 0xb1 0x8e
0xed 0xb1 0x8e
0xed 0xb1 0x8d
0xed 0xb3 0x99
0xed 0xb1 0x9f
0xed 0xbe 0xa7
0xed 0xb1 0x8c
0xed 0xa0 0xbe
0xed 0xb4 0x96
0xed 0xba 0x80
0xed 0xb4 0xb1
0xed 0xb0 0xa7
0xed 0xbe 0xb8
0xed 0xbe 0xb5
0xed 0xb7 0xbb
0xed 0xb1 0x86
0xed 0xbe 0xb6
0xed 0xbf 0x8a
0xed 0xb0 0xab
0xed 0xb0 0x95
0xed 0xb0 0x94
0xed 0xb0 0x99
0xed 0xb0 0xb1
0xed 0xbf 0x84
0xed 0xba 0x82
0xed 0xb4 0xa8
0xed 0xb0 0xaf
0xed 0xb0 0xb8
0xed 0xb3 0x9e
0xed 0xb4 0xa7
0xed 0xbe 0x81
0xed 0xb1 0x87
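
The sequences above share a pattern that can be checked outside the database. The following is a minimal Python sketch, not part of the original post, that decodes a few of them as if they were ordinary 3-byte UTF-8 sequences. The helper names `decode_three_byte` and `classify` are made up for illustration. Every sequence starting with 0xED followed by a byte in the range 0xA0 to 0xBF lands in the surrogate range U+D800 to U+DFFF, which strict UTF-8 decoders such as PostgreSQL's reject.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte UTF-8-style sequence to a code point, skipping validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def classify(seq: bytes) -> str:
    """Say whether the decoded code point is a UTF-16 surrogate."""
    cp = decode_three_byte(seq)
    if 0xD800 <= cp <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


# A few of the byte sequences reported by the migration.
for hex_seq in ("eda0bd", "edb8a0", "eda488", "edbd95"):
    seq = bytes.fromhex(hex_seq)
    print(hex_seq, hex(decode_three_byte(seq)), classify(seq))
```

If this reading is right, hunting for each byte sequence individually is unnecessary: they are all halves of surrogate pairs, so any character above U+FFFF in the source data would trigger the same error.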

From one of the forum posts here, I got the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

Is it possible to modify the above query so that it uses a regular expression to check for all of those illegal UTF8 byte sequences?

Additionally, I was able to do the migration successfully after changing the client_encoding to LATIN1, but I was getting this on the PG end:

This will be my first time visiting Seattle. э НэИ

Please review and comment

Sorry, I didn't understand your question. Do you need something like this? SELECT * FROM (select asciistr(convert(table_name, 'UTF8')) AS str FROM table_ex)
– Moudiz, Commented Jun 7, 2017 at 6:14
What do you mean by "non UTF8 complaint"? If your database character set is AL32UTF8 then all characters are UTF-8, otherwise Oracle automatically replaces them with ¿
– Wernfried Domscheit, Commented Jun 7, 2017 at 7:35
I did not ask you to change the character set of the database. I asked "What do you mean by 'non UTF8 complaint'?" You cannot store any non-UTF8 character if your database is UTF8 (or AL32UTF8).
– Wernfried Domscheit, Commented Jun 21, 2017 at 6:43
Sorry, what is your actual problem? It is not clear from your question, which is why you have not received an answer yet. Describe what your actual problem is, not the assumed (and obviously non-working) solution.
– Wernfried Domscheit, Commented Jun 22, 2017 at 5:42

Wernfried Domscheit · Accepted Answer · 2017-07-11 14:53:09Z

2

Oracle (or any other system that supports UTF-8 properly) cannot store invalid UTF-8 characters, so the problem must arise during the migration. Carefully check every setting related to character sets or encodings, including your terminal and editor settings.

The character 😊 (U+1F60A SMILING FACE WITH SMILING EYES) belongs to the Emoticons block, which is in the Supplementary Multilingual Plane. Perhaps your migration tool has a general problem with characters outside the Basic Multilingual Plane, i.e. characters above U+FFFF.

One way to find them would be

SELECT *
FROM ...
WHERE REGEXP_LIKE(<your_column>, UNISTR('[\0001-\FFFF]'));

This condition matches every row that contains at least one character from the Basic Multilingual Plane.
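
The difference between "contains a BMP character" and "contains a character above U+FFFF" is easy to get wrong, so here is a small Python analogue, given only as an illustration. It is not Oracle's regex engine, and the sample strings and pattern names are assumptions. A positive character class matches as soon as any single character falls inside it, so almost every non-empty row matches; finding supplementary characters needs a class that covers the range above U+FFFF.

```python
import re

text = "This will be my first time visiting Seattle. \U0001F60A"
plain = "No emoji here."

# Matches if the string has at least one character in U+0001..U+FFFF.
any_bmp = re.compile(r"[\u0001-\uffff]")
# Matches if the string has at least one character in U+10000..U+10FFFF.
non_bmp = re.compile("[\U00010000-\U0010FFFF]")

print(bool(any_bmp.search(text)), bool(any_bmp.search(plain)))
print(bool(non_bmp.search(text)), bool(non_bmp.search(plain)))
```

In this sketch only the second pattern separates the two strings. The first one matches both because each contains ordinary letters.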

You can also try like this:

SELECT 
    REGEXP_SUBSTR('This will be my first time visiting Seattle. 😊', UNISTR('[\FFFF-\DBFF\DFFF]'))
FROM dual;

REGEXP_SUBSTR('THISWILLBEMYFIRSTTIMEVISITINGSEATTLE.',UNISTR('[\FFFF-\DBFF\DFFF]
--------------------------------------------------------------------------------
😊                                                                                    
1 row selected.

Update

I checked again.

😊 U+1F60A SMILING FACE WITH SMILING EYES
Can be written as UNISTR('\D83D\DE0A')
Encoded as UTF-8 (Oracle Character Set AL32UTF8): F0 9F 98 8A
Encoded as CESU-8 (Oracle Character Set UTF8): ED A0 BD ED B8 8A

Your error message says: "invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd"

ED A0 BD is a CESU-8 sequence. Apparently your export from Oracle is delivered as CESU-8 rather than UTF-8. Check your settings again.
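
The byte values above can be reproduced with a short Python sketch. This is an illustration under assumptions, not something the migration tool does: the function names `to_cesu8` and `cesu8_to_utf8` are hypothetical, and the code relies on Python's `surrogatepass` error handler to read and write lone surrogates. It encodes U+1F60A both ways and shows how CESU-8 bytes could be re-encoded as standard UTF-8 by re-pairing the surrogates.

```python
def to_cesu8(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit becomes its own UTF-8-style sequence."""
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def cesu8_to_utf8(data: bytes) -> bytes:
    """Re-encode CESU-8 bytes as standard UTF-8 by joining surrogate pairs."""
    # Step 1: accept the 3-byte surrogate sequences that strict UTF-8 rejects.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: a round trip through UTF-16 merges each high/low pair into one code point.
    text = with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
    return text.encode("utf-8")


smiley = "\U0001F60A"
print(smiley.encode("utf-8").hex())           # standard UTF-8
print(to_cesu8(smiley).hex())                 # CESU-8
print(cesu8_to_utf8(to_cesu8(smiley)).hex())  # back to standard UTF-8
```

A conversion like this would only make sense as a diagnostic or as a pre-processing step on exported data. Fixing the character set configuration of the export remains the cleaner route.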

Update 2

To replace supplementary characters in existing data, you can try this:

UPDATE FDRGIIT.CS_ACTIONS
SET CS_COMMENTS = REGEXP_REPLACE(CS_COMMENTS, UNISTR('[\FFFF-\DBFF\DFFF]'), UNISTR('\00BF'));

or

UPDATE FDRGIIT.CS_ACTIONS
SET CS_COMMENTS = REGEXP_REPLACE(CS_COMMENTS, UNISTR('[\FFFF-\DBFF\DFFF]'));

UNISTR('\00BF') is the placeholder (¿) used by Oracle for invalid characters. UNISTR('\FFFD') -> (�) could also be suitable.
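
For comparison, the same replacement can be sketched in Python, for example to clean data after it has been exported. This is an assumption-laden illustration rather than a translation of the Oracle statements: the function name `replace_supplementary` is made up, and Python strings hold supplementary characters as single code points, so the pattern targets U+10000 and above directly instead of surrogate halves.

```python
import re

# Any character outside the Basic Multilingual Plane.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")


def replace_supplementary(text: str, placeholder: str = "\u00bf") -> str:
    """Replace each supplementary character with a placeholder (default: inverted question mark)."""
    return SUPPLEMENTARY.sub(placeholder, text)


sample = "This will be my first time visiting Seattle. \U0001F60A"
print(replace_supplementary(sample))            # placeholder U+00BF
print(replace_supplementary(sample, "\ufffd"))  # placeholder U+FFFD
print(replace_supplementary(sample, ""))        # characters removed
```

Either variant loses the original emoji, so it is worth keeping a copy of the untouched data before applying it.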

edited Jul 11, 2017 at 14:53

answered Jul 10, 2017 at 8:14

Wernfried Domscheit


8 Comments

user2068804 Over a year ago

I tried executing the above query, SELECT * FROM CS_ACTIONS WHERE REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0000-\FFFF]'));, but got the following output:

ORA-12726: unmatched bracket in regular expression
12726. 00000 -  "unmatched bracket in regular expression"
*Cause:    The regular expression did not have balanced brackets.
*Action:   Ensure the brackets are correctly balanced.

Kindly assist

Wernfried Domscheit Over a year ago

Try UNISTR('[\0001-\FFFF]'). \0000 seems to have a special meaning.

user2068804 Over a year ago

Tried this: SELECT * FROM CS_ACTIONS WHERE REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0001-\FFFF]'));, but I am getting the whole table contents as output. Please let me know if I am doing something wrong.

Wernfried Domscheit Over a year ago

Try WHERE NOT REGEXP_LIKE(... to get the opposite.

user2068804 Over a year ago

tried this SELECT * FROM CS_ACTIONS WHERE NOT REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0001-\FFFF]')); Got 0 rows.
