I am rephrasing my question here. I am using the AWS DMS tool to migrate from Oracle to PostgreSQL. The source (Oracle) character set is AL32UTF8 and the target (PostgreSQL) character set is UTF8.

So at the source I have a column with datatype varchar2(4000), where I have something stored like this:

This will be my first time visiting Seattle. 😊

When I am trying to migrate this, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd

There is a way in DMS to skip this, but the problem is that I have to run DMS every time, wait for it to report the invalid byte sequence error, and then get past it. So far I have collected these:

0xed 0xa4 0x88
0xed 0xbd 0x95
0xed 0xa9 0x8e
0xed 0xbc 0xb8
0xed 0xaa 0xbe
0xed 0xba 0xb5
0xed 0xaf 0x83
0xed 0xb5 0xaa
0xed 0xa0 0xbc
0xed 0xbc 0x9f
0xed 0xa0 0xbd
0xed 0xb8 0xa0
0xed 0xbe 0x88
0xed 0xb1 0x8e
0xed 0xb1 0x8e
0xed 0xb1 0x8d
0xed 0xb3 0x99
0xed 0xb1 0x9f
0xed 0xbe 0xa7
0xed 0xb1 0x8c
0xed 0xa0 0xbe
0xed 0xb4 0x96
0xed 0xba 0x80
0xed 0xb4 0xb1
0xed 0xb0 0xa7
0xed 0xbe 0xb8
0xed 0xbe 0xb5
0xed 0xb7 0xbb
0xed 0xb1 0x86
0xed 0xbe 0xb6
0xed 0xbf 0x8a
0xed 0xb0 0xab
0xed 0xb0 0x95
0xed 0xb0 0x94
0xed 0xb0 0x99
0xed 0xb0 0xb1
0xed 0xbf 0x84
0xed 0xba 0x82
0xed 0xb4 0xa8
0xed 0xb0 0xaf
0xed 0xb0 0xb8
0xed 0xb3 0x9e
0xed 0xb4 0xa7
0xed 0xbe 0x81
0xed 0xb1 0x87

From one of the forum posts here, I got the following query:

 select CASE
            INSTR (
                  RAWTOHEX (
                      utl_raw.cast_to_raw (
                          utl_i18n.raw_to_char (
                                utl_raw.cast_to_raw ( <your_column> )
                              , 'utf8'
                          )
                      )
                  )
                , 'EFBFBD'
            )
        WHEN 0 THEN 'OK'
        ELSE 'FAIL' 
        END
   from <your_table>
      ;

Is it possible to modify the above query to come up with a regular expression that checks for all of those illegal UTF8 encodings?

Additionally, I was able to do the migration successfully after changing the client_encoding to LATIN1, but I was getting this on the PG end:

This will be my first time visiting Seattle. э НэИ

Please review and comment

  • Sorry, I didn't understand your question. Do you need something like this? SELECT * FROM (select asciistr(convert(table_name, 'UTF8')) AS str FROM table_ex) Commented Jun 7, 2017 at 6:14
  • What do you mean by "non UTF8 complaint"? If your database character set is AL32UTF8 then all characters are UTF-8, otherwise Oracle automatically replaces them by ¿. Commented Jun 7, 2017 at 7:35
  • I did not ask you to change the character set of the database. I asked "What do you mean by 'non UTF8 complaint'?" You cannot store any non-UTF8 character if your database is UTF8 (or AL32UTF8). Commented Jun 21, 2017 at 6:43
  • Sorry, what is your actual problem? It is not clear from your question, which is why you have not received an answer yet. Describe what your actual problem is, not the assumed (and obviously non-working) solution. Commented Jun 22, 2017 at 5:42
  • Where and how do you get this error? Commented Jul 8, 2017 at 5:15

1 Answer

Oracle (or any other system that supports UTF-8 properly) cannot store invalid UTF-8 characters, so the problem must arise during the migration. Carefully check every setting related to character sets and encoding, including your terminal settings and editors.

The character 😊 (U+1F60A SMILING FACE WITH SMILING EYES) belongs to the Emoticons block, which is in the Supplementary Multilingual Plane. Perhaps your migration tool has a general problem with characters outside the Basic Multilingual Plane, i.e. characters above U+FFFF.

One way to find them would be

SELECT *
FROM ...
WHERE REGEXP_LIKE(<your_column>, UNISTR('[\0001-\FFFF]'));

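This condition is true for any row that contains at least one character from the Basic Multilingual Plane, so on its own it matches nearly every row rather than isolating the supplementary characters.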

You can also try like this:

SELECT 
    REGEXP_SUBSTR('This will be my first time visiting Seattle. 😊', UNISTR('[\FFFF-\DBFF\DFFF]'))
FROM dual;

REGEXP_SUBSTR('THISWILLBEMYFIRSTTIMEVISITINGSEATTLE.',UNISTR('[\FFFF-\DBFF\DFFF]
--------------------------------------------------------------------------------
😊                                                                                    
1 row selected.
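
To cross-check the same idea outside the database, here is a minimal Python sketch. Python and the helper name supplementary_chars are my own choices for illustration, not something from the question; the function simply lists every character above U+FFFF in a string.

```python
def supplementary_chars(text):
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "This will be my first time visiting Seattle. \U0001F60A"

# Print each supplementary character as a U+XXXX code point.
print([f"U+{ord(ch):04X}" for ch in supplementary_chars(sample)])  # ['U+1F60A']
```

A row whose text yields an empty list here contains only BMP characters and should not trigger the surrogate-related error.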

Update

I checked again.

  • 😊 U+1F60A SMILING FACE WITH SMILING EYES
  • Can be written as UNISTR('\D83D\DE0A')
  • Encoded as UTF-8 (Oracle Character Set AL32UTF8): F0 9F 98 8A
  • Encoded as CESU-8 (Oracle Character Set UTF8): ED A0 BD ED B8 8A
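
The byte values in this list can be reproduced with a short Python sketch. As far as I know Python has no built-in CESU-8 codec, so the hypothetical helper cesu8_encode below builds the bytes by hand: it splits the code point into a UTF-16 surrogate pair and encodes each surrogate as its own 3-byte sequence.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def cesu8_encode(char):
    """Encode one supplementary character as CESU-8 (hypothetical helper)."""
    high, low = to_surrogate_pair(ord(char))
    # 'surrogatepass' lets Python emit the 3-byte form of a lone surrogate.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))


smiley = "\U0001F60A"
high, low = to_surrogate_pair(ord(smiley))
print(f"{high:04X} {low:04X}")        # D83D DE0A
print(smiley.encode("utf-8").hex())   # f09f988a
print(cesu8_encode(smiley).hex())     # eda0bdedb88a
```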

Your error message says: "invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd"

ED A0 BD is a CESU-8 sequence. Apparently your export from Oracle is delivered as CESU-8 rather than UTF-8. Check your settings again.
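
If the setting cannot be changed, one possible workaround is to repair the bytes before they reach PostgreSQL. The Python sketch below assumes you can get at the raw bytes yourself (I don't know whether DMS offers a hook for that), and the function name cesu8_to_utf8 is hypothetical.

```python
def cesu8_to_utf8(data):
    """Convert CESU-8 bytes to standard UTF-8 (a sketch, not a full validator)."""
    # Step 1: decode, allowing the 3-byte surrogate sequences that strict
    # UTF-8 rejects; this yields a string containing surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each high/low surrogate pair is
    # merged into one supplementary character. An unpaired surrogate makes
    # the strict decode raise UnicodeDecodeError.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    # Step 3: encode as regular UTF-8.
    return text.encode("utf-8")


broken = b"Seattle. " + bytes.fromhex("eda0bdedb88a")
print(cesu8_to_utf8(broken).hex())
```

Input that is already valid UTF-8 passes through unchanged, so the function can be applied to every value rather than only to the failing ones.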

Update 2

To replace supplementary characters in existing data, you can try this:

UPDATE FDRGIIT.CS_ACTIONS
SET CS_COMMENTS = REGEXP_REPLACE(CS_COMMENTS, UNISTR('[\FFFF-\DBFF\DFFF]'), UNISTR('\00BF'));

or

UPDATE FDRGIIT.CS_ACTIONS
SET CS_COMMENTS = REGEXP_REPLACE(CS_COMMENTS, UNISTR('[\FFFF-\DBFF\DFFF]'));

UNISTR('\00BF') is the placeholder (¿) used by Oracle for invalid characters. UNISTR('\FFFD') (�, the Unicode replacement character) could also be suitable.
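
For comparison, the same replacement can be sketched in Python, for example to clean an exported file instead of updating the source table. This is only an illustration; the function name replace_supplementary and the default placeholder are my assumptions.

```python
import re

# Matches any single character outside the Basic Multilingual Plane.
SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")


def replace_supplementary(text, placeholder="\u00BF"):
    """Replace every character above U+FFFF with placeholder (default: ¿)."""
    return SUPPLEMENTARY.sub(placeholder, text)


sample = "This will be my first time visiting Seattle. \U0001F60A"
print(replace_supplementary(sample))      # emoji replaced by ¿
print(replace_supplementary(sample, ""))  # emoji removed
```

Passing an empty placeholder corresponds to the second UPDATE, which deletes the characters instead of substituting them.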

8 Comments

I tried executing the above query, SELECT * FROM CS_ACTIONS WHERE REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0000-\FFFF]'));, but got the following output: ORA-12726: unmatched bracket in regular expression 12726. 00000 - "unmatched bracket in regular expression" *Cause: The regular expression did not have balanced brackets. *Action: Ensure the brackets are correctly balanced. Kindly assist.
Try UNISTR('[\0001-\FFFF]'). \0000 seems to have a special meaning.
I tried this: SELECT * FROM CS_ACTIONS WHERE REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0001-\FFFF]'));, but I am getting the whole table contents as output. Please let me know if I am doing something wrong.
Try WHERE NOT REGEXP_LIKE(... to get the opposite.
I tried this: SELECT * FROM CS_ACTIONS WHERE NOT REGEXP_LIKE(CS_COMMENTS, UNISTR('[\0001-\FFFF]')); and got 0 rows.