How should I decode a UTF-8 string

Question

I have a string like:

About \xee\x80\x80John F Kennedy\xee\x80\x81\xe2\x80\x99s Assassination . unsolved mystery \xe2\x80\x93 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...

I understand that \xe2\x80\x93 is a dash character. But how should I decode the above string in C#?

How are you getting the data into your string? All of the C# string input mechanisms (that I can think of) let you specify an encoding then. Is your input data double-encoded?
– Rup, Commented Mar 18, 2014 at 0:19
@Rup: The data is provided to me as input. So there is no way for me to solve this problem on the input side.
– derekhh, Commented Mar 18, 2014 at 0:24
@derekhh we understand that it's provided to you, but from where/what/whom?
– Luc Morin, Commented Mar 18, 2014 at 0:27
Where do you see these \x** sequences anyway? In the debugger?
– Thomas Levesque, Commented Mar 18, 2014 at 0:59
Please, do not include information about a language used in a question title unless it wouldn't make sense without it. Tags serve this purpose.
– Ondrej Janacek, Commented Mar 18, 2014 at 12:06

Guffa · Accepted Answer · 2014-03-18 11:58:08Z

10

If you have a string like that, then the wrong encoding was used when it was decoded in the first place. There is no such thing as a "UTF-8 string": UTF-8 is what you get when text is encoded into binary data (bytes). Once that data is decoded into a string, it is not UTF-8 any more.

You should use the UTF-8 encoding when you create the string from the binary data. Once the string has been created using the wrong encoding, you can't reliably fix it.
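To make the bytes-versus-string distinction concrete, here is a minimal sketch. The class name Utf8Demo is made up for illustration, and ISO-8859-1 is only an assumed example of a "wrong" encoding; the question does not say which one was actually involved.

```csharp
using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // UTF-8 bytes for U+2013 (en dash), i.e. the \xe2\x80\x93 from the question.
        byte[] data = { 0xE2, 0x80, 0x93 };

        // Decoding with the encoding the bytes were written in gives one character.
        string right = Encoding.UTF8.GetString(data);

        // Decoding the same bytes as ISO-8859-1 (assumed here as the wrong
        // encoding) gives one char per byte instead.
        string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(data);

        Console.WriteLine(right == "\u2013"); // True
        Console.WriteLine(right.Length);      // 1
        Console.WriteLine(wrong.Length);      // 3
    }
}
```

The string itself carries no record of which encoding produced it, which is why the choice has to be made at the point where the bytes are turned into text.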

If there is no other alternative, you could try to fix the string by encoding it again using the same wrong encoding that was used to create it, and then decoding the result using the correct encoding. There is, however, no guarantee that this will work for all strings; some characters can simply be lost during the wrong decoding. Example:

// wrong use of encoding, to try to fix wrong decoding
str = Encoding.UTF8.GetString(Encoding.Default.GetBytes(str));
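As a rough illustration of when this round trip works and when it does not, the sketch below uses a hypothetical helper named Repair and names the wrong encoding explicitly instead of relying on Encoding.Default (on .NET Core and later, Encoding.Default is always UTF-8, so the one-liner above would change nothing there). ISO-8859-1 and ASCII are assumed examples, not necessarily the encodings involved in the question.

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
    // Hypothetical helper: re-encode with the encoding that was wrongly used
    // for decoding, then decode the recovered bytes as UTF-8.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        byte[] original = wrongEncoding.GetBytes(garbled);
        return Encoding.UTF8.GetString(original);
    }

    static void Main()
    {
        byte[] data = { 0xE2, 0x80, 0x93 }; // UTF-8 for U+2013 (en dash)

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string garbled = latin1.GetString(data);
        Console.WriteLine(Repair(garbled, latin1) == "\u2013"); // True

        // ASCII decoding replaces every byte above 0x7F with '?',
        // so the original bytes cannot be recovered afterwards.
        string lossy = Encoding.ASCII.GetString(data);
        Console.WriteLine(lossy);                                      // ???
        Console.WriteLine(Repair(lossy, Encoding.ASCII) == "\u2013"); // False
    }
}
```

The second case is the "characters will simply be lost" situation: once the wrong decoder has substituted a replacement character, the round trip has nothing left to work with.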

edited Mar 18, 2014 at 11:58

answered Mar 18, 2014 at 0:26

Guffa


max · Accepted Answer · 2014-03-18 01:05:27Z

Scan the input string character by character and convert each run of \xNN escapes (string to byte[] and back to string using the UTF-8 decoder), leaving all other characters unchanged:

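using System;
using System.Collections.Generic;
using System.Text;

static string Decode(string input)
{
    var sb = new StringBuilder();
    int position = 0;
    // Bytes collected from consecutive \xNN escapes, decoded together as UTF-8.
    var bytes = new List<byte>();
    while(position < input.Length)
    {
        char c = input[position++];
        if(c == '\\')
        {
            if(position < input.Length)
            {
                c = input[position++];
                if(c == 'x' && position <= input.Length - 2)
                {
                    // Convert.ToByte throws if the next two characters are not hex digits.
                    var b = Convert.ToByte(input.Substring(position, 2), 16);
                    position += 2;
                    bytes.Add(b);
                }
                else
                {
                    // Not a \xNN escape: flush pending bytes and keep the text as-is.
                    AppendBytes(sb, bytes);
                    sb.Append('\\');
                    sb.Append(c);
                }
                continue;
            }
        }
        // Ordinary character (or a trailing backslash): flush pending bytes first.
        AppendBytes(sb, bytes);
        sb.Append(c);
    }
    AppendBytes(sb, bytes);
    return sb.ToString();
}

// Decodes the collected bytes as UTF-8, appends the result, and resets the buffer.
private static void AppendBytes(StringBuilder sb, List<byte> bytes)
{
    if(bytes.Count != 0)
    {
        var str = System.Text.Encoding.UTF8.GetString(bytes.ToArray());
        sb.Append(str);
        bytes.Clear();
    }
}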

Output:

About John F Kennedy’s Assassination . unsolved mystery – 45 years later. Over the last decade, a lot of individuals have speculated on conspiracy theories that ...

YakovL · Accepted Answer · 2018-04-11 21:32:40Z

3

In the end I used something like this:

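// Requires: using System; using System.Text; using System.Text.RegularExpressions;
public static string UnescapeHex(string data)
{
    // Regex.Unescape turns each \xNN escape into the char with code NN;
    // casting every char to byte then rebuilds the UTF-8 byte sequence.
    return Encoding.UTF8.GetString(Array.ConvertAll(Regex.Unescape(data).ToCharArray(), c => (byte) c));
}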
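One caveat with the cast to byte: any character above U+00FF that is already in the input is silently truncated, and Regex.Unescape also interprets other escapes such as \n and \t. The sketch below is a possible stricter variant, not part of the original answer; the names HexUnescape and UnescapeHexChecked are made up for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class HexUnescape
{
    // Hypothetical variant of UnescapeHex that rejects characters
    // which do not fit into a single byte instead of truncating them.
    public static string UnescapeHexChecked(string data)
    {
        string unescaped = Regex.Unescape(data);
        var bytes = new byte[unescaped.Length];
        for (int i = 0; i < unescaped.Length; i++)
        {
            char c = unescaped[i];
            if (c > 0xFF)
                throw new ArgumentException(
                    "Character U+" + ((int)c).ToString("X4") + " at index " + i + " is not a single byte.");
            bytes[i] = (byte)c;
        }
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string input = @"Kennedy\xe2\x80\x99s Assassination \xe2\x80\x93 45 years later";
        string expected = "Kennedy\u2019s Assassination \u2013 45 years later";
        Console.WriteLine(UnescapeHexChecked(input) == expected); // True
    }
}
```

Whether throwing is the right reaction depends on the data; if the input can legitimately mix escaped bytes with already-decoded non-ASCII text, a scanner that only touches the \xNN runs is the safer choice.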

edited Apr 11, 2018 at 21:32

YakovL


answered Mar 18, 2014 at 19:04

derekhh
