I'm generating a random 20-byte array and want to convert it to a string, so I can use it as a random token for an API call.
However, when I convert it back to a byte array (just for testing), I get a different array than the original.
Here is my code:
var rand = new Random();

string generateId()
{
    byte[] bytes_buff = new byte[20];
    rand.NextBytes(bytes_buff);

    // Print the original random bytes
    foreach (byte b in bytes_buff)
        Console.Write("{0, 5}", b);
    Console.WriteLine();

    // Decode the bytes as UTF-8 text
    string converted = System.Text.Encoding.UTF8.GetString(bytes_buff);
    foreach (char character in converted)
        Console.Write("{0, 5}", character);
    Console.WriteLine();

    // Re-encode the string back into bytes
    byte[] recoded = System.Text.Encoding.UTF8.GetBytes(converted);
    foreach (byte b in recoded)
        Console.Write("{0, 5}", b);
    Console.WriteLine();

    return converted;
}
And it produces this output:
162 108 161 7 212 200 169 171 205 89 240 122 194 173 223 253 57 148 125 76
? l ? ? ? ? ? Y ? z - ? ? 9 ? } L
239 191 189 108 239 191 189 7 239 191 189 200 169 239 191 189 239 191 189 89 239 191 189 122 194 173 239 191 189 239 191 189 57 239 191 189 125 76
I've noticed that for larger values (bigger than 127) GetString() produces the "?" character, and GetBytes() converts each "?" back into the three bytes 239 191 189.
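For instance, this small check (an illustrative snippet using the same System.Text.Encoding.UTF8 calls, not part of my program) reproduces the effect with a single byte:

byte[] invalid = { 162 };                        // 0xA2 is not valid UTF-8 on its own
string s = System.Text.Encoding.UTF8.GetString(invalid);
Console.WriteLine((int)s[0]);                    // 65533, i.e. U+FFFD, which prints as "?"
byte[] roundTrip = System.Text.Encoding.UTF8.GetBytes(s);
Console.WriteLine(string.Join(" ", roundTrip));  // 239 191 189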
From this post I've learned that UTF-8 is not a one-to-one mapping between bytes and characters, but then how are we supposed to generate tokens as strings and send them across the internet?
Isn't UTF-8 the standard encoding on the internet?
Also, if we can't use the full 0-255 range for every character in a token, what is the actual range of usable characters (a-z, A-Z, 0-9, etc.)?
Any explanation is appreciated. Thanks in advance!
Random bytes are usually not valid UTF-8: values above 127 only have meaning as part of a specific multi-byte sequence, so a stray byte such as 162 is malformed input. By default, Encoding.UTF8.GetString() "fixes" such input by substituting the Unicode replacement character U+FFFD (displayed as "?"), and U+FFFD itself encodes to the three bytes 239 191 189. That's why it's different when you convert it back to a byte sequence.
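The usual solution is to never decode random bytes as text. Instead, encode them with a byte-to-text scheme such as Base64 (or hex), which maps every possible byte value onto a fixed alphabet of safe ASCII characters (for Base64: A-Z, a-z, 0-9, '+' and '/', with '=' padding), so the round trip is lossless. A minimal sketch of that approach (it assumes .NET 6+ for the static RandomNumberGenerator.GetBytes; on older versions, fill a buffer with RandomNumberGenerator.Create().GetBytes(buffer) instead):

using System;
using System.Security.Cryptography;

class TokenExample
{
    static string GenerateId()
    {
        // Cryptographically strong randomness is preferable for API tokens
        byte[] buffer = RandomNumberGenerator.GetBytes(20);

        // Base64 turns arbitrary bytes into safe ASCII, losslessly
        return Convert.ToBase64String(buffer);
    }

    static void Main()
    {
        string token = GenerateId();
        Console.WriteLine(token);               // e.g. "omyhB9TIq6vNWfB6wq3f/TmUfUw="

        // Round trip: decoding recovers exactly the original 20 bytes
        byte[] recovered = Convert.FromBase64String(token);
        Console.WriteLine(recovered.Length);    // 20
    }
}

If the token has to live in a URL, note that Base64's '+' and '/' need escaping; hex encoding (Convert.ToHexString in .NET 5+) avoids that at the cost of a longer string.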