How to convert a UTF-8 byte array to a string of a given length

Question

Let's say I have an array of bytes:

var myArr = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87 };

It has 6 elements, but it is the UTF-8 encoding of abąć, which has 4 letters. Typically you do

Encoding.UTF8.GetString(myArr);

to convert it to a string. But let's assume that myArr is actually bigger (there are more bytes at the end), and I know before the conversion that I only want the first 4 letters. How can I efficiently convert this array to a string? It would also be preferable to get the index of the last byte read from myArr (the byte corresponding to the end of the converted string).

Example:

// 3 more bytes at the end of formerly defined myArr
var myArr = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87, 0x01, 0x02, 0x03 };
var str = MyConvert(myArr, 4); // read 4 utf8 letters
// str is "abąć"
// possibly I want to know that MyConvert stopped at index 6 in myArr

The resulting string str should have str.Length == 4.

"How can efficiently convert this array to the string?" - by calling Encoding.UTF8.GetString(myArr), regarding code length it doesn't get any more efficient than that. What's your question? What do you mean by the last sentence?
– CodeCaster, Commented Nov 17, 2017 at 14:07
I hope my edit clarified that then. So that's a hard problem, because you only find out how many characters a byte array produces while you're decoding it. You can encounter a multibyte character, a surrogate pair, and so on. (How) do you want to handle zero-length characters?
– CodeCaster, Commented Nov 17, 2017 at 14:13
So you don't know how many bytes to decode, just the length of the resulting string? Then I think you have to decode the byte array yourself...
– Michael, Commented Nov 17, 2017 at 14:13
How about taking the first 16 bytes, converting them, and then taking the first 4 chars from the result?
– DavidG, Commented Nov 17, 2017 at 14:13
To check: do you want up to 4 char values (UTF-16 code units) or up to 4 Unicode code points? Suppose the byte array is entirely made up of surrogate pairs - do you want 8 chars or 4 in that case?
– Jon Skeet, Commented Nov 17, 2017 at 14:28
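
To make the distinction in the last comment concrete, here is a minimal sketch (the class name LengthDemo is made up for illustration). It decodes a single code point above 0xFFFF and shows that string.Length counts UTF-16 code units, not code points:

```csharp
using System;
using System.Text;

public static class LengthDemo
{
    public static void Main()
    {
        // U+1F600 takes four bytes in UTF-8.
        var bytes = new byte[] { 0xF0, 0x9F, 0x98, 0x80 };
        string s = Encoding.UTF8.GetString(bytes);

        // One code point, but two UTF-16 code units (a surrogate pair).
        Console.WriteLine(s.Length);                              // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));            // True
        Console.WriteLine(char.ConvertToUtf32(s, 0) == 0x1F600);  // True
    }
}
```

So "4 letters" can mean a string whose Length is anywhere from 4 to 8, depending on which definition is intended.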

Jon Skeet · Accepted Answer · 2017-11-17 14:29:59Z

3

It looks like Decoder has your back here, in particular with the somewhat huge Convert method. I think you'd want:

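// bytes is the input array (myArr in the question).
var decoder = Encoding.UTF8.GetDecoder();
var chars = new char[4]; // room for at most 4 UTF-16 code units
decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
    true, out int bytesUsed, out int charsUsed, out bool completed);
// bytesUsed: bytes consumed from bytes; charsUsed: chars written to chars.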

Complete sample using the data in your question:

using System;
using System.Text;

public class Test
{
    static void Main()
    {
        var bytes = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87, 0x01, 0x02, 0x03 };
        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[4];
        decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
            true, out int bytesUsed, out int charsUsed, out bool completed);
        Console.WriteLine($"Completed: {completed}");
        Console.WriteLine($"Bytes used: {bytesUsed}");
        Console.WriteLine($"Chars used: {charsUsed}");
        Console.WriteLine($"Text: {new string(chars, 0, charsUsed)}");
    }
}
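
If it helps, the same call can be wrapped in the shape the question asks for. This is only a sketch: the class name Utf8Prefix and the extra out parameter are my own choices, not part of any library, and it counts UTF-16 code units (char values), as the code above does.

```csharp
using System;
using System.Text;

public static class Utf8Prefix
{
    // Hypothetical helper: decodes at most maxChars UTF-16 code units from
    // the start of bytes and reports how many bytes were consumed.
    public static string MyConvert(byte[] bytes, int maxChars, out int bytesUsed)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[maxChars];
        decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
            true, out bytesUsed, out int charsUsed, out bool _);
        return new string(chars, 0, charsUsed);
    }

    public static void Main()
    {
        var myArr = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87, 0x01, 0x02, 0x03 };
        string str = MyConvert(myArr, 4, out int bytesUsed);
        Console.WriteLine(str.Length); // 4
        Console.WriteLine(bytesUsed);  // 6, the index of the first byte not read
    }
}
```

One caveat, as far as I understand the documentation: Convert throws an ArgumentException when the output buffer cannot hold any converted input, so a maxChars of 1 is not safe if the first character might need a surrogate pair.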

edited Nov 17, 2017 at 14:29

answered Nov 17, 2017 at 14:19


3 Comments

CodeCaster Over a year ago

Can a char[4] contain all possibilities for four code points representable by UTF-8? Code points over 0xFFFF will use two chars each. I don't know what the OP wants exactly, nor whether they want to support those.

Jon Skeet Over a year ago

@CodeCaster: Ah, I'd assumed the OP wanted four UTF-16 code units. Good point - will ask for clarification.

freakish Over a year ago

That works like a charm! And I additionally learned that char is actually 2 bytes in C#. Great, thanks.
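
If code points rather than char values are what is wanted, one possible approach (not the one from the answer) is to scan the UTF-8 lead bytes to find where the first N code points end, then decode just that prefix. The sketch below assumes the input is well-formed UTF-8 and does no validation; Utf8CodePoints and PrefixByteCount are made-up names.

```csharp
using System;
using System.Text;

public static class Utf8CodePoints
{
    // Hypothetical helper: returns the number of bytes occupied by the first
    // maxCodePoints code points. Assumes bytes holds well-formed UTF-8, so
    // every position visited is treated as a lead byte.
    public static int PrefixByteCount(byte[] bytes, int maxCodePoints)
    {
        int i = 0;
        for (int n = 0; n < maxCodePoints && i < bytes.Length; n++)
        {
            byte b = bytes[i];
            int len = b < 0x80 ? 1   // 0xxxxxxx: ASCII
                    : b < 0xE0 ? 2   // 110xxxxx: two-byte sequence
                    : b < 0xF0 ? 3   // 1110xxxx: three-byte sequence
                    : 4;             // 11110xxx: four-byte sequence
            i += len;
        }
        // Clamp in case the array ends in the middle of a sequence.
        return Math.Min(i, bytes.Length);
    }

    public static void Main()
    {
        // "a", U+1F600 (four bytes), "b", then trailing bytes to ignore.
        var bytes = new byte[] { 0x61, 0xF0, 0x9F, 0x98, 0x80, 0x62, 0x01, 0x02 };
        int byteCount = PrefixByteCount(bytes, 3);
        string s = Encoding.UTF8.GetString(bytes, 0, byteCount);
        Console.WriteLine(byteCount); // 6
        Console.WriteLine(s.Length);  // 4: three code points, one a surrogate pair
    }
}
```

Note that the resulting string's Length can then exceed the requested count, which is exactly the ambiguity raised in the comments.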
