
Let's say I have an array of bytes:

var myArr = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87 };

It has 6 elements, but it is the UTF-8 encoding of abąć, which has 4 letters. Typically you would call

Encoding.UTF8.GetString(myArr);

to convert it to a string. But let's assume that myArr is actually bigger (there are more bytes at the end), and that I know before converting that I only want the first 4 letters. How can I efficiently convert this array to a string? It would also be preferable to get the index of the last byte in myArr that was read (the byte corresponding to the end of the converted string).

Example:

// 3 more bytes at the end of the previously defined myArr
var myArr = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87, 0x01, 0x02, 0x03 };
var str = MyConvert(myArr, 4); // read 4 utf8 letters
// str is "abąć"
// possibly I want to know that MyConvert stopped at index 6 in myArr

The resulting string str should have str.Length == 4.

  • "How can efficiently convert this array to the string?" - by calling Encoding.UTF8.GetString(myArr), regarding code length it doesn't get any more efficient than that. What's your question? What do you mean by the last sentence? Commented Nov 17, 2017 at 14:07
  • I hope my edit clarified that then. So that's a hard problem, because you only find out how many characters a byte array decodes to while you are decoding it. You can encounter a multibyte character, a surrogate pair, and so on. (How) do you want to handle zero-length characters? Commented Nov 17, 2017 at 14:13
  • So you don't know how many bytes to decode, just the length of the resulting string? Then I think you have to decode the byte array yourself... Commented Nov 17, 2017 at 14:13
  • How about taking the first 16 bytes, converting that, and then taking the first 4 chars from the result? Commented Nov 17, 2017 at 14:13
  • To check: do you want up to 4 char values (UTF-16 code units) or up to 4 Unicode code points? Suppose the byte array is entirely made up of surrogate pairs - do you want 8 chars or 4 in that case? Commented Nov 17, 2017 at 14:28

1 Answer


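It looks like Decoder has your back here, in particular with its rather large Convert method, which decodes until either the input is exhausted or the output buffer is full, and reports how many bytes and chars it used. I think you'd want: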

var decoder = Encoding.UTF8.GetDecoder();
var chars = new char[4];
decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
    true, out int bytesUsed, out int charsUsed, out bool completed);

Complete sample using the data in your question:

using System;
using System.Text;

public class Test
{
    static void Main()
    {
        var bytes = new byte[] { 0x61, 0x62, 0xc4, 0x85, 0xc4, 0x87, 0x01, 0x02, 0x03 };
        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[4];
        decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
            true, out int bytesUsed, out int charsUsed, out bool completed);
        Console.WriteLine($"Completed: {completed}");
        Console.WriteLine($"Bytes used: {bytesUsed}");
        Console.WriteLine($"Chars used: {charsUsed}");
        Console.WriteLine($"Text: {new string(chars, 0, charsUsed)}");
    }
}
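To match the MyConvert(myArr, 4) call shape from the question, the same Decoder.Convert call could be wrapped as below. This is only a sketch: the class name Utf8Prefix and the out parameter bytesUsed are made up for illustration, and "4 letters" is taken to mean 4 char values (UTF-16 code units). If a character needing a surrogate pair straddles the limit, the result may be shorter than maxChars.

```csharp
using System;
using System.Text;

public static class Utf8Prefix
{
    // Hypothetical helper (not a framework API): decodes at most maxChars
    // UTF-16 code units from the start of bytes and reports in bytesUsed
    // how many bytes were consumed, i.e. the index of the first unread byte.
    public static string MyConvert(byte[] bytes, int maxChars, out int bytesUsed)
    {
        bytesUsed = 0;
        if (maxChars <= 0 || bytes.Length == 0)
        {
            return string.Empty;
        }

        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[maxChars];

        // Convert stops when the output buffer is full, so trailing bytes
        // beyond the requested prefix are never decoded.
        decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
            true, out bytesUsed, out int charsUsed, out bool completed);

        return new string(chars, 0, charsUsed);
    }
}
```

Called as Utf8Prefix.MyConvert(myArr, 4, out int stoppedAt) on the 9-byte array from the question, this should give "abąć" with stoppedAt equal to 6.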

3 Comments

Can a char[4] contain all possibilities for four code points representable by UTF-8? Code points over 0xFFFF take two chars each. I don't know exactly what the OP wants, nor whether they want to support those.
@CodeCaster: Ah, I'd assumed the OP wanted four UTF-16 code units. Good point - will ask for clarification.
That works like a charm! And I additionally learned that char is actually 2 bytes in C#. Great, thanks.
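
On the code point question raised in these comments: if the limit were meant as Unicode code points rather than char values, one possible approach is to count UTF-8 lead bytes to find the cut-off and then decode just that prefix. The sketch below is not from the answer; the names Utf8CodePoints and TakeCodePoints are hypothetical, and it assumes the input is well-formed UTF-8 (no validation is done).

```csharp
using System;
using System.Text;

public static class Utf8CodePoints
{
    // Hypothetical helper: returns the first maxCodePoints code points of a
    // UTF-8 buffer and, in bytesUsed, the number of bytes they occupy.
    // Assumption: bytes holds well-formed UTF-8.
    public static string TakeCodePoints(byte[] bytes, int maxCodePoints, out int bytesUsed)
    {
        int seen = 0;
        int i = 0;
        while (i < bytes.Length)
        {
            // Continuation bytes have the form 10xxxxxx; any other byte
            // starts a new code point.
            bool startsCodePoint = (bytes[i] & 0xC0) != 0x80;
            if (startsCodePoint)
            {
                if (seen == maxCodePoints)
                {
                    break; // the next code point would exceed the limit
                }
                seen++;
            }
            i++;
        }

        bytesUsed = i;
        return Encoding.UTF8.GetString(bytes, 0, bytesUsed);
    }
}
```

With this variant the returned string's Length can exceed maxCodePoints, because each code point above 0xFFFF becomes a surrogate pair of two char values.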
