I have been looking for examples of a custom encoding in .NET. Say I want to implement roman8 encoding/decoding:

https://www.compart.com/en/unicode/charsets/hp-roman8

How do I do that in .NET? In a nutshell, I know that we need to inherit from the System.Text.Encoding class and implement our own encoding/decoding methods, but without examples it looks complicated. There is one example from Jon Skeet, but in my opinion it is too old to follow:

https://stackoverflow.com/a/5536825/7340823

Any help would be appreciated. Thanks!

  • Encoding is abstract, so the compiler will tell you exactly what you need to implement. What you have to implement is generally just the methods that convert bytes to their Char equivalents and vice versa; a simple Char[256] array could be all you need. Commented Jan 15, 2020 at 14:14

1 Answer

Now that .NET is open source, you can view the source code of the encodings included in the framework.

It looks like the Unicode implementations use interop to call native code for the actual work, but a few are fully implemented in C#, such as ISCIIEncoding.

Here is the source: https://referencesource.microsoft.com/#mscorlib/system/text/isciiencoding.cs


To create an implementation for a new encoding, you need to subclass System.Text.Encoding and implement the following methods. I'm assuming you're using a simple 1:1 encoding like roman8, where every byte maps to exactly one char; if not, things will be a bit more complicated!

GetByteCount() and GetCharCount() both return the number of bytes/chars the input will produce. In this case we can just return the count that was passed in, since every char produces one byte and vice versa.

GetMaxByteCount() and GetMaxCharCount() are similar, but return the theoretical maximum number of items which could be returned for an input of the given length. Once again, we can just return the same length.
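As a rough sketch of those four counting methods for a 1:1 encoding: the class name SingleByteEncoding is made up for illustration, and it is left abstract so that the two conversion methods can be supplied separately.

```csharp
using System.Text;

// Hypothetical base class (the name is an assumption, not a framework type).
// Only the four counting overrides are filled in; GetBytes and GetChars
// remain abstract for a concrete subclass to implement.
public abstract class SingleByteEncoding : Encoding
{
    // One byte per char, so the result is simply the number of chars requested.
    public override int GetByteCount(char[] chars, int index, int count) => count;

    // One char per byte, so the result is simply the number of bytes requested.
    public override int GetCharCount(byte[] bytes, int index, int count) => count;

    // The worst case equals the exact case for a single-byte encoding.
    public override int GetMaxByteCount(int charCount) => charCount;

    public override int GetMaxCharCount(int byteCount) => byteCount;
}
```

A more defensive version would also validate the arguments (null arrays, negative or out-of-range index and count) before returning.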

To do the actual conversion, the following methods will be called. The base Encoding class takes care of creating the arrays for you; you just need to fill in the output with the correct values.

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (var i = 0; i < charCount; i++) 
        {
            bytes[byteIndex + i] = GetByte(chars[charIndex + i]);
        }
        return charCount;
    }

Here GetByte() is a simple helper of your own (it is not part of Encoding) that looks up the index of the char in your array.
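One possible shape for that lookup, as a sketch only: the ReverseLookup class and its method signatures are assumptions made for illustration. Array.IndexOf on the table would match the description literally; this version swaps in a dictionary built once up front to avoid a linear scan per char, and falls back to '?' for chars the encoding cannot represent (that fallback is also an assumption).

```csharp
using System.Collections.Generic;

// Hypothetical helper: builds the char -> byte reverse map from the same
// 256-entry table that GetChars reads from.
public static class ReverseLookup
{
    public static Dictionary<char, byte> Build(char[] conversionArray)
    {
        var map = new Dictionary<char, byte>();
        for (var i = 0; i < conversionArray.Length && i < 256; i++)
        {
            // If a char appears more than once, keep its first byte value.
            if (!map.ContainsKey(conversionArray[i]))
            {
                map[conversionArray[i]] = (byte)i;
            }
        }
        return map;
    }

    // Returns the byte for c, or '?' (0x3F) when c is not in the table.
    public static byte GetByte(Dictionary<char, byte> map, char c)
    {
        return map.TryGetValue(c, out var b) ? b : (byte)'?';
    }
}
```

In an Encoding subclass the map would typically live in a static readonly field, so that GetByte can take just the char, as in the loop above.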

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (var i = 0; i < byteCount; i++) 
        {
            chars[charIndex + i] = conversionArray[bytes[byteIndex + i]];
        }
        return byteCount;
    }

Populate conversionArray with your characters, each at the index equal to its byte value in the encoding.
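A minimal sketch of filling such a table, assuming the lower half of roman8 is plain ASCII. The Roman8Table class name is made up, and the four upper-half entries are only samples that should be checked against the chart linked in the question; the remaining slots are left as U+FFFD placeholders here.

```csharp
// Hypothetical table builder; its result would be assigned to conversionArray.
public static class Roman8Table
{
    public static char[] Build()
    {
        var table = new char[256];

        // Assumption: bytes 0x00-0x7F map straight to ASCII.
        for (var i = 0; i < 128; i++)
        {
            table[i] = (char)i;
        }

        // Placeholder for upper-half slots that have not been filled in yet.
        for (var i = 128; i < 256; i++)
        {
            table[i] = '\uFFFD';
        }

        // Sample entries only; verify these against the chart and add the rest.
        table[0xA1] = '\u00C0'; // À
        table[0xB4] = '\u00C7'; // Ç
        table[0xB5] = '\u00E7'; // ç
        table[0xC5] = '\u00E9'; // é

        return table;
    }
}
```

The index is the byte value and the element is the Unicode char, which is exactly what the GetChars loop above reads.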

See https://dotnetfiddle.net/eBvgc6 for a working example.

3 Comments

I looked at the non-interop encodings, but it seems like they use pointers instead of storing the bytes/chars in a byte[]? - github.com/microsoft/referencesource/blob/master/mscorlib/…. I don't quite understand how the decoding happens in 433, for example. Where do I define my custom character table? The arrayCharBestFit char array does not seem to be used anywhere in the above link.
@sankar Oh yes, I'm guessing that's because a lot of this code was ported from native code. The code you need to implement is actually a lot simpler than those examples (for roman8 anyway). I've updated my answer to cover which methods need to be overridden.
By the way, Jon Skeet's code in the linked question may be old, but it's still relevant and a very good example of how to do it properly. Hopefully this answer will help explain what's going on, then you can look at the other codebase to see how to create a more complete and robust implementation.
