Problems Converting Between ByteBuffer and String in Java

Question

I'm developing an application in which users can edit a ByteBuffer through a hex editor interface and also edit the corresponding text through a JTextPane. Because the JTextPane requires a String, I need to convert the ByteBuffer to a String before displaying it. During that conversion, however, invalid bytes are replaced by the charset's default replacement character. This squashes the invalid value, so when I convert the String back to a ByteBuffer, each invalid byte is replaced by the byte value of the replacement character. Is there an easy way to retain the byte value of an invalid character in a String? I've read the following Stack Overflow posts, but people usually just want to replace unprintable characters; I need to preserve them.

Java ByteBuffer to String

Java: Converting String to and from ByteBuffer and associated problems

Is there an easy way to do this or do I need to keep track of all the changes that happen in the text editor and apply them to the ByteBuffer?

Here is code demonstrating the problem. The code uses byte[] instead of ByteBuffer but the issue is the same.

        byte[] temp = new byte[16];
        // 0x99 on its own is not a valid UTF-8 byte sequence
        Arrays.fill(temp,(byte)0x99);

        System.out.println(Arrays.toString(temp));
        // Prints [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // -103 == 0x99

        System.out.println(new String(temp));
        // Prints ����������������
        // � is the default char replacement string

        // This takes the byte[], converts it to a string, converts it back to a byte[]
        System.out.println(Arrays.toString(new String(temp).getBytes()));
        // I need this to print [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // However, it prints
        //[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
        // Each group -17, -65, -67 (0xEF 0xBF 0xBD) is the UTF-8 encoding of �

I think this needs code. Sounds like a bug. It could also be a conceptual error: which exact text sequence(s) are you having trouble converting to bytes?
– markspace, Commented Oct 2, 2016 at 20:10
I've updated the question to include code showing the issue. This isn't a bug in my code; it's supposed to work this way by default.
– Justin Moore, Commented Oct 2, 2016 at 20:28

Joop Eggen · Accepted Answer · 2016-10-02 21:03:09Z

1

UTF-8 in particular will go wrong:

    byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};
    String s = new String(bytes, StandardCharsets.UTF_8);
    System.out.println("s: " + s);

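One needs a CharsetDecoder. With it, one can ignore (that is, delete) or replace the offending bytes, or, by default, have them reported: the one-argument decode(ByteBuffer) throws an exception, while the three-argument form returns an error CoderResult.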
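As a rough illustration of those three behaviours, here is a minimal sketch using the standard CodingErrorAction constants. The class name DecoderActionsDemo and its decode helper are made up for this example; only the java.nio.charset calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class DecoderActionsDemo {

    /** Decodes bytes as UTF-8, applying the given action to invalid input. */
    static String decode(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            // The one-argument decode throws when the action is REPORT.
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException ex) {
            return "threw " + ex.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};
        CodingErrorAction[] actions = {
            CodingErrorAction.IGNORE, CodingErrorAction.REPLACE, CodingErrorAction.REPORT
        };
        for (CodingErrorAction action : actions) {
            System.out.println(action + ": " + decode(bytes, action));
        }
    }
}
```

None of the three actions keeps the original byte value, which is why the loop further down handles the error result by hand.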

For the JTextPane we use HTML, so we can write the hex code of the offending byte in a <span> giving it a red background.

    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Worst case: every byte is invalid and expands to a 64-char <span>.
    CharBuffer charBuffer = CharBuffer.allocate(bytes.length * 64 + 16);
    charBuffer.append("<html>");
    for (;;) {
        // endOfInput = true, so a truncated sequence at the end is reported too.
        CoderResult result = decoder.decode(byteBuffer, charBuffer, true);
        if (!result.isError()) {
            break; // everything has been decoded
        }
        // The buffer is positioned at the offending byte: emit it as marked-up hex.
        int b = 0xFF & byteBuffer.get();
        charBuffer.append(String.format(
            "<span style='background-color:red; font-weight:bold'> %02X </span>",
            b));
        decoder.reset();
    }
    // flip(), not rewind(): the limit must be set to the end of the written text.
    charBuffer.flip();
    String t = charBuffer.toString();
    System.out.println("t: " + t);

The decoder API is not especially pleasant to work with, but this should be enough to experiment with.

answered Oct 2, 2016 at 21:03

Joop Eggen


4 Comments

Justin Moore Over a year ago

That's a really good idea that I hadn't even considered. The only problem I see with this is there's going to be a ton of additional markup residing in the text of the JTextPane when I convert it back from the String to a byte[]. Do you have any ideas on how to get around that?

Joop Eggen Over a year ago

Use replaceAll("<[^>]*>", ""), or better, a loop with a Pattern and Matcher.
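A sketch of what such a Pattern/Matcher loop might look like, assuming the text handed back is still in the exact format produced by the decoding loop in the answer (a JTextPane in HTML mode may normalise its markup, so this is an assumption). The class MarkupToBytes and its methods are hypothetical names. Text that itself contains '<' would need escaping, which this sketch does not handle.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class MarkupToBytes {

    // Matches the <span> markers written by the decoding loop: one hex byte each.
    private static final Pattern BAD_BYTE =
            Pattern.compile("<span[^>]*> ([0-9A-Fa-f]{2}) </span>");

    static byte[] toBytes(String markup) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = BAD_BYTE.matcher(markup);
        int last = 0;
        while (m.find()) {
            writeText(out, markup.substring(last, m.start()));
            out.write(Integer.parseInt(m.group(1), 16)); // the preserved raw byte
            last = m.end();
        }
        writeText(out, markup.substring(last));
        return out.toByteArray();
    }

    // Ordinary text: drop any remaining tags (such as <html>) and encode as UTF-8.
    private static void writeText(ByteArrayOutputStream out, String text) {
        byte[] encoded = text.replaceAll("<[^>]*>", "").getBytes(StandardCharsets.UTF_8);
        out.write(encoded, 0, encoded.length);
    }
}
```

Treating the marked spans separately from the surrounding text is what lets the invalid bytes come back unchanged, while everything else goes through the normal UTF-8 encoder.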

Joop Eggen Over a year ago

A JTextPane also allows styled text (StyledDocument), with attributes kept separate from the text, but that is cumbersome, especially if you want to allow editing. You could, however, use byteBuffer.position() to mark those bytes.
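One way the byteBuffer.position() idea could be sketched: collect the offsets of the offending bytes while decoding, so that a separate structure (for instance, attributes in a StyledDocument) could refer to them later. BadByteOffsets and find are hypothetical names, and how the offsets are then attached to the document is left open.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class BadByteOffsets {

    /** Returns the offsets of bytes that cannot be decoded as UTF-8. */
    static List<Integer> find(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(Math.max(16, bytes.length));
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        List<Integer> offsets = new ArrayList<>();
        for (;;) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isOverflow()) {
                out.clear(); // the decoded chars are not needed here, only the offsets
                continue;
            }
            if (!result.isError()) {
                break; // underflow: all input has been consumed
            }
            // On an error the input position is at the start of the bad sequence.
            offsets.add(in.position());
            in.get(); // skip the offending byte
            decoder.reset();
        }
        return offsets;
    }
}
```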

Justin Moore Over a year ago

I think this approach might be the best fit for this specific project. I was hoping there was something easier I could do, but this will probably have to do. Thanks!

bmargulies · Accepted Answer · 2016-10-02 20:47:24Z

0

What do you think that new String(temp).getBytes() will do for you?

I can tell you that it does something BAD.

1. It converts temp to a String using the default encoding, which is probably wrong, and may lose information.
2. It converts the result back to a byte array, again using the default encoding.

To turn a byte[] into a String, you must always pass a Charset into the String constructor, or else use a decoder directly. Since you are working from buffers, you might find the decoder API congenial.

To turn a String into a byte[], you must always call getBytes(Charset) so that you know that you're using the correct charset.
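A small sketch of both conversions with the charset spelled out. ExplicitCharsetDemo and roundTrips are names invented for this example. It also shows that naming the charset makes the behaviour predictable but does not, by itself, preserve bytes that are invalid in that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class ExplicitCharsetDemo {

    /** Decodes and re-encodes with the same explicit charset and compares the bytes. */
    static boolean roundTrips(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return Arrays.equals(decoded.getBytes(charset), bytes);
    }

    public static void main(String[] args) {
        byte[] validText = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = new byte[16];
        Arrays.fill(invalid, (byte) 0x99); // the array from the question

        System.out.println(roundTrips(validText, StandardCharsets.UTF_8)); // true
        System.out.println(roundTrips(invalid, StandardCharsets.UTF_8));   // false
    }
}
```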

Based on the comments, I now suspect that what you actually need is code something like the following to convert from bytes to hex for your UI (and then something corresponding to convert back).

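String getHexString(byte[] bytes) {
    final char[] hexDigits = "0123456789ABCDEF".toCharArray();
    StringBuilder builder = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
        // Mask after shifting so that negative bytes do not sign-extend.
        builder.append(hexDigits[(b >> 4) & 0x0F]); // high nibble
        builder.append(hexDigits[b & 0x0F]);        // low nibble
    }
    return builder.toString();
}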
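For the way back, a minimal sketch under the assumption that the UI text is exactly two hex digits per byte with no separators, as getHexString is meant to produce. HexToBytes and fromHexString are hypothetical names.

```java
// Hypothetical helper, not part of the original answer.
class HexToBytes {

    /** Parses a string of hex digit pairs (for example "9961FD") into bytes. */
    static byte[] fromHexString(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            bytes[i] = (byte) ((hi << 4) | lo);
        }
        return bytes;
    }
}
```

Because every byte value has a two-digit hex form, this representation round-trips any byte array, including bytes that are invalid in UTF-8.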

edited Oct 2, 2016 at 20:47

answered Oct 2, 2016 at 20:34

bmargulies


6 Comments

Justin Moore Over a year ago

I understand that best practice dictates that both getBytes and the String constructor should take a Charset. The issue still exists if I pass a Charset into the String constructor. new String(temp, "UTF-8") throws an UnsupportedEncodingException because the byte[] contains unmappable characters by design. I feel that the answer is going to need the CharsetDecoder API, but I haven't seen any examples using it for something similar.

bmargulies Over a year ago

If it contains non-UTF-8, you may not convert it to a string if you want to keep all the information. You need to convert each byte to two hex digits; there's no way to do that with the APIs you are using.

markspace Over a year ago

@JustinA.Moore So, now that we've found the conceptual error/bug, what exactly do you want to do with unmappable characters? They are, by definition, unmappable, so you need some plan for them that's outside of Charset's purview.

Justin Moore Over a year ago

They can be displayed in the JTextArea as anything (an empty space, the � character from above, whatever; they don't have a character associated with them). I just need the underlying byte to stay the same when the String is converted back to a byte[] or ByteBuffer.

bmargulies Over a year ago

You can't have that unless you write a custom Charset. There are no charsets that provide a round trip for all possible byte values. Something will always be mapped to a substitution character and thus get lost.
