I'm developing an application where users can edit a ByteBuffer through a hex editor interface and also edit the corresponding text through a JTextPane. Because the JTextPane requires a String, I have to convert the ByteBuffer to a String before displaying it. During that conversion, invalid byte sequences are replaced by the charset's default replacement character. This squashes the original value, so when I convert the String back to a ByteBuffer, each invalid byte has been replaced by the byte encoding of the replacement character. Is there an easy way to retain the byte value of an invalid character in a String? I've read the following Stack Overflow posts, but people there usually just want to replace unprintable characters, whereas I need to preserve them.

Java ByteBuffer to String

Java: Converting String to and from ByteBuffer and associated problems

Is there an easy way to do this or do I need to keep track of all the changes that happen in the text editor and apply them to the ByteBuffer?

Here is code demonstrating the problem. The code uses byte[] instead of ByteBuffer but the issue is the same.

        byte[] temp = new byte[16];
        // 0x99 on its own isn't valid UTF-8 (it is a continuation byte)
        Arrays.fill(temp,(byte)0x99);

        System.out.println(Arrays.toString(temp));
        // Prints [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // -103 == 0x99

        System.out.println(new String(temp));
        // Prints ����������������
        // � (U+FFFD) is the default replacement character

        // This takes the byte[], converts it to a string, converts it back to a byte[]
        System.out.println(Arrays.toString(new String(temp).getBytes()));
        // I need this to print [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // However, it prints
        //[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
        // Each group -17, -65, -67 is the UTF-8 encoding (EF BF BD) of �
  • I think this needs code. Sounds like a bug. It could also be a conceptual error: which exact text sequence(s) are you having trouble converting to bytes? Commented Oct 2, 2016 at 20:10
  • I've updated the question to include code showing the issue. This isn't a bug in my code; it's supposed to work this way by default. Commented Oct 2, 2016 at 20:28

2 Answers

UTF-8 in particular will go wrong, because not every byte sequence is valid UTF-8:

    byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};
    String s = new String(bytes, StandardCharsets.UTF_8);
    System.out.println("s: " + s);

One needs a CharsetDecoder. With it, one can ignore (i.e. delete) or replace the offending bytes, or, by default, let an exception be thrown.
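As a minimal sketch of those three choices, the snippet below runs the same bytes through a UTF-8 decoder configured with each `CodingErrorAction`. The class name `DecoderActions` and the helper `decode` are made up for this illustration, and the helper wraps the checked `CharacterCodingException` in an unchecked one purely for convenience.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderActions {
    // Hypothetical helper: decodes bytes as UTF-8, applying the given
    // action to malformed input. A REPORT failure surfaces as an
    // unchecked IllegalArgumentException.
    static String decode(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException ex) {
            throw new IllegalArgumentException("input is not valid UTF-8", ex);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};

        // IGNORE silently drops the offending bytes.
        System.out.println(decode(bytes, CodingErrorAction.IGNORE));
        // REPLACE substitutes U+FFFD for each offending byte.
        System.out.println(decode(bytes, CodingErrorAction.REPLACE));
        // REPORT (the default) refuses the input.
        try {
            decode(bytes, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getCause());
        }
    }
}
```

None of the three actions keeps the original byte value, which is why the offending bytes have to be handled by hand when they must survive.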

For the JTextPane we can use HTML and write the hex code of each offending byte in a <span> with a red background.

    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    CharBuffer charBuffer = CharBuffer.allocate(bytes.length * 50);
    charBuffer.append("<html>");
    for (;;) {
        try {
            CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
            if (!result.isError()) {
                break;
            }
        } catch (RuntimeException ex) {
        }
        int b = 0xFF & byteBuffer.get();
        charBuffer.append(String.format(
            "<span style='background-color:red; font-weight:bold'> %02X </span>",
            b));
        decoder.reset();
    }
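    // flip() (not rewind()) sets the limit to what was written,
    // so toString() does not include the unused tail of the buffer.
    charBuffer.flip();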
    String t = charBuffer.toString();
    System.out.println("t: " + t);

The decoder API is not very pleasant to use, but play with it.

4 Comments

That's a really good idea that I hadn't even considered. The only problem I see with this is there's going to be a ton of additional markup residing in the text of the JTextPane when I convert it back from the String to a byte[]. Do you have any ideas on how to get around that?
A replaceAll("<[^>]*>", "") or, better, a loop with a Pattern and Matcher.
A JTextPane would also allow you to use styled text (StyledDocument) and keep attributes separate from the text, but that is cumbersome, especially if you want to allow editing. But you may use byteBuffer.position() to mark those bytes.
I think this approach might be the best to satisfy my needs for this specific project. I was hoping there was something easier I could do, but this will probably have to do. Thanks!

What do you think that new String(temp).getBytes() will do for you?

I can tell you that it does something BAD.

  1. It converts temp to a String using the default encoding, which is probably wrong, and may lose information.
  2. It converts the results back to a byte array, using the default encoding.

To turn a byte[] into a String, you must always pass a Charset into the String constructor, or else use a decoder directly. Since you are working from buffers, you might find the decoder API congenial.

To turn a String into a byte[], you must always call getBytes(Charset) so that you know that you're using the correct charset.
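A small sketch of what naming the charset explicitly looks like, assuming UTF-8 is the intended encoding (the class name `ExplicitCharsetRoundTrip` and the helper `roundTrip` are made up for this illustration). It also shows that an explicit charset makes the behavior predictable but does not make invalid bytes survive the round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetRoundTrip {
    // Hypothetical helper: decodes and re-encodes with UTF-8 named
    // explicitly on both sides, instead of the platform default.
    static byte[] roundTrip(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8 ("h\u00e9llo" contains one two-byte character).
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Four lone continuation bytes, as in the question.
        byte[] invalid = new byte[4];
        Arrays.fill(invalid, (byte) 0x99);

        System.out.println(Arrays.equals(valid, roundTrip(valid)));     // true
        System.out.println(Arrays.equals(invalid, roundTrip(invalid))); // false
    }
}
```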

Based on comments, I am now suspecting that your problem here is that you need to be writing code something like the following to convert from bytes to hex for your UI. (and then something corresponding to get back.)

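    String getHexString(byte[] bytes) {
        StringBuilder builder = new StringBuilder();
        for (byte b : bytes) {
            // Mask to 0..255 first so negative bytes do not sign-extend.
            int value = b & 0xff;
            // Character.forDigit yields '0'-'9' and 'a'-'f' for radix 16.
            builder.append(Character.forDigit(value >> 4, 16));   // high nibble
            builder.append(Character.forDigit(value & 0x0f, 16)); // low nibble
        }
        return builder.toString();
    }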
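For the "something corresponding to get back" direction, one possible sketch is below. The name `parseHexString` and the class `HexParsing` are made up for this illustration, and it assumes the input is exactly two hex digits per byte with no separators.

```java
import java.util.Arrays;

class HexParsing {
    // Hypothetical inverse of getHexString: reads two hex digits per byte.
    static byte[] parseHexString(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            bytes[i] = (byte) ((high << 4) | low);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = parseHexString("99ff00");
        System.out.println(Arrays.toString(bytes)); // [-103, -1, 0]
    }
}
```

Because every byte value has a two-digit hex form, this representation round-trips all 256 values, including ones such as 0x99 that are not valid UTF-8 on their own.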

6 Comments

I understand that best practice dictates that both getBytes and the String constructor should take a Charset. The issue still exists if I pass a Charset into the String constructor. new String(temp, "UTF-8") throws an UnsupportedEncodingException because the byte[] contains unmappable characters by design. I feel that the answer is going to need to use the CharsetDecoder API, but I haven't seen any examples using it for something similar.
If it contains non-UTF-8, you may not convert it to a string if you want to keep all the information. You need to convert each byte to two hex digits; there's no way to do that with the APIs you are using.
@JustinA.Moore So, now that we've found the conceptual error/bug, what exactly do you want to do with unmappable characters? They are, by definition, unmappable, so you have to have some plan for them that's outside of Charset's purview.
They can be displayed in the JTextArea as anything (an empty space, the � character from above, whatever really); they don't have a character associated with them. I just need the underlying byte to stay the same when the String is converted back to a byte[] or ByteBuffer.
You can't have that unless you write a custom Charset. There are no charsets that provide round-trip of all possible byte values. Something will be mapped to a substitution character, and thus get lost, always.