Is there an existing (not hand-written) way to replace all occurrences of a byte sequence or string inside a byte array? I need to create a byte array containing platform-dependent text (line feed on Linux, carriage return + line feed on Windows). I know this can be implemented manually, but I am looking for an out-of-the-box solution. Note that these byte arrays are large and that I am processing a large number of them, so the solution needs to perform well.

My current approach:

var byteArray = resourceLoader.getResource("classpath:File.txt").getInputStream().readAllBytes();
byteArray = new String(byteArray)
    .replaceAll((schemeModel.getOsType() == SystemTypes.LINUX) ? "\r\n" : "\n",
                (schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n"
    ).getBytes(StandardCharsets.UTF_8);

This approach does not perform well because it creates new Strings and uses a regex to find occurrences. I know that a manual implementation would have to look at sequences of bytes, because the Windows line ending is two bytes long, and would also have to reallocate the array when its length changes.
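For comparison, a minimal sketch of what such a manual, single-pass implementation might look like. It is not a library method: the class name LineEndingBytes and the method normalize are hypothetical, and it assumes the text is UTF-8 or another ASCII-compatible encoding, where the bytes 0x0D and 0x0A only ever mean '\r' and '\n'.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class LineEndingBytes {

    /**
     * Rewrites every line ending (a lone '\n' or a "\r\n" pair) in the input
     * as the given newline sequence. A lone '\r' is copied through unchanged.
     * Assumes an ASCII-compatible encoding such as UTF-8.
     */
    public static byte[] normalize(byte[] input, byte[] newline) {
        // The initial capacity is only a guess; the stream reallocates
        // internally if LF -> CRLF conversion makes the output longer.
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(input.length + input.length / 16);
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == '\r' && i + 1 < input.length && input[i + 1] == '\n') {
                out.write(newline, 0, newline.length);
                i++; // skip the '\n' of the CRLF pair
            } else if (b == '\n') {
                out.write(newline, 0, newline.length);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] mixed = "a\r\nb\nc".getBytes(StandardCharsets.UTF_8);
        byte[] crlf = "\r\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(normalize(mixed, crlf).length);
    }
}
```

Because existing "\r\n" pairs are consumed as a unit, converting to Windows line endings does not double the '\r' in input that is already partly CRLF.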

Apache Commons Lang contains ArrayUtils, which has the method
byte[] removeAllOccurrences(byte[] array, byte element). Is there a third-party library with a similar method for replacing all occurrences of a byte sequence or string inside a byte array?

Edit: As @saka1029 mentioned in the comments, my approach doesn't work for the Windows OS type. Because of this bug I need to stick with regexes, as follows:

(schemeModel.getOsType() == SystemTypes.LINUX) ? "\\r\\n" : "(?<!\\r)\\n",
(schemeModel.getOsType() == SystemTypes.LINUX) ? "\n" : "\r\n")

This way, in the Windows case, only occurrences of '\n' without a preceding '\r' are found and replaced with '\r\n' (the regex is meant to match only the '\n' itself, not the whole [^\r]\n sequence, since otherwise the last character of each line would be replaced as well). Such a workflow cannot be implemented using conventional methods, which invalidates this question.
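A minimal sketch of that Windows rule, using a negative lookbehind so that the match covers only the '\n' and never the character before it. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

public class LineEndingRegex {

    // Matches a '\n' that is not immediately preceded by '\r'.
    // The lookbehind is zero-width, so the preceding character is kept.
    private static final Pattern BARE_LF = Pattern.compile("(?<!\\r)\\n");

    /** Converts bare LF to CRLF and leaves existing CRLF pairs alone. */
    public static String toCrlf(String text) {
        return BARE_LF.matcher(text).replaceAll("\r\n");
    }

    /** Converts CRLF to LF; String.replace is literal, so no regex is needed. */
    public static String toLf(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String mixed = "first\nsecond\r\nthird";
        System.out.println(toCrlf(mixed).length());
        System.out.println(toLf(mixed).length());
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every file, which String.replaceAll would otherwise do on each call.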

  • "byte array containing platform dependent text" - If you are working with text, why not use String? Byte arrays are difficult to work with, as you have discovered. How large are the arrays we're talking about? Commented Aug 9, 2020 at 21:43
  • My use case needs a byte array because I receive it as input (reading a file from an input stream) and need to pass it on to a ZipOutputStream, producing a downloadable zip in my API. The files do not have a fixed size (about the size of an average Java file) and there are many of them (from 20 to 100). As mentioned, working with String directly means creating those Strings (I cannot obtain a String directly), which is not acceptable in my case. Commented Aug 9, 2020 at 21:53
  • If the regex is your concern, just change replaceAll to replace. The replace method does not use regular expressions. Commented Aug 9, 2020 at 22:51
  • @VGR thanks for the response. I didn't notice this. Regex is not my only problem, but this will certainly give me a performance improvement. Commented Aug 9, 2020 at 22:56
  • @saka1029 I noticed that, and it invalidates the answer as well. In the Windows case I need to search for occurrences of '\n' without a preceding '\r' and replace them. That cannot be done with a conventional method, so I will probably stick with regexes anyway. Commented Aug 9, 2020 at 23:33

1 Answer

If you’re reading text, you should treat it as text, not as bytes. Use a BufferedReader to read the lines one by one, and insert your own newline sequences.

String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";

OutputStream out = /* ... */;

try (Writer writer = new BufferedWriter(
        new OutputStreamWriter(out, StandardCharsets.UTF_8));
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(
            resourceLoader.getResource("classpath:File.txt").getInputStream(),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.write(newline);
    }
}

No byte array needed, and you are using only a small amount of memory—the amount needed to hold the largest line encountered. (I rarely see text with a line longer than one kilobyte, but even one megabyte would be a pretty small memory requirement.)
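If the data is already held in a byte array and a byte array must come out again, the same reader-based loop can be wrapped around in-memory streams. The following is a sketch under that assumption; the class name LineEndingStreams and the method rewrite are hypothetical, and UTF-8 is assumed for both input and output.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class LineEndingStreams {

    /** Reads the bytes as UTF-8 text and rewrites each line with the given newline. */
    public static byte[] rewrite(byte[] input, String newline) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                     new ByteArrayInputStream(input), StandardCharsets.UTF_8));
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            String line;
            // readLine() strips "\n", "\r" or "\r\n" from the end of each line.
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(newline);
            }
        } // closing the writer flushes it into 'out'
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "a\r\nb\nc".getBytes(StandardCharsets.UTF_8);
        System.out.println(rewrite(input, "\n").length);
    }
}
```

Two behaviours of this loop are worth knowing: the output always ends with a newline, even when the input did not, and a lone '\r' is treated as a line terminator because that is how readLine() works.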

If you are “fixing” zip entries, the OutputStream can be a ZipOutputStream pointing to a new ZipEntry:

String newline = schemeModel.getOsType() == SystemTypes.LINUX ? "\n" : "\r\n";

ZipInputStream oldZip = /* ... */;
ZipOutputStream newZip = /* ... */;

ZipEntry entry;
while ((entry = oldZip.getNextEntry()) != null) {
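    // Use a fresh entry: the old one may carry size and CRC values that no
    // longer match once the line endings have changed.
    newZip.putNextEntry(new ZipEntry(entry.getName()));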

    // We only want to fix line endings in text files.
    if (!entry.getName().matches(".*\\." +
        "(?i:txt|x?html?|xml|json|[ch]|cpp|cs|py|java|properties|jsp)")) {

        oldZip.transferTo(newZip);
        continue;
    }

    Writer writer = new BufferedWriter(
        new OutputStreamWriter(newZip, StandardCharsets.UTF_8));

    BufferedReader reader = new BufferedReader(
        new InputStreamReader(oldZip, StandardCharsets.UTF_8));

    String line;
    while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.write(newline);
    }

    writer.flush();
}
    

Some notes:

  • Are you deliberately ignoring Macs (and other operating systems which are neither Windows nor Linux)? You should assume \n for everything except Windows. That is, schemeModel.getOsType() == SystemTypes.WINDOWS ? "\r\n" : "\n"
  • Your code contains new String(byteArray) which assumes the bytes of your resource use the default Charset of the system on which your program is running. I suspect this is not what you intended; I have added StandardCharsets.UTF_8 to the construction of the InputStreamReader to address this. If you really meant to read the bytes using the default Charset, you can remove that second constructor argument.
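To illustrate the charset note above, a small sketch that names the charset in both directions rather than relying on the platform default. The class name ExplicitCharset and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharset {

    /** Decodes bytes as UTF-8, independent of the JVM's default charset. */
    public static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    /** Encodes text as UTF-8, independent of the JVM's default charset. */
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = encode("h\u00e9llo\n");
        // With the same explicit charset on both sides, valid UTF-8 input
        // survives the round trip unchanged.
        System.out.println(Arrays.equals(original, encode(decode(original))));
    }
}
```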

4 Comments

@saka1029 UTF-8 is usually the best default. Far more systems use UTF-8 than use ISO-8859-x now.
This is a good solution in my eyes, since line delimiters are easily reworked. I am currently using an InputStreamReader with a ByteArrayInputStream over my byte arrays, since my project is too large to rework immediately. I am getting the expected results.
@saka1029 A one-byte charset can only handle 256 characters. It can’t possibly handle all of the characters UTF-8 can. And, again, it is extremely likely that any non-Windows system will be using UTF-8.
@saka1029 You do know that a Java String is not like a C string, and is never a suitable container for arbitrary bytes, right?
