I want to quickly read large CSV files (about 1 GB) in UTF-8, line by line. I wrote a class for this, but it does not work properly. UTF-8 encodes a Cyrillic character as 2 bytes. I read the file through a byte buffer that is, for example, 10 bytes long. If a character occupies bytes 10 and 11 of the file, it is split across two buffer reads and is not decoded correctly :(

Typical error

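import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class MyReader extends InputStream {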

  private FileChannel channel;
  private ByteBuffer buffer = ByteBuffer.allocate(10);
  private int buffSize = 0;
  private int position = 0;
  private boolean EOF = false;
  private CharBuffer charBuffer;

  private MyReader() {}

  static MyReader getFromFile(final String path) throws IOException {
    MyReader myReader = new MyReader();
    myReader.channel = FileChannel.open(Path.of(path),
        StandardOpenOption.READ);
    myReader.initNewBuffer();
    return myReader;
  }
  private void initNewBuffer() {
    try {
      buffSize = channel.read(buffer);
      buffer.position(0);
      charBuffer = Charset.forName("UTF-8").decode(buffer);
      buffer.position(0);
    } catch (IOException e) {
      throw new RuntimeException("Error reading file: {}", e);
    }
  }
  @Override
  public int read() throws IOException {
    if (EOF) {
      return -1;
    }
    if (position < charBuffer.length()) {
      return charBuffer.array()[position++];
    } else {
      initNewBuffer();
      if (buffSize < 1) {
        EOF = true;
      } else {
        position = 0;
      }
      return read();
    }
  }
  public char[] readLine() throws IOException {
    int readResult = 0;
    int startPos = position;
    while (readResult != -1) {
      readResult = read();
    }
    return Arrays.copyOfRange(charBuffer.array(), startPos, position);
  }
}
  • Why did you create your own class instead of using InputStreamReader? Commented Sep 27, 2019 at 11:03
  • I want to write my own implementation :) Commented Sep 27, 2019 at 11:20
  • Well it seems to be a bit out of your reach for now. You're mixing streams (IO) and channels (NIO), your buffer handling is wrong (using position() instead of flip()) and so on. Maybe read a few tutorials? It's too broad to explain all the things wrong with your code. Commented Sep 27, 2019 at 11:22
  • Thank you. Can you share links to tutorials? Commented Sep 27, 2019 at 11:25
  • docs.oracle.com/javase/tutorial/essential/io/fileio.html Commented Sep 27, 2019 at 11:27
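
To illustrate the flip() point from the comments, here is a minimal sketch. The class and method names are made up for the demo, and buffer.put(data) stands in for channel.read(buffer). It shows that position(0) leaves the limit at the buffer's capacity, so the unused tail of the buffer gets decoded as well.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's code.
class FlipDemo {

  // flip() sets limit = number of bytes written and position = 0,
  // so only the bytes that were actually written are decoded.
  static String decodeWithFlip(byte[] data, int capacity) {
    ByteBuffer buffer = ByteBuffer.allocate(capacity);
    buffer.put(data); // stands in for channel.read(buffer)
    buffer.flip();
    return StandardCharsets.UTF_8.decode(buffer).toString();
  }

  // position(0) rewinds but keeps limit == capacity,
  // so the unused tail of the buffer is decoded too.
  static String decodeWithPositionZero(byte[] data, int capacity) {
    ByteBuffer buffer = ByteBuffer.allocate(capacity);
    buffer.put(data);
    buffer.position(0);
    return StandardCharsets.UTF_8.decode(buffer).toString();
  }

  public static void main(String[] args) {
    byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
    System.out.println(decodeWithFlip(data, 10).length());
    System.out.println(decodeWithPositionZero(data, 10).length());
  }
}
```

In a freshly allocated buffer the extra bytes are zeros; in a reused buffer they are stale data from the previous read, which is what happens on the last, short read of a file.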

2 Answers

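A bad solution, but it works: if the last byte read is the lead byte of a 2-byte sequence, step the channel back one byte so that the character is read whole with the next chunk.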

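private void initNewBuffer() {
  try {
    buffer.clear();                  // make the whole buffer writable again
    buffSize = channel.read(buffer);
    buffer.flip();                   // decode only the bytes actually read
    charBuffer = StandardCharsets.UTF_8.decode(buffer);
    if (buffSize > 0) {
      byte edgeByte = buffer.array()[buffSize - 1];
      // Only these lead bytes of 2-byte sequences are handled;
      // 3- and 4-byte sequences are not covered.
      if (edgeByte == (byte) 0xd0 ||
          edgeByte == (byte) 0xd1 ||
          edgeByte == (byte) 0xc2 ||
          edgeByte == (byte) 0xd2 ||
          edgeByte == (byte) 0xd3
      ) {
        // Re-read the lead byte with the next chunk and drop the
        // replacement character that decode() produced for it.
        channel.position(channel.position() - 1);
        charBuffer.limit(charBuffer.limit() - 1);
      }
    }
  } catch (IOException e) {
    throw new RuntimeException("Error reading file", e);
  }
}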

1 Comment

The decode method of Charset is designed to process complete input. You should use a CharsetDecoder as described in its class documentation. Combine this with a flip/compact loop instead of calling position(0) and the correct handling of dangling multi-byte characters comes for free.
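
A minimal sketch of what this comment describes, assuming a blocking channel and a byte buffer of at least 4 bytes so that any single UTF-8 sequence fits. The class name DecoderLoopSketch and the method decodeAll are made up for illustration, and collecting the output in a StringBuilder is only for the demo; a real line reader would scan the decoded chars for line breaks instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: CharsetDecoder driven by a flip/compact loop.
class DecoderLoopSketch {

  // Decodes the whole channel as UTF-8. bufferSize must be >= 4 (assumption of this sketch).
  static String decodeAll(ReadableByteChannel channel, int bufferSize) throws IOException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
    CharBuffer chars = CharBuffer.allocate(bufferSize);
    StringBuilder out = new StringBuilder();

    boolean endOfInput = false;
    while (!endOfInput) {
      endOfInput = channel.read(bytes) == -1; // appends after any leftover bytes
      bytes.flip();                           // switch the buffer to reading mode
      CoderResult result;
      do {
        result = decoder.decode(bytes, chars, endOfInput);
        if (result.isError()) {
          result.throwException();            // malformed or unmappable input
        }
        drain(chars, out);
      } while (result.isOverflow());
      bytes.compact();                        // keep an incomplete trailing sequence
    }

    CoderResult flushed;
    do {
      flushed = decoder.flush(chars);
      drain(chars, out);
    } while (flushed.isOverflow());
    return out.toString();
  }

  // Moves the decoded chars into the output and empties the char buffer.
  private static void drain(CharBuffer chars, StringBuilder out) {
    chars.flip();
    out.append(chars);
    chars.clear();
  }

  public static void main(String[] args) throws IOException {
    // Cyrillic text written with escapes to stay independent of the source file encoding.
    String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440\nabc";
    byte[] data = text.getBytes(StandardCharsets.UTF_8);
    String decoded = decodeAll(Channels.newChannel(new ByteArrayInputStream(data)), 5);
    System.out.println(decoded.equals(text));
  }
}
```

The bytes of a character split across two reads stay in the byte buffer after compact(), and the next read appends the rest, so no special-casing of lead bytes is needed.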

First: the gain is questionable.

The Files class has many convenient methods (for example Files.lines and Files.newBufferedReader) that are fast enough for production use.
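
As a rough sketch of that route (the class name FilesLineReader and the helper forEachLine are invented for this example, and Files.writeString in the demo needs Java 11 or later), reading a UTF-8 file line by line without loading it all into memory could look like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper class for the demo.
class FilesLineReader {

  // Streams the file one line at a time; only the current line is held in memory.
  static void forEachLine(Path file, Consumer<String> action) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        action.accept(line);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Small temporary file standing in for the large CSV.
    Path tmp = Files.createTempFile("sample", ".csv");
    Files.writeString(tmp, "id;name\n1;\u0418\u0432\u0430\u043d\n", StandardCharsets.UTF_8);
    forEachLine(tmp, System.out::println);
    Files.delete(tmp);
  }
}
```

The reader decodes through a CharsetDecoder internally, so characters that straddle its internal buffer boundary are handled for you.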

Bytes with the high bit set (negative as a Java byte) are part of a UTF-8 multibyte sequence. Bytes whose top two bits are 10 are continuation bytes. A sequence is at most 4 bytes long (the original UTF-8 design allowed up to 6, but RFC 3629 restricts it to 4).

So if the next buffer starts with continuation bytes, they belong to the character that began at the end of the previous buffer.

The programming logic I gladly leave to you.
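
One possible sketch of that logic, under the assumption that you hold back the incomplete tail of each buffer and prepend it to the next read. The class Utf8Boundary and the method incompleteTailLength are hypothetical names; the method only looks at the bit patterns described above.

```java
// Hypothetical helper: finds an incomplete UTF-8 sequence at the end of a buffer.
class Utf8Boundary {

  // Returns how many trailing bytes of buf[0..len) begin a UTF-8 sequence
  // that is not completed within that range (0 if the range ends cleanly).
  static int incompleteTailLength(byte[] buf, int len) {
    // Walk back over at most 3 continuation bytes (10xxxxxx) to reach the lead byte.
    int i = len - 1;
    int continuation = 0;
    while (i >= 0 && continuation < 3 && (buf[i] & 0xC0) == 0x80) {
      i--;
      continuation++;
    }
    if (i < 0) {
      return 0; // empty range or only continuation bytes: nothing to hold back
    }

    int lead = buf[i] & 0xFF;
    int expected;
    if (lead < 0x80) {
      expected = 1;                      // 0xxxxxxx: ASCII
    } else if ((lead & 0xE0) == 0xC0) {
      expected = 2;                      // 110xxxxx
    } else if ((lead & 0xF0) == 0xE0) {
      expected = 3;                      // 1110xxxx
    } else if ((lead & 0xF8) == 0xF0) {
      expected = 4;                      // 11110xxx
    } else {
      return 0;                          // not a valid lead byte: leave it to the decoder
    }

    int present = len - i;               // lead byte plus the continuation bytes seen
    return present < expected ? present : 0;
  }

  public static void main(String[] args) {
    // 'a', 'b', then the first byte of a 2-byte Cyrillic character.
    byte[] chunk = {'a', 'b', (byte) 0xD0};
    System.out.println(incompleteTailLength(chunk, chunk.length));
  }
}
```

Decode only the first len minus that many bytes of each buffer, and carry the remaining bytes over to the front of the next one.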
