
My application currently uses CSV Parser to parse CSV files and persist the data to a database. It loads the entire CSV into memory and takes a long time to persist, sometimes even timing out. I have seen mixed recommendations on the site about using the Univocity parser. Please advise on the best approach to process large amounts of data in less time.
Thank you.

Code:

int numRecords = csvParser.parse(fileBytes);

public int parse(InputStream ins) throws ParserException {
    long parseTime = System.currentTimeMillis();
    fireParsingBegin();
    ParserEngine engine = null;
    try {
        engine = (ParserEngine) getEngineClass().newInstance();
    } catch (Exception e) {
        throw new ParserException(e.getMessage());
    }
    engine.setInputStream(ins);
    engine.start();
    int count = parse(engine);
    fireParsingDone();
    long seconds = (System.currentTimeMillis() - parseTime) / 1000;
    System.out.println("Time taken is " + seconds + " seconds");
    return count;
}


protected int parse(ParserEngine engine) throws ParserException {
    int count = 0;
    while (engine.next()) { // values String array in the engine is populated with cell data
        if (stopParsing) {
            break;
        }

        Object o = parseObject(engine); // create individual TOs
        if (o != null) {
            count++; // count is increased after every TO is formed
            fireObjectParsed(o, engine); // put it into BO/COL and do validation preparations
        } else {
            return count;
        }
    }
    return count;
}
  • There are different ways to read a file; their performance is compared in this other SO question. Commented Oct 29, 2018 at 15:13
  • Depends on the application. I would think that in most situations the bottleneck would be pushing the data to persistence rather than reading from a CSV file. Given that the file is huge, you may want to only partially load the CSV data into memory to ensure that you are not memory bound. Commented Oct 29, 2018 at 15:14
  • “It loads the entire csv into memory” ← That is the cause of your problem. Don’t do that. Parse each line after reading it. The whole point of InputStreams and Readers is having manageable amounts of data in memory (see the streaming sketch after these comments). Commented Oct 29, 2018 at 15:51
  • Thank you for the response. I have updated the question with my code. We are converting into file bytes and calling parse(byte bytes[]). Do I need to change my implementation here? Any sample code that you can refer to? Commented Oct 29, 2018 at 16:12
  • Is there a way to send file bytes in chunks in Java for parsing? Commented Oct 29, 2018 at 16:29
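
To illustrate what these comments suggest (including the question about processing the file in chunks), here is a minimal streaming sketch. It is not the asker's actual ParserEngine API: the persistBatch helper and the batch size of 1000 are placeholders. The idea is to read the InputStream line by line with a BufferedReader instead of converting the whole file to a byte array first:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class StreamingCsvLoader {

        public int parse(InputStream ins) throws IOException {
            int count = 0;
            List<String> batch = new ArrayList<>();
            // Only one line is held in memory at a time, instead of a byte[] of the whole file.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(ins, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line); // convert the line into your transfer object here
                    count++;
                    if (batch.size() == 1000) {
                        persistBatch(batch); // placeholder: flush 1000 rows to the database
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    persistBatch(batch); // flush the final partial batch
                }
            }
            return count;
        }

        private void persistBatch(List<String> rows) {
            // placeholder: run a batched insert for these rows
        }
    }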

3 Answers


univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. The problems you are having likely come from two things:

1 - loading everything into memory. That's generally a bad design decision, but if you do that, make sure to allocate enough memory for your application. Give it more memory using the flags -Xms8G and -Xmx8G, for example.

2 - you are probably not batching your insert statements.

My suggestion is to try this (using univocity-parsers):

    //configure the input format
    CsvParserSettings settings = new CsvParserSettings();

    //get an iterator
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)"); 

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get next row from parser and set values in your statement
        String[] row = it.next(); 
        for(int i = 0; i < COLUMN_COUNT; i++){ 
            if(i < row.length){
                statement.setObject(i + 1, row[i]);
            } else { //row in input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);   
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows made into the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }
    // the last batch probably won't have 1000 rows.
    if (batchSize > 0) {
        statement.executeBatch();
    }

This should execute pretty quickly, and you won't need even 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally block to close any resources here. Your actual code must handle that.

Hope it helps.


15 Comments

Thank you Jeronimo. The application is already using -Xms8G and -Xmx8G. I will try the batch implementation you suggested. Thank you again for the input.
Hi Jeronimo, I looked at the code and we are using CSVParser and ParserObserver, and it takes 1 second to parse and validate each row of the CSV file. For a file with 120k records it takes over an hour to finish uploading to the database because it is always processed serially. Can you suggest ways to implement this in parallel?
Also, to add, my application is using -Xms8G and -Xmx24G.
univocity-parsers doesn't have a ParserObserver class. Are you using the right lib?
You should be able to process 120k records in less than 2 seconds (CSV) and need another 10 seconds at most to insert it all into the database.

Use the Commons CSV Library by Apache.
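
To make that concrete, here is a minimal sketch of streaming a file with Commons CSV. The file path, column indexes, and the CommonsCsvExample class are placeholders to adapt to your own schema and CSVFormat:

    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class CommonsCsvExample {
        public static void main(String[] args) throws Exception {
            // CSVParser iterates records as it reads; it does not load the whole file into memory.
            try (Reader reader = Files.newBufferedReader(
                        Paths.get("/path/to/your.csv"), StandardCharsets.UTF_8);
                 CSVParser parser = CSVFormat.DEFAULT.parse(reader)) {
                for (CSVRecord record : parser) {
                    String column1 = record.get(0);
                    String column2 = record.get(1);
                    // convert and persist here, ideally with batched inserts as in the answer above
                }
            }
        }
    }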

Comments


Streaming with Apache Commons IO

try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
}

Comments
