
I'm scratching my head over a CSV file I cannot parse correctly due to many errors. I extracted a sample you can download here: Test CSV File

The main errors (or what generated the errors) are:

  • Quotes & commas (many errors when trying to parse the file with R)
  • Empty rows
  • Unexpected line break inside a field

I first tried using regular expressions line by line to clean the data before loading it into R, but I couldn't solve the problem and it was too slow (the file is 200 MB).

So I decided to use a CSV parser under Node.js with the following code:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');

let input       = 'data_stack.csv';
let readStream  = Fs.createReadStream(input);
let option      = {delimiter: ',', quote: '"', escape: '"', relax: true};

let parser = Csv.parse(option).on('data', (data) => {
    console.log(data);
});

readStream.pipe(parser);

But:

  • Some rows are parsed correctly (an array of strings)
  • Some are not parsed (all fields end up as one string)
  • Some rows are still empty (this can be solved by adding skip_empty_lines: true to the options)
  • I don't know how to handle the unexpected line breaks.

I don't know how to clean this CSV, either with R or with Node.js.

Any help?

EDIT:

Following @Danny_ds's solution, I can parse the file correctly. Now I cannot stringify it back correctly.

With console.log() I get a proper object, but when I try to stringify it I don't get clean CSV (there are still line breaks and empty rows).

Here is the code I'm using:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');


let input  = 'data_stack.csv';
let output = 'data_output.csv';

let readStream  = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};


let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');

    return replace;
});

let parser = Csv.parse(opt);
let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);

EDIT 2:

Here is the final code that works:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');


let input  = 'data_stack.csv';
let output = 'data_output.csv';

let readStream  = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};


let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty
        .replace(/\r\n"/g, '\r\n')
        .replace(/"\r\n/g, '\r\n')
        .replace(/""/g, '"');

    return replace;
});

let parser = Csv.parse(opt);

let cleaner = Csv.transform(data => {
    let clean = data.map(l => {
        // replace overlong or garbled fields with a marker
        if (l.length > 100 || l[0] === '+') {
            return "Encoding issue";
        }
        return l;
    });
    return clean;
});

let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);

Thanks to everyone!

3 Comments

  • Wow, that's one messed-up CSV! You will need to fix it in multiple stages. First would be to fix the newlines that seem to be embedded in some rows. Next I would sort out the random quotes. If you don't expect to have commas in your data, remove the quotes.
  • Could you upload the CSV file somewhere else? A gist, maybe?
  • Here is another link: Test CSV File

3 Answers


I don't know how to clean this CSV, either with R or with Node.js.

Actually, it is not as bad as it looks.

This file can easily be converted to valid CSV using the following steps:

  • replace all "" with ".
  • replace all \n" with \n.
  • replace all "\n with \n.

Here \n means an actual newline character, not the two characters "\n", which also appear in your file.

Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n with \r\n in the examples above. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too; you might want to check this against the original file.
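A minimal Node.js sketch of these three replacements (assuming \r\n line endings, as in the sample file, and that the file fits in memory; the file names are placeholders) could look like this:

'use strict';

const Fs = require('fs');

// Apply the three replacements from the steps above, in order:
// collapse escaped quotes first, then strip the stray quotes
// around the line breaks.
let dirty = Fs.readFileSync('data_stack.csv', 'utf8');

let clean = dirty
    .replace(/""/g, '"')        // "" -> "
    .replace(/\r\n"/g, '\r\n')  // \n" -> \n
    .replace(/"\r\n/g, '\r\n'); // "\n -> \n

Fs.writeFileSync('data_clean.csv', clean);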

This should produce a valid CSV file.


There will still be multiline fields, but that was probably intended. Now, though, they are properly quoted, and any decent CSV parser should be able to handle multiline fields.
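For instance, here is a minimal check using the same csv package as in the question (the sample string is made up):

const Csv = require('csv');

// A quoted field may legally contain line breaks:
const sample = 'a,"multi\r\nline field",c\r\nd,e,f\r\n';

Csv.parse(sample, (err, rows) => {
    console.log(rows);
    // [ [ 'a', 'multi\r\nline field', 'c' ], [ 'd', 'e', 'f' ] ]
});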


It looks like the original data has had an extra pass for escaping quote characters:

  • If the original fields contained a , they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the correct way to do it.

  • But then all rows containing a quote seem to have been quoted again (actually converting those rows into one quoted field), and all the quotes inside those rows were escaped with another quote (see the example after this list).

  • Obviously, something went wrong with the multiline fields. Quotes were added between the multiple lines too, which is not correct.
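To illustrate the second point with a made-up row: a record that was correctly stored as

a,b,"hello ""world""",c

would, after the extra quoting pass, end up as

"a,b,""hello """"world"""""",c"

with the whole row turned into one quoted field and every quote doubled once more.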


7 Comments

Just tried with Fs.readFile('data_stack.csv', (err, data) => { data.toString().replace(/""/g, '"').replace(/[\r\n]"/g, '\n').replace(/"[\r\n]/g, '\n'); Fs.writeFile('data_output.csv', data); }) and it doesn't work.
You need something like .replace(/\r\n"/g, '\r\n') or .replace(/\n"/g, '\n'). The same goes for the last replace.
@Synleb - Well, it's normal that there are still newlines and empty lines (should not be empty csv-rows), because there is a multiline field in your data (column 8 / R2) - which is valid csv if the multiline field is quoted, which should be the case after the first cleanup. If you don't want that, you could remove the newlines in that field only, after parsing the file.
@Synleb - But you'll have to make sure your csv parser supports multiline fields of course (I didn't see an option for that on the site you link to). But since you're saying that it is parsed correctly, I guess that's the case.
You're right, the field is still on multiline but the Csv is now valid when parsed. I just removed the malformed field and it's clean. Thanks a lot!

The data is not too messed up to work with. There is a clear pattern.

General steps:

  1. Temporarily remove the mixed-format inner fields (those beginning with two or more quotes and containing all kinds of characters).
  2. Remove the quotes from the start and end of quoted lines, giving clean CSV.
  3. Split the data into columns.
  4. Put the removed fields back.

Step 1 above is the most important. If you apply it, the problems with newlines, empty rows, quotes and commas disappear. If you look at the data you can see that columns 7, 8 and 9 contain mixed data, but it is always delimited by 2 quotes or more, e.g.:

good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data

Here is a working example based on the file provided:

const fs = require('fs');

fs.readFile('./data_stack.csv', (e, data) => {

    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g) || []; // null-safe if there are no matches
    matches.forEach((m,i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });

    var cleanData =   dirty
        .split('\n') // get lines

        // ignore first line with column names
        .filter((l, i) => i > 0)

        // remove first and last quotation mark if present
        // (length - 2 also drops the trailing \r left after splitting \r\n on \n)
        .map(l => l[0] === '"' ? l.substring(1, l.length - 2) : l)

        // split into columns
        .map(l => l.split(','))

        // put the replaced fields back into the data (columns 7, 8 and 9)
        .map(col => {

            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            function returnField(f) {
                if (f) {
                    var repls = f.match(/<.*?>/g)
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        })

    console.log(cleanData); // a return value from the callback would be discarded
});

Result:

Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):

  ...,
  [ '19403',
    '560e348d2adaffa66f72bfc9',
    'done',
    '276',
    '2015-10-02T07:38:53.172Z',
    '20151002',
    '560e31f69cd6d5059668ee16',
    '""560e336ef3214201030bf7b5""',
    'a+�a��a+�a+�a��a+�a��a+�a��',
    '',
    '560e2e362adaffa66f72bd99',
    '55f8f041b971644d7d861502',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],
  [ '20388',
    '560ce1a467cf15ab2cf03482',
    'update',
    '231',
    '2015-10-01T07:32:52.077Z',
    '20151001',
    '560ce1387494620118c1617a',
    '""""""Final test, with a comma""""""',
    '',
    '',
    '55e6dff9b45b14570417a908',
    '55e6e00fb45b14570417a92f',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],



Following on from my comment:

The data is too messed up to fix in one step; don't try.

First, decide whether double quotes and/or commas might be part of the data. If they are not, remove the double quotes with a simple regex.

Next, there should be 14 commas on each line. Read the file as text and count the number of commas on each line in turn. Where there are fewer than 14, check the following line, and if the sum of the commas is 14, merge the two lines. If the sum is still less than 14, check the next line and continue until you have 14 commas. If the next line takes you over 14, there is a serious error, so make a note of the line numbers - you will probably have to fix those by hand. Save the resulting file.
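A rough sketch of that merging pass in Node.js (assuming the quotes have already been stripped so that every comma is a delimiter; the file names are placeholders and the column count follows the description above):

'use strict';

const Fs = require('fs');

const EXPECTED_COMMAS = 14; // 15 columns -> 14 delimiters

let lines = Fs.readFileSync('data_stack.csv', 'utf8').split(/\r?\n/);
let merged = [];
let suspect = []; // line numbers that will need fixing by hand

let buffer = '';
lines.forEach((line, i) => {
    buffer += line;
    let commas = (buffer.match(/,/g) || []).length;

    if (commas === EXPECTED_COMMAS) {
        // complete row, possibly merged from several physical lines
        merged.push(buffer);
        buffer = '';
    } else if (commas > EXPECTED_COMMAS) {
        // merging overshot 14 commas: flag for manual inspection
        suspect.push(i + 1);
        merged.push(buffer);
        buffer = '';
    }
    // fewer than 14 commas: keep accumulating lines
});

if (buffer !== '') merged.push(buffer); // flush a trailing partial row

Fs.writeFileSync('data_merged.csv', merged.join('\r\n'));
console.log('Lines needing manual review:', suspect);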

With luck, you will now have a file that can be processed as a CSV. If not, come back with the partially tidied file and we can try to help further.

It should go without saying that you should process a copy of the original; you are unlikely to get it right the first time :)

3 Comments

Thanks Julian. Just one thing, though, regarding your first point (and the second, by the way): how can I count the commas without counting the commas enclosed inside quoted strings? And by applying a regex to remove the double quotes, I leave the enclosed commas in the wild.
That's why I asked whether the data could contain commas. If it can, I'm not sure you can fix the data without checking it manually, and possibly not even then. Not all CSV data contains embedded commas, which is why the quotes around the data are actually optional. In your case, though, you have many mismatched quotes, which is a concern as it indicates either corrupted data or that the data itself is actually binary, some of which is showing up as quotes.
I should have also said that, without knowing the origin of the data, it is almost impossible to provide a definitive answer.
