339

I have a line like this in my CSV:

"Samsung U600 24"","10000003409","1","10000003427"

Quote next to 24 is used to express inches, while the quote just next to that quote closes the field. I'm reading the line with fgetcsv but the parser makes a mistake and reads the value as:

Samsung U600 24",10000003409"

I tried putting a backslash before the inches quote, but then I just get a backslash in the name:

Samsung U600 24\"

Is there a way to properly escape this in the CSV, so that the value would be Samsung U600 24" , or do I have to regex it in the processor?

0

7 Answers 7

583

Use 2 quotes:

"Samsung U600 24"""

RFC-4180 says "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."

Sign up to request clarification or add additional context in comments.

6 Comments

As tommed says you only need to add a single double quote to escape a double quote. You can use a command-line tool called csvfix to detect any lines which don't conform: csvfix check -nl -v [filename]
@SamCritchley I only see a single double quote being used to escape here. By "Use 2 quotes", user4035 means that 1 quote should be replaced with 2 quotes. By escaping double quotes with double quotes, you're effectively creating pairs of double quotes (2 double quotes). The final quote you see on the end is to terminate the field.
if my csv string like: world,",hello
@GarfieldCat Try this: "world,"",hello"
@codemicky that basically because your original file was not a properly formatted CSV. If you need to make that file work using an RFC-4180-compliant parser, you could try setting the quote character to something other than " (a null character works well) so that the parser will give you the unprocessed string "Samsung U600 24"", and then you could do the work to unquote it yourself.
|
16

CSV is, in theory, a simple format: tabular data delimited by commas. Regrettably there is no universal formal specification, so there are many subtly different implementations out there. This requires some care when importing/exporting! I will quote RFC 4180 for common implementations:

2. Definition of the CSV Format

While there are various specifications and implementations for the
CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of
interpretations of CSV files. This section documents the format that seems to be followed by most implementations:

  1. Each record is located on a separate line, delimited by a line break (CRLF). For example:

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
    
  2. The last record in the file may or may not have an ending line break. For example:

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx
    
  3. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example:

    field_name,field_name,field_name CRLF
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
    
  4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:

    aaa,bbb,ccc
    
  5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

    "aaa","bbb","ccc" CRLF
    zzz,yyy,xxx
    
  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx
    
  7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","b""bb","ccc" 
    

So usually

  • A field may or may not be enclosed in double quotes. (The RFC from 2005 says Excel doesn't use double quotes, but I tested with Excel 2016 and it does.)
  • Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. (In particular, a CSV file may have multiple lines as they appear in a text editor correspond to one row of data.)
  • If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
    • So "" in the raw CSV field represents an empty string, and """" in a raw CSV represents a single quote, ".

Most CSV readers can detect and handle either CRLF (Windows-style) or LF (Unix-style) line endings, and the newline after the last row doesn't matter. The RFC doesn't specify encoding, so you will have to determine that. Microsoft products like Excel mostly export ASCII, Windows-1252, or UTF-8 with a BOM which may end up in the very first field if not recognized and handled.

However you may encounter data that escapes quotes or other characters (delimiter, linebreaks, escape character itself) with an escape character like \. For example, in readr's read_csv(), this is controlled by escape_double and escape_backslash.

Always check imported data for any irregularities, including if any fields have quotes or newlines, or a row doesn't have a consistent number of columns (an error). Also be careful with programs like Excel that might strip off leading zeros from strings that look like numbers. Unlike JSON, there is no concept of data type in CSV, just text.

Aside: the MIME standard for TSV takes a simplistic approach by simply disallowing tabs and newlines in fields. LF, CR, and tab need to be escaped, usually with \n, \r, \t (backslash itself is \\). I actually quite like this approach as it completely gets rid of quoting ambiguities, and rows correspond exactly to newline splits.

Comments

6

Not only double quotes, you will be in need for single quote ('), double quote ("), backslash (\) and NUL (the NULL byte).

Use fputcsv() to write, and fgetcsv() to read, which will take care of all.

8 Comments

@Angelin Nadar, could you please add a source to your claim about the need to double single quote, backslash and NUL? I didn't find it in RFC-4180.
You don't need to actually escape single quotes etc. A proper CSV file doesn't even need to add double quotes around a field which contains only single quotes. If the CSV reader is properly implemented it should read the file correctly even with those symbols.
Why was this answer ever voted up? The comment about escaping characters was never backed up and the original question doesn't ask about PHP. This only seems to be true for the string delimiter (and only for the chosen delimiter) when a program, such as Open Office, allows you to change it.
If you actually read it RFC4180 notes that there are various [other] specifications and implementations for the CSV format and lists at least 4. The OP does not specify which particular format, so I feel it's unfair to downvote this answer based on presumptions about a specific document.
@cz The answer here is plainly wrong. It neither answers the original question, nor does it provide reasonable other context. For example, NONE of the usual "other CSV formats" requires escaping single quotes inside double quotes, but SOME of the common CSV readers won't unescape escaped single quotes. Escaping backslash MAY be necessary with some readers, but again not with all.
|
3

I have written in Java.

public class CSVUtil {
    public static String addQuote(
            String pValue) {
        if (pValue == null) {
            return null;
        } else {
            if (pValue.contains("\"")) {
                pValue = pValue.replace("\"", "\"\"");
            }
            if (pValue.contains(",")
                    || pValue.contains("\n")
                    || pValue.contains("'")
                    || pValue.contains("\\")
                    || pValue.contains("\"")) {
                return "\"" + pValue + "\"";
            }
        }
        return pValue;
    }

    public static void main(String[] args) {
        System.out.println("ab\nc" + "|||" + CSVUtil.addQuote("ab\nc"));
        System.out.println("a,bc" + "|||" + CSVUtil.addQuote("a,bc"));
        System.out.println("a,\"bc" + "|||" + CSVUtil.addQuote("a,\"bc"));
        System.out.println("a,\"\"bc" + "|||" + CSVUtil.addQuote("a,\"\"bc"));
        System.out.println("\"a,\"\"bc\"" + "|||" + CSVUtil.addQuote("\"a,\"\"bc\""));
        System.out.println("\"a,\"\"bc" + "|||" + CSVUtil.addQuote("\"a,\"\"bc"));
        System.out.println("a,\"\"bc\"" + "|||" + CSVUtil.addQuote("a,\"\"bc\""));
    }
}

2 Comments

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
Doesn't all that .Contains() scan the full string every time? Why not just always wrap the text in double quote? Also, the first pValue.contains("\"")) seems useless (the replace will check the double quote exists itself).
3

I know this is an old post but here's the extremely short answer... The line in proper CSV format should be "Samsung U600 24""","10000003409","1","10000003427" Following the 24 you should have 3 double quotes. The first 2 double quotes will make the one double quote in the actual field. The third double quote ends the field started by the doublequote in front of Samsung.

1 Comment

Doesn't the existing answer already say that though?
-3

Since no one has mentioned the way I usually do it, I'll just type this down. When there's a tricky string, I don't even bother escaping it.

What I do is just base64_encode and base64_decode, that is, encode the value to Base64 before writing the CSV line and when I want to read it, decode.

For your example assuming it's PHP:

$csvLine = [base64_encode('Samsung U600 24"'),"10000003409","1","10000003427"];

And when I want to take the value, I do the opposite.

$value = base64_decode($csvLine[0])

I just don't like to go through the pain.

Comments

-4

I know this is an old post, but here's how I solved it (along with converting null values to empty string) in C# using an extension method.

Create a static class with something like the following:

    /// <summary>
    /// Wraps value in quotes if necessary and converts nulls to empty string
    /// </summary>
    /// <param name="value"></param>
    /// <returns>String ready for use in CSV output</returns>
    public static string Q(this string value)
    {
        if (value == null)
        {
            return string.Empty;
        }
        if (value.Contains(",") || (value.Contains("\"") || value.Contains("'") || value.Contains("\\"))
        {
            return "\"" + value + "\"";
        }
        return value;
    }

Then for each string you're writing to CSV, instead of:

stringBuilder.Append( WhateverVariable );

You just do:

stringBuilder.Append( WhateverVariable.Q() );

1 Comment

doesn't this miss the doubling up of embedded quotes?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.