
I have this string:

G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.

I have been trying to split it in Java. The file is comma-delimited, but some of the fields contain commas and I don't want those fields split up. Currently, in the example above,

"Auxotrophies, carbon and"

is getting split into two strings.

Any suggestions on how best to split this on commas? Not all of the lines have a field wrapped in " "; for example, the following line has none:

G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.
  • You should consider using a regex. Commented May 23, 2012 at 22:38
  • This is why character encodings are used. Do not place it in quotes, change the comma to something like %2C, then decode after. Commented May 23, 2012 at 22:39
  • Can you choose a delimiter that is not a comma, like a "|"? Commented May 23, 2012 at 22:52
  • Unfortunately, I cannot modify the input file. Commented May 23, 2012 at 22:53
  • I think you could tokenize on " first and then, iterating over those tokens, tokenize again on , Commented May 26, 2012 at 16:46

2 Answers

2

Use an existing CSV parser such as opencsv: http://opencsv.sourceforge.net/

But if you really do need to reinvent the wheel (homework), you need a more complicated regular expression than just "what,ever".split(","), and it's not simple. You might be better off writing your own custom lexer: http://en.wikipedia.org/wiki/Lexical_analysis

This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
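A minimal sketch of that character-by-character approach, for illustration only. The class and method names (QuoteAwareSplitter, split) are made up here, and the sketch assumes fields never contain escaped or doubled quotes and never span multiple lines:

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {

    /** Splits one line on commas, ignoring commas that appear inside double quotes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Toggle the quote state; the quote character itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing comma, so add it explicitly.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "G234101,Non-Essential,ATPases,Respiration chain complexes,"
                + "\"Auxotrophies, carbon and\",PS00017,2,IONIC HOMEOSTASIS,mitochondria.";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

Lines without any quoted field go through the same loop unchanged, since inQuotes simply never becomes true.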

Also see StreamTokenizer for a built-in configurable lexer; you should be able to use it to meet your requirements.
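One possible StreamTokenizer configuration, as a rough sketch rather than a full CSV parser. The class name StreamTokenizerCsv is hypothetical. Note two StreamTokenizer behaviours this relies on or inherits: a quoted string ends at the closing quote or at a line break, and backslashes inside quotes are treated as escape characters:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerCsv {

    /** Splits a single line into fields using java.io.StreamTokenizer. */
    static List<String> split(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.resetSyntax();      // drop the defaults (number parsing, whitespace, comments)
        st.wordChars(0, 255);  // treat every character as part of a word...
        st.ordinaryChar(',');  // ...except the comma, which becomes its own token
        st.quoteChar('"');     // ...and the double quote, which delimits a string

        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int token;
            while ((token = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == ',') {
                    // A comma token ends the current field.
                    fields.add(current.toString());
                    current.setLength(0);
                } else {
                    // Either a word or a quoted string; the text is in sval.
                    current.append(st.sval);
                }
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader.
            throw new UncheckedIOException(e);
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Calling StreamTokenizerCsv.split(line) on either sample line from the question should give nine fields, with "Auxotrophies, carbon and" kept as one.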


1 Comment

+1. An existing CSV parser is the only reasonable answer for this task, unless it is homework. :)
1

I would think that this would be a multi-step process. First, find all the commas inside quotes in your original string and replace them with something like {comma}. You can do this with some regex. Then split the new string on the comma symbol (,). Finally, go through the resulting list and replace each {comma} with the comma symbol (,).
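Those three steps could look roughly like this. It is only a sketch: the class name PlaceholderSplitter and the pattern used to find quoted sections are assumptions, and any field that already contains the literal text {comma} will be changed by the final step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitter {
    private static final String PLACEHOLDER = "{comma}";
    // Matches a quoted section with no embedded quotes; group 1 is the text inside.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> split(String line) {
        // Step 1: inside each quoted section, swap commas for the placeholder
        // (the surrounding quotes are dropped at the same time).
        Matcher m = QUOTED.matcher(line);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            String inner = m.group(1).replace(",", PLACEHOLDER);
            m.appendReplacement(masked, Matcher.quoteReplacement(inner));
        }
        m.appendTail(masked);

        // Step 2: split on the remaining commas; -1 keeps trailing empty fields.
        List<String> fields = new ArrayList<>();
        for (String part : masked.toString().split(",", -1)) {
            // Step 3: put the real commas back.
            fields.add(part.replace(PLACEHOLDER, ","));
        }
        return fields;
    }
}
```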

1 Comment

What if the text in one of the CSV columns contains the literal '{comma}'? Replacing the comma doesn't actually solve the problem. It might make the problem less likely, but this is a hack.
