I am trying to parse the following string:

 s1 = """ "foo","bar", "foo,bar" """

The output I am hoping for is:

 List ["foo","bar","foo,bar"] length 3

I am able to parse the following

s2 = """ "foo","bar", 'foo,bar' """

By using the following pattern

pattern = "(('[^']*')|([^,]+))"
re.findall(pattern,s2)
gives [('foo', '', 'foo'), ('bar', '', 'bar'), ("'foo,bar'", "'foo,bar'", '')]

But I am not able to figure out the pattern for s1. Note that I need to parse both s1 and s2 successfully.

Edit

The current pattern supports strings like:

   "foo,bar,foo bar" => [foo,bar,foo bar]
   "foo,bar,'foo bar'" => ["foo","bar",'foo bar']
   "foo,bar,'foo, bar'" => [foo,bar, 'foo, bar'] #length 3
  • @aliteralmind The beginning and end of the string literal Commented Apr 12, 2014 at 23:07
  • I use this: regex101.com/#python Commented Apr 12, 2014 at 23:08
  • You posted almost the same exact question, although for a different language (huh?) an hour ago. Commented Apr 12, 2014 at 23:17
  • @aliteralmind : Yepp.. I was trying in scala but gave it up and pivoted back to python :-/ Commented Apr 12, 2014 at 23:18
  • @Fraz: this (a csv-like reader) is an example of something which is easy to describe statefully but annoying to squeeze into a regex. Commented Apr 12, 2014 at 23:42

3 Answers


I think shlex (simple lexical analysis) is a much simpler solution here, since the regex gets too complicated. Specifically, I'd use:

>>> import shlex
>>> lex = shlex.shlex(""" "foo","bar", 'foo,bar' """, posix=True)
>>> lex.whitespace = ','        # Only comma will be a splitter
>>> lex.whitespace_split=True   # Split by any delimiter defined in whitespace
>>> [token.strip() for token in lex]   # Consume the lexer; strip any spaces left outside the quotes
['foo', 'bar', 'foo,bar']
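
Since the question needs both s1 and s2 to parse, the same lexer setup can be wrapped in a small helper. This is only a sketch: the name split_fields is made up for illustration, and the strip() call assumes that spaces around a field should be dropped (with the comma as the only splitter, spaces outside the quotes otherwise stay attached to the token).

```python
import shlex

def split_fields(text):
    """Split a comma-separated line, honouring '...' and "..." quoting."""
    lex = shlex.shlex(text, posix=True)
    lex.whitespace = ','          # split on commas only
    lex.whitespace_split = True   # everything else belongs to a field
    lex.commenters = ''           # assumption: '#' is ordinary data, not a comment
    return [token.strip() for token in lex]

print(split_fields(""" "foo","bar", "foo,bar" """))  # ['foo', 'bar', 'foo,bar']
print(split_fields(""" "foo","bar", 'foo,bar' """))  # ['foo', 'bar', 'foo,bar']
print(split_fields("foo,bar,foo bar"))               # ['foo', 'bar', 'foo bar']
```

An unbalanced quote makes shlex raise ValueError, so real input may need a try/except around the call.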

Edit:

I have a feeling that you're trying to read a csv file. Did you try import csv?
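
If the data really is csv-like, something along these lines might be enough for the double-quoted case (s1). It is a sketch, not a drop-in answer: parse_csv_line is a hypothetical helper name, and it assumes one record per string. A csv reader accepts a single quotechar, so the mixed single/double quoting in s2 is not covered by this alone.

```python
import csv

def parse_csv_line(line):
    # skipinitialspace=True ignores the space that follows a comma, so a
    # quote after ", " is still recognised as opening a quoted field.
    # The line is stripped first so trailing spaces do not end up in the
    # last field.
    reader = csv.reader([line.strip()], skipinitialspace=True)
    return next(reader)

print(parse_csv_line(""" "foo","bar", "foo,bar" """))  # ['foo', 'bar', 'foo,bar']
```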

2 Comments

Pretty cool solution. I think you mean that lex is a generator though, and that's why we need to call list(). A list is an iterator.
@Haidro - I always thought that iterator was an object that allows you to iterate, and generator is a function that allows you to iterate (using yield). I changed it anyway.

Maybe you could use something like this:

>>> import re
>>> re.findall(r'["\'](.*?)["\']', s1)
['foo', 'bar', 'foo,bar']
>>> re.findall(r'["\'](.*?)["\']', s2)
['foo', 'bar', 'foo,bar']

This finds all the words inside of "..." or '...' and groups them.
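
One caveat: the character class accepts either kind of quote at both ends, so an opening " can be closed by a '. If that matters, a backreference can require the closing quote to match the opening one. A minimal sketch (QUOTED and quoted_parts are names invented here):

```python
import re

# Group 1 remembers which quote opened the string; \1 demands the same
# quote character to close it. Group 2 is the content.
QUOTED = re.compile(r'''(["'])(.*?)\1''')

def quoted_parts(text):
    return [m.group(2) for m in QUOTED.finditer(text)]

print(quoted_parts(""" "foo","bar", 'foo,bar' """))  # ['foo', 'bar', 'foo,bar']
print(quoted_parts(""" "it's", 'x' """))             # ["it's", 'x']
```

Like the pattern above, this only sees quoted strings; unquoted fields are ignored.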

9 Comments

@Haidro Thanks for the pattern, it works. But it fails on " foo,bar,'foobar' ". Is it possible to support this as well?
So some strings are not quoted? It would be quite different to have to capture unquoted strings.
@Haidro : I updated the use case a bit ... can we support those cases as well?
Well now we need to know the exact format of those unquoted words. Are they truly words (only alpha-numeric)?
@aliteralmind: yepp.. they are alphanumeric.. everything I entered are valid python strings?

This works:

(?:"([^"]+)"|'([^']+)')

Capture group 1 or 2 contains the desired output, so each element can be written as $1$2: exactly one of the two is always empty.
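
In Python, the $1$2 idea might look roughly like this. It is a sketch with the first pattern; the function name extract is made up for the example.

```python
import re

pattern = re.compile(r'''(?:"([^"]+)"|'([^']+)')''')

def extract(text):
    # findall returns one (group1, group2) tuple per match. The group that
    # did not take part in the match is '', so concatenating the two
    # leaves just the quoted content.
    return [g1 + g2 for g1, g2 in pattern.findall(text)]

print(extract(""" "foo","bar", "foo,bar" """))  # ['foo', 'bar', 'foo,bar']
print(extract(""" "foo","bar", 'foo,bar' """))  # ['foo', 'bar', 'foo,bar']
```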


Updated to the new requirements as in the comments to Haidro's answer:

(?:("[^"]+")|('[^']+')|(\w+))

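Each element is now $1$2$3. Note that in this version the quotes sit inside groups 1 and 2, so quoted elements keep their quotes.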
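
For the cases listed in the question's edit, \w+ would split an unquoted "foo bar" into two elements. One possible variation (not the pattern above, just a sketch under the assumption that unquoted fields may contain spaces but no quotes or commas) keeps the quotes outside the groups and lets the unquoted alternative run up to the next comma. TOKEN and split_mixed are illustrative names.

```python
import re

# Alternatives, in order: "double quoted", 'single quoted', bare text.
TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^,'"]+)''')

def split_mixed(text):
    fields = []
    for m in TOKEN.finditer(text):
        if m.lastindex == 3:
            # Bare field: drop surrounding spaces, skip whitespace-only
            # runs such as the gap between a comma and an opening quote.
            bare = m.group(3).strip()
            if bare:
                fields.append(bare)
        else:
            # Quoted field: group 1 or 2 holds the content without quotes.
            fields.append(m.group(m.lastindex))
    return fields

print(split_mixed("foo,bar,foo bar"))                # ['foo', 'bar', 'foo bar']
print(split_mixed("foo,bar,'foo, bar'"))             # ['foo', 'bar', 'foo, bar']
print(split_mixed(""" "foo","bar", "foo,bar" """))   # ['foo', 'bar', 'foo,bar']
```

A stray quote inside an unquoted field (for example it's) is not handled; at that point shlex or a small hand-written parser is probably the safer route.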
