I am trying to parse the following string

 s1 = """ "foo","bar", "foo,bar" """

The output I am hoping for is:

 List ["foo","bar","foo,bar"] length 3

I am able to parse the following

s2 = """ "foo","bar", 'foo,bar' """

By using the following pattern

pattern = "(('[^']*')|([^,]+))"
re.findall(pattern,s2)
gives [('foo', '', 'foo'), ('bar', '', 'bar'), ("'foo,bar'", "'foo,bar'", '')]

But I am not able to figure out the pattern for s1. Note that I need to parse both s1 and s2 successfully.

Edit

The current pattern supports strings like:

   "foo,bar,foo bar" => [foo,bar,foo bar]
   "foo,bar,'foo bar'" => ["foo","bar",'foo bar']
   "foo,bar,'foo, bar'" => [foo,bar, 'foo, bar'] #length 3
  • @aliteralmind The beginning and end of the string literal Commented Apr 12, 2014 at 23:07
  • I use this: regex101.com/#python Commented Apr 12, 2014 at 23:08
    You posted almost the same exact question, although for a different language (huh?) an hour ago. Commented Apr 12, 2014 at 23:17
  • @aliteralmind : Yepp.. I was trying in scala but gave it up and pivoted back to python :-/ Commented Apr 12, 2014 at 23:18
    @Fraz: this (a csv-like reader) is an example of something which is easy to describe statefully but annoying to squeeze into a regex. Commented Apr 12, 2014 at 23:42

3 Answers

I think that shlex (simple lexical analysis) is a much simpler solution here, since the regex gets too complicated. Specifically, I'd use:

>>> import shlex
>>> lex = shlex.shlex(""" "foo","bar", 'foo,bar' """, posix=True)
>>> lex.whitespace = ','        # Only comma will be a splitter
>>> lex.whitespace_split=True   # Split by any delimiter defined in whitespace
>>> list(lex)                   # lex is an iterator, so collect its tokens into a list
['foo', 'bar', 'foo,bar']
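
For reuse, the same shlex setup could be wrapped in a small helper. This is only a sketch: the name `split_fields` is made up for illustration, and it assumes that padding around a field is not significant. Because a space is no longer a separator once `whitespace` is overridden, stray spaces next to a quoted field may end up in a token, so the helper strips each one.

```python
import shlex


def split_fields(text):
    """Split comma-separated text, honouring '...' and "..." quoted fields."""
    lex = shlex.shlex(text, posix=True)
    lex.whitespace = ','          # only a comma separates fields
    lex.whitespace_split = True   # split on nothing but the chars in `whitespace`
    lex.commenters = ''           # treat '#' as ordinary text, not a comment start
    # Assumption: spaces around a field are padding, so strip them.
    return [token.strip() for token in lex]


for sample in (''' "foo","bar", "foo,bar" ''',
               ''' "foo","bar", 'foo,bar' ''',
               "foo,bar,'foo, bar'"):
    print(split_fields(sample))
```

The `strip()` call also removes whitespace at the edges of a quoted field, so drop it if that whitespace matters.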

Edit:

I have a feeling that you're trying to read a csv file. Did you try import csv?
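
If the data really is CSV-like, a minimal sketch with the csv module might look like the following. The helper name `parse_csv_line` is hypothetical, and the sketch assumes double quotes are the only quote character: `csv.reader` accepts a single `quotechar`, so it covers s1 but not the mixed quoting in s2.

```python
import csv


def parse_csv_line(line):
    """Parse one line of double-quoted, comma-separated fields."""
    # strip() drops the padding at both ends of the line;
    # skipinitialspace=True ignores spaces right after a delimiter.
    reader = csv.reader([line.strip()], skipinitialspace=True)
    return next(reader)


print(parse_csv_line(''' "foo","bar", "foo,bar" '''))
```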


2 Comments

Pretty cool solution. I think you mean that lex is a generator though, and that's why we need to call list(). A list is an iterator.
@Haidro - I always thought that iterator was an object that allows you to iterate, and generator is a function that allows you to iterate (using yield). I changed it anyway.
Maybe you could use something like this:

>>> re.findall(r'["\'](.*?)["\']', s1)
['foo', 'bar', 'foo,bar']
>>> re.findall(r'["\'](.*?)["\']', s2)
['foo', 'bar', 'foo,bar']

This finds all the words inside of "..." or '...' and groups them.
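
One possible refinement, not part of the answer above, is to require the closing quote to be the same character as the opening one by using a backreference. The sketch below assumes every field is quoted; `QUOTED` and `quoted_fields` are illustrative names.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
QUOTED = re.compile(r'''(["'])(.*?)\1''')


def quoted_fields(text):
    """Return the contents of every '...' or "..." field in text."""
    return [match.group(2) for match in QUOTED.finditer(text)]


print(quoted_fields(''' "foo","bar", 'foo,bar' '''))
print(quoted_fields('''"it's", 'a "b"' '''))
```

With the backreference, an apostrophe inside a double-quoted field no longer ends the match early.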

9 Comments

@Hairdo Thanks for the pattern. it works.. but it fails at " foo,bar,'foobar' " Is it possible to support this as well?
So some strings are not quoted? It would be quite different to have to capture unquoted strings.
@Haidro : I updated the use case a bit ... can we support those cases as well?
Well now we need to know the exact format of those unquoted words. Are they truly words (only alpha-numeric)?
@aliteralmind: yepp.. they are alphanumeric.. everything I entered are valid python strings?
This works:

(?:"([^"]+)"|'([^']+)')

Capture group 1 or 2 contains the desired output. Each element can therefore be written as $1$2, because exactly one of the two groups is always empty.


Updated to the new requirements as in the comments to Haidro's answer:

(?:("[^"]+")|('[^']+')|(\w+))

Each element is now $1$2$3.
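
In Python, the $1$2$3 idea could be sketched as follows, using the updated pattern as written. `FIELD` and `fields` are names chosen for illustration. `re.findall` returns a tuple of the three groups for each match, with empty strings for groups that did not take part, so joining each tuple gives the matched element.

```python
import re

# The updated pattern from above: double-quoted, single-quoted, or bare word.
FIELD = re.compile(r'''(?:("[^"]+")|('[^']+')|(\w+))''')


def fields(text):
    """Return each matched element, i.e. the concatenation $1$2$3."""
    # Exactly one group is non-empty per match, so join() returns that group.
    return ["".join(groups) for groups in FIELD.findall(text)]


print(fields("foo,bar,'foo, bar'"))
print(fields(''' "foo","bar", "foo,bar" '''))
```

Note that this pattern keeps the surrounding quotes inside the captured text, and that `\w+` stops at a space, so an unquoted `foo bar` comes back as two elements.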
