How to parse a string using regex?

Question

I am trying to parse the following string

 s1 = """ "foo","bar", "foo,bar" """

The output I am hoping for is:

 List ["foo","bar","foo,bar"] length 3

I am able to parse the following

s2 = """ "foo","bar", 'foo,bar' """

By using the following pattern

pattern = "(('[^']*')|([^,]+))"
re.findall(pattern,s2)
gives [('foo', '', 'foo'), ('bar', '', 'bar'), ("'foo,bar'", "'foo,bar'", '')]

But I am not able to figure out the pattern for s1. Note that I need to parse both s1 and s2 successfully.

Edit
   The current pattern supports strings like:
   "foo,bar,foo bar" => [foo,bar,foo bar]
   "foo,bar,'foo bar'" => ["foo","bar",'foo bar']
    "foo,bar,'foo, bar'" => [foo,bar, 'foo, bar'] #length 3

You posted almost the same exact question, although for a different language (huh?), an hour ago.
– aliteralmind, Commented Apr 12, 2014 at 23:17
@aliteralmind: Yep, I was trying in Scala but gave it up and pivoted back to Python :-/
– frazman, Commented Apr 12, 2014 at 23:18
@Fraz: this (a csv-like reader) is an example of something which is easy to describe statefully but annoying to squeeze into a regex.
– DSM, Commented Apr 12, 2014 at 23:42

tmrlvi · Accepted Answer · 2014-04-12 23:44:51Z

4

I think that shlex (simple lexical analysis) is a much simpler solution here, since the regex gets too complicated. Specifically, I'd use:

>>> import shlex
>>> lex = shlex.shlex(""" "foo","bar", 'foo,bar' """, posix=True)
>>> lex.whitespace = ','          # Only a comma acts as a separator
>>> lex.whitespace_split = True   # Split only on the characters listed in whitespace
>>> [tok.strip() for tok in lex]  # lex is an iterator; strip spaces left around tokens
['foo', 'bar', 'foo,bar']
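
If the same splitting is needed in several places, the snippet above can be wrapped in a small helper. This is only a sketch: `split_fields` is a hypothetical name (not part of `shlex`), and it strips every token, which assumes that spaces at the edges of a field are never significant, even inside quotes.

```python
import shlex


def split_fields(text):
    """Split comma-separated text, honoring single and double quotes.

    Hypothetical helper built on shlex. Assumes spaces at the edges of a
    field are not significant, because every token is stripped.
    """
    lex = shlex.shlex(text, posix=True)
    lex.whitespace = ","          # only commas separate fields
    lex.whitespace_split = True   # split on nothing but the separator
    lex.commenters = ""           # treat '#' as ordinary text, not a comment
    return [token.strip() for token in lex]


print(split_fields(""" "foo","bar", "foo,bar" """))
print(split_fields(""" "foo","bar", 'foo,bar' """))
print(split_fields("foo,bar,'foo, bar'"))
```

Each call should return three fields, including for the unquoted inputs listed in the question's edit.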

Edit:

I have a feeling that you're trying to read a CSV file. Did you try the csv module (import csv)?
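
For reference, a minimal sketch of the csv route, assuming the input looks like s1 from the question. Note that `csv.reader` accepts a single `quotechar` (a double quote by default), so it handles s1 but would not treat the single-quoted field in s2 as quoted.

```python
import csv

s1 = """ "foo","bar", "foo,bar" """

# skipinitialspace=True ignores the space that follows a delimiter;
# strip() removes the padding around the whole line first.
reader = csv.reader([s1.strip()], skipinitialspace=True)
fields = next(reader)
print(fields)
```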

edited Apr 12, 2014 at 23:44

answered Apr 12, 2014 at 23:37

tmrlvi

2 Comments

TerryA Over a year ago

Pretty cool solution. I think you mean that lex is a generator though, and that's why we need to call list(). A list is an iterator.

tmrlvi Over a year ago

@Haidro - I always thought that iterator was an object that allows you to iterate, and generator is a function that allows you to iterate (using yield). I changed it anyway.

TerryA · Accepted Answer · 2014-04-12 23:09:41Z

2

Maybe you could use something like this:

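>>> import re
>>> s1 = """ "foo","bar", "foo,bar" """
>>> s2 = """ "foo","bar", 'foo,bar' """
>>> re.findall(r'["\'](.*?)["\']', s1)   # no "|" inside a character class; it would match a literal pipe
['foo', 'bar', 'foo,bar']
>>> re.findall(r'["\'](.*?)["\']', s2)
['foo', 'bar', 'foo,bar']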

This finds all the words inside of "..." or '...' and groups them.
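
One caveat: a pattern of this shape lets a field opened with one kind of quote be closed by the other kind. A possible variant, sketched here with a backreference so the closing quote must match the opening one (`QUOTED` and `quoted_fields` are hypothetical names used only for illustration):

```python
import re

# Group 1 remembers which quote opened the field; \1 requires the same
# character to close it. Group 2 is the field content.
QUOTED = re.compile(r"""(["'])(.*?)\1""")


def quoted_fields(text):
    """Return the contents of every quoted field in text."""
    return [match.group(2) for match in QUOTED.finditer(text)]


print(quoted_fields(""" "foo","bar", 'foo,bar' """))
print(quoted_fields(""" "it's", 'say "hi"' """))
```

Like the pattern above, this only finds quoted fields; unquoted words are ignored.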

answered Apr 12, 2014 at 23:09

TerryA

9 Comments

frazman Over a year ago

@Haidro Thanks for the pattern, it works, but it fails on " foo,bar,'foobar' ". Is it possible to support this as well?

aliteralmind Over a year ago

So some strings are not quoted? It would be quite different to have to capture unquoted strings.

frazman Over a year ago

@Haidro : I updated the use case a bit ... can we support those cases as well?

aliteralmind Over a year ago

Well now we need to know the exact format of those unquoted words. Are they truly words (only alpha-numeric)?

frazman Over a year ago

@aliteralmind: yepp.. they are alphanumeric.. everything I entered are valid python strings?

aliteralmind · Accepted Answer · 2014-04-12 23:39:24Z

1

This works:

(?:"([^"]+)"|'([^']+)')

Capture group 1 or 2 contains the desired output. So each element could be $1$2, because exactly one of them will always be empty.

Updated to the new requirements as in the comments to Haidro's answer:

(?:("[^"]+")|('[^']+')|(\w+))

Each element is now $1$2$3.
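
In Python, the "$1$2$3" idea can be sketched by joining the groups that `re.findall` returns for each match. The version below is a variant, not the exact pattern above: it keeps the quote characters outside the capture groups so they are dropped from the result, and `FIELD` and `parse_fields` are hypothetical names.

```python
import re

FIELD = re.compile(r"""
    "([^"]+)"      # double-quoted field, quotes excluded from the group
  | '([^']+)'      # single-quoted field, quotes excluded from the group
  | (\w+)          # bare alphanumeric word
""", re.VERBOSE)


def parse_fields(text):
    """Join the three groups of each match; only one of them is non-empty."""
    return ["".join(groups) for groups in FIELD.findall(text)]


print(parse_fields(""" "foo","bar", "foo,bar" """))
print(parse_fields(""" "foo","bar", 'foo,bar' """))
print(parse_fields("foo,bar,'foo, bar'"))
```

Because unquoted fields are matched with `\w+`, an unquoted value containing a space, such as `foo bar`, comes back as two separate elements.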

edited Apr 12, 2014 at 23:39

answered Apr 12, 2014 at 23:12

aliteralmind
