I have a regex like this

r"^(.*?),(.*?)(,.*?=.*)"

And a string like this

name1,value1,tag11=value11,tag12=value12,tag13=value13

I am trying to check, using a regex, whether the string follows this format: a name and a value separated by a comma, followed by comma-separated name=value pairs.

I need then to extract the comma-separated data using a regex.

The first group matches name1, the second group matches value1, and the third group matches everything from tag11 through value13 (because the trailing .* is greedy).

But I want to match each name/value pair separately. I am new to Python and not sure how to achieve this.
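
For reference, the behaviour described above can be reproduced with a minimal sketch like the following. The variable names `s` and `m` are only illustrative; the pattern and the input are the ones quoted in the question.

```python
import re

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'

# The first two groups are lazy, so each stops at the first comma that lets
# the rest of the pattern match. The third group ends in a greedy .* and
# therefore swallows every remaining tag=value pair in one go.
m = re.match(r"^(.*?),(.*?)(,.*?=.*)", s)

print(m.groups())
# ('name1', 'value1', ',tag11=value11,tag12=value12,tag13=value13')
```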

  • Regex like the following might be helpful: ((?<name>\w+),(?<value>\w+))|(?<name>\w+)=(?<value>\w+) (tested on RegExr sans the named capture groups). Commented Jan 5, 2017 at 10:21

3 Answers

Turns out Python, unlike .NET, doesn't support repeated named capture groups, which is a bit of a shame (it means my solution is a little longer than I thought it would need to be). Does this meet your requirements?

import re

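def is_valid(s):
    # Raw string so that \d reaches the regex engine unchanged
    pattern = r'^name\d+,value\d+(,tag\d+=value\d+)*$'
    # Returns a match object (truthy) or None (falsy)
    return re.match(pattern, s)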

def get_name_value_pairs(s):
    if not is_valid(s):
        raise ValueError('Invalid input: {}'.format(s))

    # Either the leading "name,value" pair or a later "name=value" pair
    pattern = r'((?P<name1>\w+),(?P<value1>\w+))|(?P<name2>\w+)=(?P<value2>\w+)'
    for match in re.finditer(pattern, s):
        name1 = match.group('name1')
        name2 = match.group('name2')
        value1 = match.group('value1')
        value2 = match.group('value2')

        if name1 and value1:
            yield name1, value1
        elif name2 and value2:
            yield name2, value2

if __name__ == '__main__':
    testString = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
    assert not is_valid('')
    assert not is_valid('foo')
    assert is_valid(testString)

    print(list(get_name_value_pairs(testString)))

Output

[('name1', 'value1'), ('tag11', 'value11'), ('tag12', 'value12'), ('tag13', 'value13')]

Edit 1

Added input validation logic. Assumptions made:

  • Must have initial name/value pair in form name<x>,value<x>
  • All following pairs must be in form tag<x>=value<x>
  • Names and values consist only of alphanumeric characters
  • Whitespace is not allowed

Note that I'm not currently validating that x is the same value within a name/value pair, which I assume is a requirement. I'm not sure how to do this, so I'm leaving it as an exercise for the reader.
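
One possible way to approach that exercise is with backreferences. This is only a sketch under the assumptions listed above (name<x>,value<x> followed by tag<y>=value<y> pairs); `STRICT_PATTERN` and `is_strictly_valid` are hypothetical names that are not part of the answer's code.

```python
import re

# \1 must repeat the digits captured after "name"; \2 must repeat the digits
# captured after "tag" in the same iteration of the repeated group.
STRICT_PATTERN = re.compile(r'name(\d+),value\1(?:,tag(\d+)=value\2)*')

def is_strictly_valid(s):
    """Return True only if each pair uses the same number on both sides."""
    return STRICT_PATTERN.fullmatch(s) is not None

print(is_strictly_valid('name1,value1,tag11=value11,tag12=value12'))  # True
print(is_strictly_valid('name1,value2'))                              # False
print(is_strictly_valid('name1,value1,tag11=value12'))                # False
```

Inside a repeated group, a backreference refers to the most recent capture, which is why `\2` can check each tag pair in turn.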


2 Comments

Your solution helps me, but I need to validate the format of the string. It should be in the format name1, value1, tag-1=value-1, tag-2=value-2 ... tag-n=value-n. How can I achieve this?
@MohanRaj I've added validation logic. I don't really understand what you're doing so I've made assumptions about what exactly determines if a string is valid or not, but I've listed my assumptions and you can tweak as needed.

Why not just split by the commas:

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
print(s.split(','))

If you want to use regex it's just as simple using the pattern:

[^,]+

Example:

https://regex101.com/r/jS6fgW/1
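
In Python, the pattern above can be applied with `re.findall`. The following is a minimal sketch; the names `fields` and `pairs`, and the extra step that splits the tag=value items on `=`, are assumptions added for illustration.

```python
import re

s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'

# Every maximal run of non-comma characters becomes one field
fields = re.findall(r'[^,]+', s)
print(fields)
# ['name1', 'value1', 'tag11=value11', 'tag12=value12', 'tag13=value13']

# Assumption: everything after the first two fields is a tag=value item
pairs = [tuple(field.split('=', 1)) for field in fields[2:]]
print(pairs)
# [('tag11', 'value11'), ('tag12', 'value12'), ('tag13', 'value13')]

# Unlike str.split, findall with [^,]+ silently skips empty fields
print('a,,b'.split(','))            # ['a', '', 'b']
print(re.findall(r'[^,]+', 'a,,b'))  # ['a', 'b']
```

Neither variant validates the input, so a separate check is still needed if malformed strings must be rejected.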


First, validate the format against your pattern, then split with the [,=] regex (which matches , and =) and convert the result to a dictionary like this:

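import itertools, re
s = 'name1,value1,tag11=value11,tag12=value12,tag13=value13'
# re.match anchors at the start of the string; $ anchors the end
if re.match(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$', s):
    l = re.split(r"[=,]", s)
    # Pair up consecutive items: (name1, value1), (tag11, value11), ...
    # zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
    d = dict(itertools.zip_longest(*[iter(l)] * 2, fillvalue=""))
    print(d)
else:
    print("Not valid!")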

See the Python demo

The pattern is

^[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$

Details:

  • ^ - start of string (with re.match this can be omitted, since re.match only matches at the start of the string)
  • [^,=]+ - 1+ chars other than = and ,
  • , - a comma
  • [^,=]+ - 1+ chars other than = and ,
  • (?:,[^,=]+=[^,=]+)+ - 1 or more sequences of:
    • , - comma
    • [^,=]+ - 1+ chars other than = and ,
    • = - an equal sign
    • [^,=]+ - 1+ chars other than = and ,
  • $ - end of string.
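
The breakdown above can be checked against a few inputs. This is a sketch only: `PAIR_LIST` and `parse` are hypothetical names, and `re.fullmatch` is used here in place of the explicit `^` and `$` anchors.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch must consume
# the whole string anyway
PAIR_LIST = re.compile(r'[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+')

def parse(s):
    """Return (name, value, tags) for a valid string, or None otherwise."""
    if PAIR_LIST.fullmatch(s) is None:
        return None
    name, value, *rest = s.split(',')
    # Validation guarantees each remaining item contains exactly one '='
    tags = dict(item.split('=') for item in rest)
    return name, value, tags

print(parse('name1,value1,tag11=value11,tag12=value12'))
# ('name1', 'value1', {'tag11': 'value11', 'tag12': 'value12'})
print(parse('name1,value1'))        # None: the + requires at least one tag pair
print(parse('name1,value1,tag11'))  # None: the last item has no '='
```

If a string with no tag pairs should also be accepted, the final `+` quantifier can be changed to `*`.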

3 Comments

It helps me, but I also need to validate the format of the input string. It should be in the format name1, value1, tag-1=value-1, tag-2=value-2 ... tag-n=value-n. How can I achieve this?
You need to clarify: can the comma-separated name,value appear somewhere inside the string, or at its end? Maybe ^(?:\w+[=,]\w+)+$ is enough and will do? Or if you really have a single comma-separated name-value at the start and then =-separated ones, use ^[^,=]+,[^,=]+(?:,[^,=]+=[^,=]+)+$.
It should be only at the beginning of the string.
