
I'm working with some marked-up text and I need to extract information from it for later use. I want to use regular expressions through Python's re module, but I can't construct the right expression. I have two situations:

  • Text in the format string="{some text}{other text 1}{other text 2}". Here I use the regexp "\\{(.*?)\\}" but I obtain:

    >> import re
    >> string="{some text}{other text 1}{other text 2}"
    >> elements = re.split("\\{(.*?)\\}",string)
    >> print(elements)
    ['', 'some text', '', 'other text 1', '', 'other text 2', '']
    

    I can't understand why the empty strings appear in positions 0, 2, 4 and 6. If I edit my original string to string="}{some text}{other text 1}{other text 2}{" and use the regexp "\\}\\{(.*?)\\}\\{" I obtain

    >> string="}{some text}{other text 1}{other text 2}{"
    >> elements = re.split("\\}\\{(.*?)\\}\\{",string)
    >> print(elements)
    ['', 'some text', 'other text 1', 'other text 2', '']
    

    the internal empty strings in the output disappear, but not the first and last ones. How should I construct the regular expression in order to obtain only the elements inside the braces?

  • Text in the format string="some text {other text}". In this case I need to extract "some text" and also "other text". Here I don't know how to proceed.

Can someone help me, please?

  • Why did you expect otherwise? Split gives you what's between the matches and, if any, the groups captured from the matches: docs.python.org/3/library/re.html#re.split. It's like doing "1,2,3".split(",") and asking why there are numbers in the result. Commented 2 days ago
  • How about re.findall(r"\{([^}]+)\}",string)? Notice the r"" for a raw string, so you do not have to escape \. Also, [^}]+ is cleaner than .*? since it can never overstep a closing curly bracket. Commented 2 days ago
  • This question is similar to: What exactly is a "raw string regex" and how can you use it?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. Commented 2 days ago
  • @ti7 This is not about string escaping. Commented 2 days ago
  • I assume it was closed as they are asking 3 questions: 1. How to extract text between braces 2. Why does split produce empty strings 3. How to tokenize text between braces and not between braces. There are also many existing answers for all these questions, so voting as a duplicate would also be valid. Commented yesterday
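
To make the two comments about re.split and re.findall concrete, here is a minimal sketch (not from the original thread) of where the empty strings come from. The variable names parts, gaps and inner are only illustrative. re.split returns the text between matches, and a capturing group adds the captured text in between those pieces; when the matches sit back to back and touch both ends of the string, every in-between piece is empty.

```python
import re

text = "{some text}{other text 1}{other text 2}"

# With a capturing group: in-between pieces interleaved with the captures.
parts = re.split(r"\{(.*?)\}", text)
print(parts)  # ['', 'some text', '', 'other text 1', '', 'other text 2', '']

# Without the group: only the (empty) in-between pieces remain.
gaps = re.split(r"\{.*?\}", text)
print(gaps)  # ['', '', '', '']

# Plain-string analogue: separators at the edges give empty edge pieces.
print(",a,,b,".split(","))  # ['', 'a', '', 'b', '']

# Asking for the matches themselves avoids the empty pieces entirely.
inner = re.findall(r"\{([^}]+)\}", text)
print(inner)  # ['some text', 'other text 1', 'other text 2']
```

The empty strings at positions 0, 2, 4 and 6 are therefore the (empty) text before, between and after the three brace groups.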

1 Answer

A good strategy is often to use raw strings (r"", worth considering for practically every regex) and re.escape() for complicated literal inputs, building your regex in parts if you need to:

>>> import re
>>> s = re.escape(r"a [complex]* string's literal value") + ", " + r"exact.* or escaped\?"
>>> print(s)
a\ \[complex\]\*\ string's\ literal\ value, exact.* or escaped\?

Then use re.findall() or re.finditer() to get every match.

.findall() directly returns a list of strings (the entire match, or the group's text if the pattern has a capturing group; with several groups it returns tuples), while .finditer() returns an iterator that yields Match instances. Either one can be the more useful, depending on the situation:

>>> data = "}{some text}{other text 1}{other text 2}{"
>>> RE_braces = re.compile(r"\{([^\}]+)\}")  # { group(not '}', 1 or more) }
>>> RE_braces.findall(data)  # match -> group -> string
['some text', 'other text 1', 'other text 2']
>>> next(RE_braces.finditer(data))  # for match in RE.finditer(): ...
<re.Match object; span=(1, 12), match='{some text}'>
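
As a hedged illustration of the difference described above (a sketch that reuses the answer's data and pattern; the names texts and located are made up for this example), iterating over .finditer() gives access to positions as well as the captured text:

```python
import re

data = "}{some text}{other text 1}{other text 2}{"
RE_braces = re.compile(r"\{([^}]+)\}")

# findall: only the captured strings.
texts = RE_braces.findall(data)

# finditer: one Match object per hit, so the span of each hit is available too.
located = [(m.group(1), m.span()) for m in RE_braces.finditer(data)]

for inner, (start, end) in located:
    print(f"{inner!r} found at {start}..{end}")
```

If only the strings are needed, .findall() is the shorter route; .finditer() pays off when the offsets or several groups per match matter.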

6 Comments

Raw strings and regex escaping are not part of this question. Since you point it out, the target string should be made raw as well. And it's a distraction to escape curly braces \{ \} when they are not in the form of a range quantifier {m,n}. Your regex should be in this form: re.findall(r"{([^\}]+)}", r"}{some text}{other text 1}{other text 2}{"), if you follow your own advice. Also, you should either provide a solution to part 2 of his question or note that you are omitting it.
The curly bracket is escaped in the character class. You have been distracted.
Indeed ... ... . {([^}]+)}
@sin Raw strings are just an aid for string creation, and I highly recommend starting with good escaping here to help cut down the places where bugs can hide! You may be conflating them with bytes inputs b"", where the pattern and the target do need to match (re.match("a", b"a") raises TypeError; convert with .encode()/.decode()). Further, I think providing good technique here is more important than an overly specific answer, as there is not enough of the text body to completely understand their problem. Still, perhaps they would be satisfied with list(filter(None, re.split(r"[\}\{]", s)))
I'm just on the regex side, not interested in absorbing language ambiguities in string escaping. That is not the topic here. You should check out the String tag if that's your focus here.
There is plenty of information to provide answers to both parts of his question. It's all about regex, not any filters.
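
For the second format in the question ("some text {other text}"), which the thread leaves open, here are two possible sketches. Both are assumptions about what the asker wants, and the helper names split_on_braces and tokenize are hypothetical. The first follows the filter idea from the comment above; the second is a pure-regex alternative that also records whether each piece was inside braces.

```python
import re

def split_on_braces(text):
    """Split on either brace, trim whitespace, and drop empty pieces."""
    pieces = re.split(r"[{}]", text)
    return [p.strip() for p in pieces if p.strip()]

def tokenize(text):
    """Label each piece as 'braced' or 'plain' (labels chosen for this sketch)."""
    tokens = []
    # Alternative 1 captures text inside braces, alternative 2 text outside.
    for m in re.finditer(r"\{([^{}]*)\}|([^{}]+)", text):
        if m.group(1) is not None:
            tokens.append(("braced", m.group(1)))
        else:
            tokens.append(("plain", m.group(2).strip()))
    return tokens

print(split_on_braces("some text {other text}"))
# ['some text', 'other text']
print(split_on_braces("{some text}{other text 1}{other text 2}"))
# ['some text', 'other text 1', 'other text 2']
print(tokenize("some text {other text}"))
# [('plain', 'some text'), ('braced', 'other text')]
```

Neither sketch handles nested or unbalanced braces; that would need a clearer description of the input than the question gives.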
