Python regex replace substrings inside strings

Question

I have a string like this:

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:

# None of these work; they typically replace only the first or last occurrence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: All occurrences of < between CODE and CODE to be replaced with _LT_.

After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.

Any help is tremendously appreciated!

Thanks in advance.

preferably the answer would provide a regex that can produce the expected output? — Aaron Meier, Commented Jun 6, 2021 at 16:34
A function or regex is not described by one input value (your string) and one output value (the expected output). Please provide a more general description of what you want. — René Pijl, Commented Jun 6, 2021 at 16:36
@RenéPijl are you saying that the question isn't clear? I'll update it. — Aaron Meier, Commented Jun 6, 2021 at 16:37
@RenéPijl updated the question to include what I'm after: All occurrences of < between CODE and CODE to be replaced with _LT_. — Aaron Meier, Commented Jun 6, 2021 at 16:40

S-c-r-a-t-c-h-y · Accepted Answer · 2021-06-06 17:05:06Z

3

Here is my answer:

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

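parts = text.split('CODE')
# Odd-indexed parts sit between a pair of CODE markers; escape only those.
for i in range(1, len(parts), 2):
    parts[i] = parts[i].replace('>', '_GT_').replace('<', '_LT_')

# split() drops the markers, so rejoin with 'CODE' to keep them in the output.
output = 'CODE'.join(parts)

print(output)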

With this solution, every code block is escaped and added to the output. It does not use regex, but it works.
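The same split-based idea can be packaged as a function. This is only a sketch, not the answerer's code: the name escape_between_markers and the marker parameter are hypothetical, and it assumes that text after a marker with no closing partner should be left untouched.

```python
def escape_between_markers(text, marker="CODE"):
    """Escape < and > only in text enclosed by a pair of markers."""
    parts = text.split(marker)
    # Odd-indexed parts follow an opening marker. The last part is never
    # treated as a block, so an unclosed marker leaves its text untouched.
    for i in range(1, len(parts) - 1, 2):
        parts[i] = parts[i].replace("<", "_LT_").replace(">", "_GT_")
    # split() drops the markers, so put them back when joining.
    return marker.join(parts)


print(escape_between_markers("a<CODE<b>CODE<c"))
```

Splitting once and joining with the marker keeps the CODE lines in the result and avoids re-splitting the string on every loop iteration.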

answered Jun 6, 2021 at 17:05


That's a pretty good solution, thanks very much! I was hoping for a regex version, but it's looking like it's not possible.
– Aaron Meier
Commented Jun 6, 2021 at 17:17


Ryszard Czech · Accepted Answer · 2021-06-06 19:59:34Z

With regex:

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof.

Results:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'
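As a rough illustration of the pattern explained above, the callback can be pulled out into a named function and the pattern compiled once. The names CODE_BLOCK, _escape and escape_with_regex are hypothetical, and re.DOTALL is used here as an equivalent of the inline (?s) flag.

```python
import re

# Compiled once; re.DOTALL lets "." cross newlines, like the inline (?s) flag.
CODE_BLOCK = re.compile(r"CODE.*?CODE", re.DOTALL)


def _escape(match):
    # Called once per matched block; its return value replaces the block.
    return match.group().replace("<", "_LT_").replace(">", "_GT_")


def escape_with_regex(text):
    return CODE_BLOCK.sub(_escape, text)


sample = "<b>keep</b>\nCODE\n<i>x</i>\nCODE\n<b>keep</b>\n"
print(escape_with_regex(sample))
```

The lazy .*? stops at the nearest closing CODE, so markers pair up left to right, and passing a function as the replacement lets a single re.sub call rewrite every < and > inside each block.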

Ivee Nara · Accepted Answer · 2021-06-06 17:50:16Z

0

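I'll update this answer in a few minutes with a regex-only solution but, meanwhile... wouldn't splitting the string and then joining the pieces work?

# regex, value and flags are placeholders; flags must be passed by keyword,
# because the fourth positional argument of re.sub is count.
re.sub(regex, value, text.split("CODE\n")[1], flags=flags)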

EDIT! I found the answer, but it's a little bit hacky. You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

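This works for text placed between two "CODE" markers that each sit on their own line, as long as the text in between contains none of the characters C, O, D, E, ( or ). Note that [^(CODE)] is a character class that excludes those individual characters, not the word "CODE", so any "CODE" string between the markers also breaks the match.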
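To make the limits of that pattern concrete, here is a small sketch comparing it with one possible alternative. The alternative is an assumption, not part of the answer: it treats "CODE" as a marker only when it is an entire line, using re.MULTILINE anchors. The names ORIGINAL, ANCHORED, escape and the two sample strings are hypothetical.

```python
import re

# The pattern from the answer above, unchanged.
ORIGINAL = '\nCODE\n[^(CODE)]*\nCODE\n'
# Hypothetical alternative: a marker counts only when "CODE" is a whole line.
ANCHORED = re.compile(r'^CODE$.*?^CODE$', re.DOTALL | re.MULTILINE)


def escape(match):
    # Escape angle brackets inside one matched block.
    return match.group(0).replace('<', '_LT_').replace('>', '_GT_')


lowercase_block = "\nCODE\n<b>replace this</b>\nCODE\n"
uppercase_block = "\nCODE\n<b>Done</b>\nCODE\n"

# The original pattern handles a block without C, O, D, E or parentheses.
print(re.sub(ORIGINAL, escape, lowercase_block))
# An uppercase "D" stops the character class, so nothing is replaced.
print(re.sub(ORIGINAL, escape, uppercase_block) == uppercase_block)
# The anchored pattern does not care which letters the block contains.
print(ANCHORED.sub(escape, uppercase_block))
```

With re.MULTILINE, ^ and $ match at line boundaries, so a "CODE" that appears in the middle of a line is treated as ordinary text rather than as a marker.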

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

edited Jun 6, 2021 at 17:50

answered Jun 6, 2021 at 16:44


Yeah looks like this will be the way to handle it I think.
– Aaron Meier
Commented Jun 6, 2021 at 17:17
Thanks by the way! If you can come up with a regex I'd love to see that. Seems like a hard problem though.
– Aaron Meier
Commented Jun 6, 2021 at 17:26

@AaronMeier I edited the comment :) The search 'python replace substring between two characters' on Google gave me that link stackoverflow.com/a/11096811/8665327
– Ivee Nara
Commented Jun 6, 2021 at 17:52
