1

I have a string like this:

import re

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

And the expected output is:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

Here's a small subset of what I've tried:

# None of these work, and typically only replace the first or last occurrence of <
re.sub(r'(?<=CODE)<(?=CODE)', r'_LT_', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)<(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(?<=CODE)(.*?)[<]*(.*?)(?=CODE)', r'\1_LT_\2', text, flags=re.DOTALL|re.MULTILINE)
re.sub(r'(CODE.*?)<(.*?CODE)', r'\1_LT_\2', text, flags=re.DOTALL)
re.sub(r'(CODE.*)<(.*CODE)', r'\1_LT_\2', text, flags=re.DOTALL)

What I'd like to happen: all occurrences of < between CODE and CODE replaced with _LT_ (and, as in the expected output, > with _GT_).

After spending the day on stackoverflow and regex101.com, I'm starting to think either it's not possible or I'm not smart enough to handle this.

Any help is tremendously appreciated!

Thanks in advance.

6
  • What do you want the answer to be, exactly? Commented Jun 6, 2021 at 16:33
  • preferably the answer would provide a regex that can produce the expected output? Commented Jun 6, 2021 at 16:34
  • A function or regex is not described by one input value (your string) and one output value (the expected output). Please provide a more general description of what you want.
    – René Pijl
    Commented Jun 6, 2021 at 16:36
  • @RenéPijl are you saying that the question isn't clear? I'll update it. Commented Jun 6, 2021 at 16:37
  • @RenéPijl updated the question to include what I'm after: All occurrences of < between CODE and CODE to be replaced with _LT_. Commented Jun 6, 2021 at 16:40

3 Answers

3

Here is my answer:

text = """
Some stuff to keep <b>here</b>

CODE
<b>Replace gt and lt</b>
<i>inside <script>this</script> code</i>
CODE

Some more stuff to keep <b>here</b>
"""

parts = text.split('CODE')
for i in range(len(parts)):
    if i % 2:
        # Odd-indexed parts sit between a pair of CODE markers
        parts[i] = parts[i].replace('>', '_GT_').replace('<', '_LT_')

# split() drops the markers, so join() puts them back
output = 'CODE'.join(parts)

print(output)

With this solution, every code block is formatted and added to the output. It doesn't use a regex, but it works.
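
The split approach above relies on the CODE markers coming in pairs. As a rough sketch only, the same idea can be wrapped in a function that leaves a trailing, unclosed block untouched. The name `escape_code_blocks` and the `marker` parameter are hypothetical, and the handling of an unclosed block is an assumption about what would be wanted:

```python
def escape_code_blocks(text, marker="CODE"):
    """Escape < and > only in segments enclosed by a pair of markers."""
    parts = text.split(marker)
    # Odd-indexed parts follow an opening marker; stopping at len(parts) - 1
    # skips a trailing part whose block was never closed.
    for i in range(1, len(parts) - 1, 2):
        parts[i] = parts[i].replace("<", "_LT_").replace(">", "_GT_")
    # split() removes the markers, so join() restores them.
    return marker.join(parts)


print(escape_code_blocks("<a> CODE <b> CODE <c>"))  # <a> CODE _LT_b_GT_ CODE <c>
print(escape_code_blocks("<a> CODE <b>"))           # <a> CODE <b>
```

Joining with the marker keeps the CODE lines in the result, which a plain concatenation of the split parts would lose.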

1
  • That's a pretty good solution, thanks very much! I was hoping for a regex version, but it's looking like it's not possible. Commented Jun 6, 2021 at 17:17
2

With regex:

import re
text = "\nSome stuff to keep <b>here</b>\n\nCODE\n<b>Replace gt and lt</b>\n<i>inside <script>this</script> code</i>\nCODE\n\nSome more stuff to keep <b>here</b>\n"
pattern = r"(?s)CODE.*?CODE"
print(re.sub(pattern, lambda x: x.group().replace('<','_LT_').replace('>','_GT_'), text))

See Python proof.

Results:

Some stuff to keep <b>here</b>

CODE
_LT_b_GT_Replace gt and lt_LT_/b_GT_
_LT_i_GT_inside _LT_script_GT_this_LT_/script_GT_ code_LT_/i_GT_
CODE

Some more stuff to keep <b>here</b>

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  (?s)                     set flags for this block (with . matching
                           \n) (case-sensitive) (with ^ and $
                           matching normally) (matching whitespace
                           and # normally)
--------------------------------------------------------------------------------
  CODE                     'CODE'
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
--------------------------------------------------------------------------------
  CODE                     'CODE'
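
The "least amount possible" part of the breakdown above is what keeps each CODE ... CODE pair separate. A small sketch with a made-up sample string (not from the question) compares the lazy quantifier with a greedy one:

```python
import re

# Hypothetical sample with two separate CODE blocks
sample = "<a> CODE <b> CODE <c> CODE <d> CODE <e>"

# Lazy: each match stops at the nearest closing CODE
lazy = re.findall(r"(?s)CODE.*?CODE", sample)
# Greedy: one match runs from the first CODE to the last one
greedy = re.findall(r"(?s)CODE.*CODE", sample)

print(lazy)    # ['CODE <b> CODE', 'CODE <d> CODE']
print(greedy)  # ['CODE <b> CODE <c> CODE <d> CODE']
```

With the greedy form, the text between the two blocks (`<c>` here) would be escaped as well.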
0

I'll update this answer in a few minutes with a regex-only solution, but meanwhile: wouldn't splitting the string and then joining the pieces be a solution?

re.sub(regex, value, text.split("CODE\n")[1], flags=flags)

EDIT! I found the answer, but it's a little bit hacky. You can read the full description in this post: https://stackoverflow.com/a/11096811/8665327

Basically, the line you are looking for is this:

text = re.sub('\nCODE\n[^(CODE)]*\nCODE\n', lambda x: x.group(0).replace('<', '_LT_').replace('>', '_GT_'), text)

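This works for text placed between two "CODE" markers, each on its own line, as long as the text between them contains none of the characters (, C, O, D, E or ). Note that [^(CODE)] is a character class that excludes those individual characters, not the word "CODE".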

will_work = """
<title>This will work</title>
CODE
<b>Replace this</b>
CODE
"""

wont_work = """
CODE
<b>This won't work</b>CODE
CODE
"""

3
  • Yeah looks like this will be the way to handle it I think. Commented Jun 6, 2021 at 17:17
  • Thanks by the way! If you can come up with a regex I'd love to see that. Seems like a hard problem though. Commented Jun 6, 2021 at 17:26
  • 1
    @AaronMeier I edited the comment :) The search 'python replace substring between two characters' on Google gave me that link stackoverflow.com/a/11096811/8665327
    – Ivee Nara
    Commented Jun 6, 2021 at 17:52
