-1

I would like to replace all occurrences of several substrings using regular expressions. The original sentence looks like this:

mystring = "Carl's house is big. He is asking 1M for that(the house)."

Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:

substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"

At the end I want the original sentence like this:

mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."

The main problem is that, since I have several substrings to replace, they can overlap, as in the example above. If I replace the longest substring first, I get this:

Carl's **house** is big. He is asking 1M for that(**the **house****). 

On the other hand, if I replace the shortest substring first, I get this:

Carl's **house** is big. He is asking 1M for that(the **house**).

It seems I need to replace from the longest substring to the shortest, but I wonder how to make a piece of text that was matched in the first replacement be skipped in the second. Also remember that each substring can appear several times in the string.

Note: Assume the string ** never occurs in the original string, so we can use it to bold our words.

1
  • re.sub() can take a function for the repl argument. Create a pattern that matches your substrings, then create a function that takes a match object as an argument and returns that string modified however you want.
    – wwii
    Commented Dec 13, 2016 at 21:24
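
A minimal sketch of what this comment describes, assuming the two substrings from the question. The names `pattern` and `bold` are illustrative and not from the original post.

```python
import re

mystring = "Carl's house is big. He is asking 1M for that(the house)."

# One pattern that covers both substrings from the question.
pattern = re.compile(r"the house|house")

def bold(match):
    # match.group(0) is the full text of the current match.
    return "**" + match.group(0) + "**"

# re.sub calls bold() once per non-overlapping match, scanning left to right.
result = pattern.sub(bold, mystring)
print(result)
```

Because each part of the string is consumed by at most one match, text that has already been bolded is never examined again.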

4 Answers

3

You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:

re.sub(r"(house|the house)", r"**\1**", mystring)
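
With more than two substrings, the same idea could be generalized roughly as below. This is only a sketch: the helper name `bold_all` is hypothetical, the list of substrings is assumed to be non-empty, and the sorting step is an addition that matters when one substring is a prefix of another (for example "house" and "houses"), since the regex engine takes the first alternative that matches at a given position.

```python
import re

def bold_all(text, substrings):
    # Longest first, so a longer alternative wins when two start at the same position.
    ordered = sorted(substrings, key=len, reverse=True)
    # re.escape protects any regex metacharacters inside the substrings.
    alternatives = "|".join(re.escape(s) for s in ordered)
    # \1 in the replacement refers to the text captured by the single group.
    return re.sub("(" + alternatives + ")", r"**\1**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all(mystring, ["house", "the house"]))
```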
2
  • What does the \1 mean? What does | mean?
    – wwii
    Commented Dec 13, 2016 at 21:20
  • @wwii \1 refers to the first captured group; | is the alternation (or) operator, which lets the pattern match either house or the house
    – ashwinjv
    Commented Dec 13, 2016 at 21:29
1

You could use a group that is not captured and is not required. In the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " before "house" in the string; if it is present, it is included in the match. This way, you let the re library decide how to match. Here is the complete example:

>>> import re
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."

Note: \g<repl> in the replacement inserts the whole string matched by the named group repl.

0

You could do two passes:

First: Go through from longest to shortest and replace with something like:

  • 'the house': 'AA_THE_HOUSE'
  • 'house': 'BB_HOUSE'

Second: Go through replace like:

  • 'AA_THE_HOUSE': '**the house**'
  • 'BB_HOUSE': '**house**'
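
The two passes above could be automated along these lines. This is a sketch rather than a definitive implementation: the function name `bold_two_pass` is hypothetical, and the placeholders are built from the NUL character on the assumption that it never occurs in the text or in any substring.

```python
def bold_two_pass(text, substrings):
    # Longest first, so "the house" is hidden before "house" is searched for.
    ordered = sorted(substrings, key=len, reverse=True)
    # Assumption: "\x00" appears neither in the text nor in any substring.
    placeholders = [f"\x00{i}\x00" for i in range(len(ordered))]

    # First pass: swap every substring for its placeholder.
    for sub, placeholder in zip(ordered, placeholders):
        text = text.replace(sub, placeholder)

    # Second pass: swap every placeholder for the bolded substring.
    for sub, placeholder in zip(ordered, placeholders):
        text = text.replace(placeholder, "**" + sub + "**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_two_pass(mystring, ["house", "the house"]))
```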
0

Replace the substrings with unique placeholder values, then replace each placeholder with the original substring enclosed in ** to make it bold.

For example:

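'the house' with 'TEMP_THE_HOUSE', then 'house' with 'TEMP_HOUSE'

then 'TEMP_HOUSE' with '**house**' and 'TEMP_THE_HOUSE' with '**the house**'. (The placeholders must not contain any of the search strings themselves; they are uppercase here so that replacing 'house' does not touch the first placeholder.)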

Should work fine. You can automate this by using two lists.
