Python: Replace all substring occurrences with regular expressions

Question

I would like to replace all occurrences of several substrings using regular expressions. The original sentence looks like this:

mystring = "Carl's house is big. He is asking 1M for that(the house)."

Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:

substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"

At the end I want the original sentence like this:

mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."

The main problem is that, since I have several substrings to replace, they can overlap, as in the example above. If I process the longest substring first, I get this:

Carl's **house** is big. He is asking 1M for that(**the **house****).

On the other hand, if I process the shortest substring first, I get this:

Carl's **house** is big. He is asking 1M for that(the **house**).

It seems I need to replace from the longest substring to the shortest, but I wonder how to make sure that text already bolded in the first replacement is not matched again in the second. Also, remember that a substring can appear several times in the string.

Note: Assume the string ** never occurs in the original string, so we can use it to bold our words.

re.sub() can take a function for the repl argument. Create a pattern that matches your substrings, then create a function that takes a match object as an argument and returns that string modified however you want. — wwii, Commented Dec 13, 2016 at 21:24
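
A minimal sketch of what this comment describes, assuming the two substrings from the question. The names `pattern` and `bold` are made up for illustration; the longer alternative is listed first so it is preferred when both could start at the same position.

```python
import re

mystring = "Carl's house is big. He is asking 1M for that(the house)."

# One pattern that matches either substring (longer alternative first).
pattern = re.compile(r"the house|house")

def bold(match):
    # match.group(0) is the full text matched by the pattern.
    return "**" + match.group(0) + "**"

# re.sub calls bold() once for every non-overlapping match.
result = pattern.sub(bold, mystring)
print(result)
```

Because the whole string is scanned in a single pass, text that has already been wrapped in ** is never looked at again.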

jasonharper · Accepted Answer · 2016-12-13 21:19:39Z

3

You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:

re.sub(r"(house|the house)", r"**\1**", mystring)
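
If the substrings come from a list, the same pattern can be built programmatically. This is only a sketch: `bold_all` is a hypothetical helper, not part of `re`. It sorts the substrings longest first, because alternatives are tried left to right and a shorter string that is a prefix of a longer one (say "house" and "houseboat") would otherwise win. It also runs each substring through `re.escape` so characters like "(" or "." are matched literally.

```python
import re

def bold_all(text, substrings):
    # Hypothetical helper: wrap every occurrence of any substring in **.
    if not substrings:
        return text
    # Longest first, so "houseboat" is tried before its prefix "house".
    ordered = sorted(substrings, key=len, reverse=True)
    # Escape each substring so regex metacharacters are taken literally.
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.sub("(" + alternation + ")", r"**\1**", text)

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all(mystring, ["house", "the house"]))
print(bold_all("a houseboat and a house", ["house", "houseboat"]))
```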


What does the \1 mean? What does | mean?
– wwii
Commented Dec 13, 2016 at 21:20
@wwii \1 refers to the first matched group; | is an "or" operator that lets the pattern match either house or the house
– ashwinjv
Commented Dec 13, 2016 at 21:29


ashwinjv · Accepted Answer · 2016-12-13 21:12:30Z

You could have a group that is not captured and is not required. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string and, if it is present, it should be included in the match. This way, you let the re library optimize the way it matches. Here is the complete example:

>>> import re
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."

Note: \g<repl> inserts the whole string matched by the named group repl. Raw strings (r'...') keep Python from treating \g as a string escape.

Quinn Weber · Accepted Answer · 2016-12-13 21:06:04Z

0

You could do two passes:

First: Go through from longest to shortest and replace with something like:

'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'

Second: Go through again and replace the placeholders like:

'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
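
The two passes above could be sketched roughly like this, using plain `str.replace`. It assumes the placeholders never occur in the original text and do not contain any of the substrings themselves (`str.replace` is case-sensitive, so "AA_THE_HOUSE" does not contain "house").

```python
mystring = "Carl's house is big. He is asking 1M for that(the house)."

# First pass: longest substring first, each swapped for a placeholder.
first_pass = [("the house", "AA_THE_HOUSE"), ("house", "BB_HOUSE")]
# Second pass: placeholders swapped for the bolded text.
second_pass = [("AA_THE_HOUSE", "**the house**"), ("BB_HOUSE", "**house**")]

result = mystring
for old, new in first_pass + second_pass:
    result = result.replace(old, new)
print(result)
```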


Naveen Jetty · Accepted Answer · 2016-12-13 21:25:56Z

0

Replace the strings with some unique values, and then replace those values with the original strings enclosed in ** to make them bold.

For example:

'the house' with 'TEMP_THE_HOUSE', then 'house' with 'TEMP_HOUSE' (the placeholders must not themselves contain any of the substrings being replaced)

then 'TEMP_HOUSE' with '**house**' and 'TEMP_THE_HOUSE' with '**the house**'

Should work fine. You can automate this by using two lists.
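
One possible way to automate it with two parallel lists is sketched below. `bold_all_two_pass` is a hypothetical helper, and the choice of Unicode private-use characters as placeholders is an assumption: it only works if those characters never appear in the text or in the substrings.

```python
def bold_all_two_pass(text, substrings):
    # List 1: the non-empty substrings, longest first.
    ordered = sorted((s for s in substrings if s), key=len, reverse=True)
    # List 2: one single-character placeholder per substring, taken from
    # the Unicode private-use area (assumed absent from text and substrings).
    placeholders = [chr(0xE000 + i) for i in range(len(ordered))]

    # Pass 1: substring -> placeholder.
    for sub, ph in zip(ordered, placeholders):
        text = text.replace(sub, ph)
    # Pass 2: placeholder -> bolded substring.
    for sub, ph in zip(ordered, placeholders):
        text = text.replace(ph, "**" + sub + "**")
    return text

mystring = "Carl's house is big. He is asking 1M for that(the house)."
print(bold_all_two_pass(mystring, ["house", "the house"]))
```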
