I need to perform a regular-expression find and replace many times in a large HTML file.

The first string to find is the following:

(?-i)(<p class=.+\r\n.+)([\d]{2}/[\d]{2}/[\d]{4})(((.+\r\n)+?)(.+>))(MIR)

The second string to find is the next occurrence of:

"Times New Roman"'>MIR

My objective is to insert the date group ([\d]{2}/[\d]{2}/[\d]{4}) into the second string, just before the "MIR".

My attempts to do this were unsuccessful.

The search string I came up with was this:

(?-i)())(MIR)((.+\r\n)+?)((.+)?>)(MIR)

The replacement string I came up with was this:

\1\2\3\7\8\10\2< br />\12

It does not work.

I need to do many such find and replace actions in the html file.

Any help you could give me would be much appreciated.

Marc

  • No need to encase \d inside [] unless you're adding something else to the character class that you may want matched.
    – ctwheels
    Commented Mar 2, 2018 at 21:40
  • By default case-insensitivity isn't applied. You don't need (?-i).
    – revo
    Commented Mar 2, 2018 at 22:00
  • This is pretty confusing. "The first string to find is the following": where is the actual source string? Please provide a minimal example with input string and desired result (and patterns you tried).
    – Jeto
    Commented Mar 2, 2018 at 22:51
  • Sorry for posting my question in a confusing way. I posted a minimal text file at this url: dropbox.com/s/y8j7r1pcotrcrxz/…
    Commented Mar 3, 2018 at 1:14
  • I put the wrong search string in my original post. The search string I used was the following: (?-i)(<p class=.+\r\n.+)([\d]{2}/[\d]{2}/[\d]{4})(((.+\r\n)+?)(.+>))(MIR)((.+\r\n)+?)((.+)?>)(MIR)
    Commented Mar 3, 2018 at 1:15
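
The first two comments can be checked with a short sketch. Python is assumed here because the question does not name the regex engine, and the `sample` string is a made-up illustration rather than text from the asker's file.

```python
import re

# Hypothetical sample text, not taken from the asker's HTML file.
sample = "05/13/2016 MIR mir"

# [\d]{2} and \d{2} describe the same set of characters.
with_brackets = re.findall(r"[\d]{2}/[\d]{2}/[\d]{4}", sample)
without_brackets = re.findall(r"\d{2}/\d{2}/\d{4}", sample)

# Matching is case-sensitive unless a flag such as re.IGNORECASE is passed.
case_sensitive = re.findall(r"MIR", sample)
case_insensitive = re.findall(r"MIR", sample, re.IGNORECASE)

print(with_brackets == without_brackets)  # True
print(case_sensitive)                     # ['MIR']
print(case_insensitive)                   # ['MIR', 'mir']
```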

1 Answer

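Try it with the following pattern:

(<p class=.+\n.+)([\d]{2}/[\d]{2}/[\d]{4})(?:((.+\n)+?)(.+>))(MIR)

and substitution:

\1\2\3\5\2< br />\6

Group 4 is nested inside group 3 and only holds the last line that group 3 matched, so it is left out of the substitution; including it would write that line twice.

Sample Code:

import re

regex = r"(<p class=.+\n.+)([\d]{2}/[\d]{2}/[\d]{4})(?:((.+\n)+?)(.+>))(MIR)"
# Group 4 sits inside group 3, so it is not repeated in the substitution.
subst = "\\1\\2\\3\\5\\2< br />\\6"

# Placeholder input (line breaks assumed); replace it with the contents of the HTML file.
test_str = (
    "<p class=MsoNormal><span style='font-size:9.0pt;\n"
    "mso-fareast-font-family:\"Times New Roman\"'>05/13/2016<o:p></o:p></span></p>\n"
    "<p class=MsoNormal><span style='font-size:9.0pt;\n"
    "mso-fareast-font-family:\"Times New Roman\"'>MIR\n"
)

# count=0 replaces every match; pass a positive count to limit the number of replacements.
result = re.sub(regex, subst, test_str, count=0)

if result:
    print(result)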

Online demo.
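
As a possible variation, here is a minimal sketch that uses named groups and a replacement function instead of numbered backreferences. The names `DATE_THEN_MIR`, `insert_date_before_mir`, and `separator`, as well as the `sample` text and its line breaks, are assumptions made for illustration. The pattern drops the `<p class=` prefix and accepts both `\r\n` and `\n` line endings, so it is looser than the pattern above; the default separator follows the `>05/13/2016 MIR` form the asker describes in the comments.

```python
import re

# Verbose pattern: whitespace and #-comments inside it are ignored by the engine.
DATE_THEN_MIR = re.compile(
    r"""
    (?P<head>
        (?P<date>\d{2}/\d{2}/\d{4})  # the date to copy, e.g. 05/13/2016
        .*\r?\n                      # rest of the line that holds the date
        (?:.*\r?\n)*?                # as few whole lines as possible
        .*>                          # final line, up to the > just before MIR
    )
    (?P<mir>MIR)
    """,
    re.VERBOSE,
)


def insert_date_before_mir(html, separator=" "):
    """Copy each matched date in front of the next MIR that directly follows a '>'."""
    def repl(match):
        return match.group("head") + match.group("date") + separator + match.group("mir")
    return DATE_THEN_MIR.sub(repl, html)


# Hypothetical input with Windows line endings; the real file may be laid out differently.
sample = (
    "<p class=MsoNormal><span style='font-size:9.0pt;\r\n"
    "mso-fareast-font-family:\"Times New Roman\"'>05/13/2016<o:p></o:p></span></p>\r\n"
    "<p class=MsoNormal><span style='font-size:9.0pt;\r\n"
    "mso-fareast-font-family:\"Times New Roman\"'>MIR\r\n"
)
result = insert_date_before_mir(sample)
print(result)
```

When reading the real file, opening it with `open(path, newline="")` keeps any `\r\n` line endings untranslated, so they can be written back unchanged. Note that if a date has no `>MIR` before the next date, this sketch pairs the earlier date with the later MIR, so the output is worth spot-checking.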

  • Your original pattern has only 7 capturing groups. I have replaced the overlapping one with a non-capturing group.
    – wp78de
    Commented Mar 3, 2018 at 20:36
  • STRING #1: <p class=MsoNormal><span style='font-size:9.0pt;font-family:"Arial",sans-serif; mso-fareast-font-family:"Times New Roman"'>05/13/2016<o:p></o:p></span></p> <p class=MsoNormal><span style='font-size:9.0pt;font-family:"Arial",sans-serif; mso-fareast-font-family:"Times New Roman"'>MIR
    STRING #2 = Numerous lines of HTML code in which STRING #1 does not appear.
    STRING #3 equals the quoted characters: ">MIR"
    OBJECTIVE: Capture the date (05/13/2016) in STRING #1 into a group, and insert the date into STRING #3 between the > symbol and the MIR, as follows: >05/13/2016 MIR
    Commented Mar 3, 2018 at 20:37
  • Dear wp78de, your comment of 1 min. ago appeared just as I was posting my comment, which is incomprehensible because the line feeds were removed. I'll post it as a file in a few minutes, i.e., as soon as I can get it done.
    Commented Mar 3, 2018 at 20:39
  • I uploaded a PDF file (which replied to wp78de's postings) at this url: dropbox.com/s/zgjdzrvmpuv5wnm/…
    Commented Mar 3, 2018 at 20:53
  • I uploaded a .txt version of the same file (which replied to wp78de's postings) at this URL: dropbox.com/s/jl4mrj7x481647d/…
    Commented Mar 3, 2018 at 20:57
