python .replace() regex [duplicate]

Question

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))

Warning: parsing HTML with regular expressions leads to madness. — Adam Rosenfield, Commented Jul 13, 2012 at 18:08
I have a bunch of garbage after my closing html tag and I just want to remove it. — user1442957, Commented Jul 13, 2012 at 18:11
But what if your HTML has a quoted string, comment, JavaScript, or CDATA containing </html>? Or what if the garbage at the end itself has a </html>? Unless you can guarantee that none of those etc. can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length: HTTP header). — Adam Rosenfield, Commented Jul 13, 2012 at 18:16
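The two failure modes raised in the comment above can be seen with a small sketch. Both input strings are invented for illustration: one has the tag text inside a script, the other has it inside the trailing garbage.

```python
# Hypothetical inputs illustrating the two failure modes.
tag = '</html>'
in_script = '<html><script>var s = "</html>";</script><p>text</p></html>junk'
in_garbage = '<html><p>text</p></html>junk</html>more junk'

# Cutting at the first occurrence truncates the document inside the script.
first_cut = in_script[:in_script.find(tag) + len(tag)]

# Cutting at the last occurrence keeps part of the trailing garbage.
last_cut = in_garbage[:in_garbage.rfind(tag) + len(tag)]

print(first_cut)  # <html><script>var s = "</html>
print(last_cut)   # <html><p>text</p></html>junk</html>
```

Neither "first occurrence" nor "last occurrence" is right for both inputs, which is why a plain text search is only safe when the input is known not to contain such cases.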

Flame · Accepted Answer · 2023-01-12 14:07:41Z

871

No. Regular expressions in Python are handled by the re module.

import re

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)
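To make the difference concrete, here is a minimal sketch contrasting str.replace, which treats its first argument as literal text, with re.sub. The sample article string is made up for illustration.

```python
import re

# Hypothetical sample input: junk follows the closing tag on the same line.
article = "<html><body>Hi</body></html>garbage"

# str.replace looks for the literal text '</html>.+', which is not present,
# so the string comes back unchanged.
unchanged = article.replace('</html>.+', '</html>')

# re.sub interprets the pattern as a regular expression.
cleaned = re.sub(r'(?is)</html>.+', '</html>', article)

print(unchanged == article)  # True
print(cleaned)               # <html><body>Hi</body></html>
```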

edited Jan 12, 2023 at 14:07

Flame


answered Jul 13, 2012 at 18:05

Ignacio Vazquez-Abrams


2

How would I apply the re module to my 'article' variable?
– user1442957
Commented Jul 13, 2012 at 18:05
I tried the following to no avail z.write(re.sub(r'</html>.+', r'</html>', article))
– user1442957
Commented Jul 13, 2012 at 18:17
4

Is the tag not lowercase, or is it followed by a '\n'? You can make it case-insensitive ((?i) flag) and make . match newlines ((?s) flag) with r'(?is)</html>.+'.
– MRAB
Commented Jul 13, 2012 at 18:32
3

Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern.
– parvus
Commented Jul 8, 2021 at 5:14
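The two spellings discussed in these comments can be compared side by side. This is a sketch on an invented input with an upper-case tag and multi-line junk. Note that flags is passed by keyword: the fourth positional parameter of re.sub is count, so a flag passed positionally would be read as a count.

```python
import re

# Hypothetical input: upper-case tag followed by junk spanning several lines.
article = "<HTML>body</HTML>\njunk line 1\njunk line 2\n"

# Without flags the pattern is case-sensitive, so nothing matches here.
no_flags = re.sub(r'</html>.+', '</html>', article)

# Inline flags placed at the start of the pattern.
inline = re.sub(r'(?is)</html>.+', '</html>', article)

# The same flags passed by keyword argument.
keyword = re.sub(r'</html>.+', '</html>', article,
                 flags=re.DOTALL | re.IGNORECASE)

print(no_flags == article)  # True
print(inline == keyword)    # True
print(keyword)              # <HTML>body</html>
```

The replacement text is inserted as written, so the closing tag comes out lower-case even though the original was upper-case.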


Andre Pena · Accepted Answer · 2020-02-06 04:40:36Z

119

To replace text using a regular expression, use the re.sub function:

re.sub(pattern, repl, string, count=0, flags=0)

It replaces non-overlapping occurrences of pattern in string with repl. If you need to inspect each match, for instance to extract specific captured groups, you can pass a function as the repl argument; it is called with each match object and returns the replacement text. See the re.sub documentation for details.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
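The replacement argument of re.sub can also be a function. A minimal sketch follows, where double_id is a hypothetical helper that doubles each numeric path segment:

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

result = re.sub(r'/(\d+)', double_id, '/andre/23/abobora/43435')
print(result)  # /andre/46/abobora/86870
```

The function receives a match object for every non-overlapping match and must return the string to substitute.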

edited Feb 6, 2020 at 4:40

answered Jan 3, 2017 at 16:02

Andre Pena



Julian · Accepted Answer · 2012-07-13 19:01:50Z

9

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner and should be much faster than a regex-based solution.
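One possible variation on the slicing idea, sketched under the assumption that a missing tag should leave the text untouched: str.index raises ValueError when the substring is absent, while str.find returns -1. The helper name truncate_after is made up for this example, and len(tag) replaces the hard-coded 7.

```python
def truncate_after(text, tag='</html>'):
    """Return text up to and including the first occurrence of tag.

    If tag is absent, text is returned unchanged.
    """
    pos = text.find(tag)  # -1 when the tag is not found
    if pos == -1:
        return text
    return text[:pos + len(tag)]

print(truncate_after('<html>ok</html>junk'))  # <html>ok</html>
print(truncate_after('no closing tag'))       # no closing tag
```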

answered Jul 13, 2012 at 19:01

Julian


13

Not so clean; you have to hard-code the length of "</html>".
– Daniel Griscom
Commented Feb 28, 2016 at 20:44
@DanielGriscom : what about len(str('</html>')) ?
– Edgard Knive
Commented Mar 3, 2018 at 13:35
@OleAnders Better, but then you're duplicating that string, which opens another possibility for error.
– Daniel Griscom
Commented Mar 3, 2018 at 14:30
@OleAnders ... and just realized; no need for the str(); just use len('</html>')
– Daniel Griscom
Commented Mar 3, 2018 at 16:00
4

I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish.
– Julian
Commented Mar 3, 2018 at 18:42


norio · Accepted Answer · 2017-06-24 20:08:09Z

8

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
se = '</html>'
# Use a context manager so the file is flushed and closed.
with open('out.txt', 'w') as z:
    z.write(article.split(se)[0] + se)

This writes the following to out.txt:

<html>Larala
Ponta Monta 
</html>
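A related sketch uses str.partition, which splits only at the first occurrence and returns an empty separator when the tag is absent. By contrast, split(se)[0] + se appends the tag even when the input never contained it. The function name keep_through_tag and the sample strings are assumptions for illustration.

```python
se = '</html>'

def keep_through_tag(article, se=se):
    # partition returns (before, separator, after);
    # separator is '' when se does not occur in article.
    head, sep, _tail = article.partition(se)
    return head + sep

print(keep_through_tag('<html>a</html>junk</html>more'))  # <html>a</html>
print(keep_through_tag('no tag here'))                    # no tag here
print('no tag here'.split(se)[0] + se)                    # no tag here</html>
```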

answered Jun 24, 2017 at 20:08

norio


