520

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))
4
  • 82
    Warning: parsing HTML with regular expressions leads to madness. Commented Jul 13, 2012 at 18:08
  • 6
    I have a bunch of garbage after my closing html tag and I just want to remove it. Commented Jul 13, 2012 at 18:11
  • 1
    But what if your HTML has a quoted string, comment, JavaScript, or CDATA containing </html>? Or what if the garbage at the end itself has a </html>? Unless you can guarantee that none of those etc. can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length: HTTP header). Commented Jul 13, 2012 at 18:16
  • 16
    none of those things are a factor. Commented Jul 13, 2012 at 18:19

4 Answers

871

No. Regular expressions in Python are handled by the re module.

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

str_output = re.sub(regex_search_term, regex_replacement, str_input)
4
  • 2
    How would I apply the re module to my 'article' variable? Commented Jul 13, 2012 at 18:05
  • I tried the following to no avail z.write(re.sub(r'</html>.+', r'</html>', article)) Commented Jul 13, 2012 at 18:17
  • 4
    Is the tag not lowercase, or is it followed by a '\n'? You can make it case-insensitive ((?i) flag) and make . match newlines ((?s) flag) with r'(?is)</html>.+'.
    – MRAB
    Commented Jul 13, 2012 at 18:32
  • 3
    Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern.
    – parvus
    Commented Jul 8, 2021 at 5:14
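
To illustrate the two comments above, here is a minimal sketch. The sample article string is made up for the demonstration. It shows why the pattern without flags leaves the text untouched when the tag is uppercase or followed by a newline, and how the keyword flags fix that.

```python
import re

# Hypothetical sample input: an uppercase closing tag followed by a newline
# and some trailing garbage.
article = "<html>\n<body>text</body>\n</HTML>\ngarbage line 1\ngarbage line 2\n"

# Without flags, '.' does not match newlines and matching is case-sensitive,
# so nothing is replaced for this input.
unchanged = re.sub(r"</html>.+", "</html>", article)

# With DOTALL and IGNORECASE passed as keyword flags, the match runs from the
# closing tag to the end of the string.
cleaned = re.sub(r"</html>.+", "</html>", article,
                 flags=re.DOTALL | re.IGNORECASE)

print(unchanged == article)  # True
print(repr(cleaned))
```

Note that the replacement text is inserted literally, so the uppercase tag in the input comes out as lowercase '</html>'.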
119

To replace text using a regular expression, use the re.sub function:

sub(pattern, repl, string[, count, flags])

It replaces non-overlapping occurrences of pattern in string with the text passed as repl. If you need to analyze the match, for instance to extract information about specific group captures, you can pass a function as the repl argument. See the re module documentation for more information.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
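
The answer mentions passing a function as the replacement but does not show it, so here is a small sketch. The helper name double_id is made up for illustration. re.sub calls it once per match and substitutes the string it returns.

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return "/" + str(int(match.group(1)) * 2)

result = re.sub(r"/(\d+)", double_id, "/andre/23/abobora/43435")
print(result)  # /andre/46/abobora/86870
```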
9

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex-based solution.

5
  • 13
    Not so clean; you have to hard-code the length of "</html>". Commented Feb 28, 2016 at 20:44
  • @DanielGriscom : what about len(str('</html>')) ? Commented Mar 3, 2018 at 13:35
  • @OleAnders Better, but then you're duplicating that string, which opens another possibility for error. Commented Mar 3, 2018 at 14:30
  • @OleAnders ... and just realized; no need for the str(); just use len('</html>') Commented Mar 3, 2018 at 16:00
  • 4
    I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish.
    – Julian
    Commented Mar 3, 2018 at 18:42
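
Following the comment thread above, one way to avoid both the hard-coded 7 and the duplicated string is to keep the tag in a single variable. This is only a sketch, and the function name truncate_after is hypothetical. It uses str.find rather than str.index, so a missing tag returns the text unchanged instead of raising ValueError.

```python
def truncate_after(text, marker="</html>"):
    """Return text up to and including the first occurrence of marker."""
    idx = text.find(marker)
    if idx == -1:
        # Marker not present: leave the text as it is.
        return text
    return text[:idx + len(marker)]

print(truncate_after("<html>hi</html>junk"))  # <html>hi</html>
print(truncate_after("no tag here"))          # no tag here
```

As the last comment points out, this is still a plain string search and shares the limitations of the regex approach on real-world HTML.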
8

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in:

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
se='</html>'
# The with block closes the file once the write is done.
with open('out.txt','w') as z:
    z.write(article.split(se)[0]+se)

This writes the following to out.txt:

<html>Larala
Ponta Monta 
</html>
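
The choice between split and rsplit matters when the tag occurs more than once. The sketch below uses a made-up input to show the difference: split(se)[0] cuts at the first occurrence, and rsplit(se, 1)[0] cuts at the last. It also shows str.partition as a variant that does not append a tag when none was present.

```python
se = "</html>"
article = "<html>a</html>junk</html>more junk"  # hypothetical input

# Cut at the first occurrence of the tag.
first_cut = article.split(se)[0] + se

# Cut at the last occurrence of the tag.
last_cut = article.rsplit(se, 1)[0] + se

# partition returns (text, '', '') when the separator is missing, so
# head + sep leaves such input unchanged, whereas split(se)[0] + se
# would append a tag that was never there.
head, sep, _tail = "no closing tag".partition(se)
safe = head + sep

print(first_cut)  # <html>a</html>
print(last_cut)   # <html>a</html>junk</html>
print(safe)       # no closing tag
```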
