I am trying to grab everything after the '</html>'
tag and delete it, but my code doesn't seem to be doing anything. Does .replace()
not support regex?
z.write(article.replace('</html>.+', '</html>'))
No. Regular expressions in Python are handled by the re
module.
article = re.sub(r'(?is)</html>.+', '</html>', article)
In general:
str_output = re.sub(regex_search_term, regex_replacement, str_input)
z.write(re.sub(r'</html>.+', r'</html>', article))
What if the trailing text contains '\n'
? You can make the pattern case-insensitive (the (?i)
flag) and make .
match newlines (the (?s)
flag) with r'(?is)</html>.+'
.
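A quick demonstration of how those flags change the match (the sample string here is illustrative, not from the question):

```python
import re

# Trailing junk spans several lines and the closing tag is uppercase.
text = '<p>hi</p></HTML>\njunk line 1\njunk line 2'

# Without flags, the match is case-sensitive and '.' stops at newlines,
# so nothing matches and the string comes back unchanged.
print(re.sub(r'</html>.+', '</html>', text))

# With (?is), the match ignores case and '.' also matches newlines,
# so everything from the tag onward is replaced.
print(re.sub(r'(?is)</html>.+', '</html>', text))  # -> '<p>hi</p></html>'
```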
To replace text using a regular expression, use the re.sub function:
sub(pattern, repl, string[, count, flags])
It will replace non-overlapping instances of pattern
in string
by the text passed as repl
. If you need to analyze the match, for instance to extract information about specific group captures, you can pass a function as the repl
argument. More info in the re.sub documentation.
Examples
>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'
>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
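Since repl can also be a callable, here is a small sketch of passing a function that receives each match object (the bump helper is made up for illustration):

```python
import re

def bump(match):
    # Receives a match object; return the replacement text.
    # Here: increment each number found in the path.
    return str(int(match.group(0)) + 1)

print(re.sub(r'\d+', bump, '/andre/23/abobora/43435'))
# -> '/andre/24/abobora/43436'
```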
You can use the re
module for regexes, but regexes are probably overkill for what you want. I might try something like
z.write(article[:article.index("</html>") + 7])
This is much cleaner, and should be much faster than a regex based solution.
Instead of hard-coding the magic number 7
; just use len('</html>')
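Putting those two ideas together, a minimal sketch of the slicing approach (the article value here is a stand-in for the real HTML):

```python
tag = '</html>'
article = '<html>body</html>\ntrailing junk'

# Keep everything up to and including the closing tag.
# Note: str.index raises ValueError if the tag is missing.
cleaned = article[:article.index(tag) + len(tag)]
print(cleaned)  # -> '<html>body</html>'
```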
For this particular case, if using the re
module is overkill, how about using the split
(or rsplit
) method, as in
se='</html>'
z.write(article.split(se)[0]+se)
For example,
#!/usr/bin/python
article='''<html>Larala
Ponta Monta
</html>Kurimon
Waff Moff
'''
z = open('out.txt', 'w')
se = '</html>'
z.write(article.split(se)[0] + se)
z.close()
outputs out.txt
as
<html>Larala
Ponta Monta
</html>
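As the answer hints, rsplit works the same way but splits from the right, which matters if the string happens to contain more than one occurrence of the separator (sample string made up for illustration):

```python
se = '</html>'
s = '<html>a</html>junk</html>more'

# split: keeps everything up to the FIRST tag.
print(s.split(se)[0] + se)      # -> '<html>a</html>'

# rsplit with maxsplit=1: keeps everything up to the LAST tag.
print(s.rsplit(se, 1)[0] + se)  # -> '<html>a</html>junk</html>'
```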
What if the article contains more than one </html>
? Or what if the garbage at the end itself has a </html>
? Unless you can guarantee that none of those cases can happen, you either need to fully parse the HTML or have some other way of knowing how much data you have (e.g. a Content-Length:
HTTP header).