1

I want to delete all comments. This is my regular expression:

re.sub(re.compile('<!--.*-->', re.DOTALL),'', text)

But if my text is:

bzzzzzz <!-- blabla --> blibli <!-- bloblo --> blublu

the result is:

bzzzzzz blublu

instead of:

bzzzzzz blibli blublu

Thanks for your help

2 Answers

11

I'd suggest not using regex for this kind of thing. An HTML-aware tool such as lxml.html.clean is usually a better fit.

Your example:

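# Note: recent lxml releases ship this module separately as the lxml_html_clean package.
import lxml.html.clean as clean

# comments=True tells the Cleaner to strip HTML comments.
cleaner = clean.Cleaner(comments=True)
cleaner.clean_html("bzzzzzz <!-- blabla --> blibli <!-- bloblo --> blublu")
# 'bzzzzzz  blibli  blublu'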

8

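* is greedy while *? is not. The greedy .* matches as much as it can, so it runs from the first <!-- to the last -->; the non-greedy .*? stops at the nearest -->.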

re.sub(re.compile('<!--.*?-->', re.DOTALL), '', text)

or, even shorter:

re.sub('(?s)<!--.*?-->', '', text)
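
To see the difference side by side, here is a minimal sketch that runs both patterns on the sample string from the question. The variable names greedy and lazy are just illustrative.

```python
import re

text = "bzzzzzz <!-- blabla --> blibli <!-- bloblo --> blublu"

# Greedy: .* runs from the first "<!--" to the last "-->",
# so everything between the two comments is swallowed too.
greedy = re.sub(r"<!--.*-->", "", text, flags=re.DOTALL)

# Non-greedy: .*? stops at the nearest "-->",
# so each comment is removed on its own.
lazy = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)

print(repr(greedy))  # 'bzzzzzz  blublu'
print(repr(lazy))    # 'bzzzzzz  blibli  blublu'
```

Note that the spaces on either side of each removed comment are kept, which is why the output contains double spaces.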
