0

I extracted a raw string from a Q&A forum. I have a string like this:

s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

I want to extract this substring "<font color="blue"><font face="Times New Roman">" and assign it to a new variable. I am able to remove it with regex but I don't know how to assign it to a new variable. I am new to regex.

import re
s1 = re.sub('<.*?>', '', s)

This removes the substring, but I'd like to keep the removed part for the record, ideally by assigning it to a variable.

How can I do this? I would prefer regular expressions.

Why don't you use an HTML parser like BeautifulSoup? Commented Feb 10, 2020 at 5:14

2 Answers

1

bs4 is more appropriate for web scraping, but if you are okay with regex for your case, you could do the following:

>>> import re
>>> s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
>>> regex = re.compile('<.*?>')
>>> regex.findall(s)
['<font color="blue">', '<font face="Times New Roman">', '<font color="green">', '<font face="Arial">']
>>> regex.sub('', s)
'Take about 2 + but double check with teacher before you do'
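To keep the removed part in a variable, as the question asks, something like the sketch below might work. It assumes the substring you want is the first run of back-to-back tags; the names `tag_re`, `first_run`, and `cleaned` are only illustrative, and the pattern is a simple one that will not cope with every kind of HTML.

```python
import re

s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">before you do')

# Matches a single tag: "<", then anything that is not ">", then ">".
tag_re = re.compile(r'<[^>]*>')

# Every tag, in order of appearance.
tags = tag_re.findall(s)

# The first run of consecutive tags, captured as one string.
first_run = re.search(r'(?:<[^>]*>)+', s).group(0)

# The original text with all tags stripped out.
cleaned = tag_re.sub('', s)

print(first_run)
print(cleaned)
```

Here `first_run` holds the leading pair of tags as one string, while `tags` keeps every tag separately in case you want to log them all.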

1 Comment

Works like a charm. Thank you @saurabh
0

Regex is not exactly the easiest tool to parse HTML components. You can try using BeautifulSoup to parse the components and make your substring.

from bs4 import BeautifulSoup

s = """Take about 2 + <font color="blue">
       <font face="Times New Roman">but double check with teacher <font color="green">
       <font face="Arial">before you do"""


soup = BeautifulSoup(s, "html.parser")

Print the parsed HTML with print(soup.prettify()):

Take about 2 +
<font color="blue">
 <font face="Times New Roman">
  but double check with teacher
  <font color="green">
   <font face="Arial">
    before you do
   </font>
  </font>
 </font>
</font>

Extract components:

soup.font.font['face']
> 'Times New Roman'
soup.font["color"]
> 'blue'

Now make and save your substring as a variable:

variable = f'<font color="{soup.font["color"]}"><font face="{soup.font.font["face"]}">'

This will give you:

"<font color="blue"><font face="Times New Roman">"
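If you would rather not install bs4, the standard library's html.parser can also hand back the raw text of each start tag. This is only a sketch: `StartTagCollector` is a made-up helper name, and it assumes you want the first two start tags joined together.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper that records the raw text of every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source text of the tag just parsed.
        self.start_tags.append(self.get_starttag_text())


s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">before you do')

collector = StartTagCollector()
collector.feed(s)
collector.close()

# Join the first two start tags into the substring from the question.
removed = "".join(collector.start_tags[:2])
print(removed)
```

Because the tags are copied from the source text rather than rebuilt from attributes, the quoting stays exactly as it was in the original string.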
