I have a string that can contain links:

<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a> ...

How can I extract the text (not the URL) of every link, i.e. "Hello", "Hello2", "Hello3", ...? I'd like to end up with a list containing all of the link texts.

2 Comments
  • You may want to look into the BeautifulSoup library. Commented Nov 16, 2012 at 22:48
  • Never use regex for parsing! Never! Commented Nov 17, 2012 at 9:01

1 Answer

Using lxml:

import lxml.html as LH

content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

doc = LH.fromstring(content)
texts = [elt.text_content() for elt in doc.xpath('//a')]
print(texts)

yields

['Hello', 'Hello2', 'Hello3', 'go home, dude!']
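
If installing lxml is not an option, something similar can probably be done with only the standard library. The following is a minimal sketch, not part of the original answer: it swaps lxml for `html.parser.HTMLParser`, and `LinkTextParser` is a hypothetical helper name made up for this example. It assumes the links are not nested inside one another.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []      # finished link texts
        self._parts = None   # text chunks of the <a> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.texts.append("".join(self._parts))
            self._parts = None


content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

parser = LinkTextParser()
parser.feed(content)
parser.close()
print(parser.texts)
```

For the sample string this should give the same list as the lxml version, including the text from inside the nested <b> tag.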

5 Comments

Please don't use /text(), it's a code smell. In particular, it will do funny things on links like <a href="/">go <b>home</b>, dude!</a>
I'd do //a/string(). Is your version equivalent?
I just tried that; for some reason lxml raises lxml.etree.XPathEvalError: Invalid expression.
@larsmans: But to answer your question, yes, text_content() will return all the text between <a> and </a> with no markup.
Using string() as a path step, as in //a/string(), is XPath 2.0 syntax, and lxml only supports XPath 1.0. +1 for a clean solution.
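
To illustrate why grabbing only an element's direct text can surprise you on links with nested markup, here is a small sketch. It is an analogy rather than the code discussed in the comments: it uses the standard library's `xml.etree.ElementTree` instead of lxml, and it assumes the snippet is well-formed XML. The `.text` attribute holds only the text before the first child element, while `itertext()` walks all nested text, which is roughly what `text_content()` does in lxml.

```python
import xml.etree.ElementTree as ET

# Well-formed snippet so the standard-library XML parser accepts it.
link = ET.fromstring('<a href="/">go <b>home</b>, dude!</a>')

first_text_only = link.text            # text before the first child element
full_text = "".join(link.itertext())   # all nested text, markup removed

print(repr(first_text_only))  # 'go '
print(repr(full_text))        # 'go home, dude!'
```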
