Match string pattern in Python

Question

I have a string that can contain links:

<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a> ...

How can I extract the text (not the URL) of every <a> tag, i.e. "Hello", "Hello2", "Hello3", ...? I would like the result as a list containing all of the texts.

You want to look into the BeautifulSoup library.

– Cameron Sparr, Commented Nov 16, 2012 at 22:48
Never use regex for parsing HTML! Never!

– Lucas Hoepner, Commented Nov 17, 2012 at 9:01

unutbu · Accepted Answer · 2012-11-16 22:50:58Z

Using lxml:

import lxml.html as LH

content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

doc = LH.fromstring(content)
texts = [elt.text_content() for elt in doc.xpath('//a')]
print(texts)

yields

['Hello', 'Hello2', 'Hello3', 'go home, dude!']
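
If installing lxml is not an option, the same idea can be sketched with only the standard library's html.parser. This is an illustrative alternative, not part of the original answer; the class name LinkTextParser and its attributes are made up for this example. It assumes <a> tags are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text inside every <a>...</a>, ignoring nested markup."""

    def __init__(self):
        super().__init__()
        self.texts = []      # finished link texts, in document order
        self._parts = None   # fragments of the <a> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        # Only keep text while an <a> is open; nested tags such as <b>
        # still deliver their text through this callback.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.texts.append("".join(self._parts))
            self._parts = None


content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

parser = LinkTextParser()
parser.feed(content)
parser.close()
print(parser.texts)  # ['Hello', 'Hello2', 'Hello3', 'go home, dude!']
```

The lxml version is shorter and more forgiving of broken markup; this variant mainly trades convenience for having no third-party dependency.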


5 Comments

Fred Foo Over a year ago

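Please don't use /text(); it's a code smell. In particular, it selects only the text nodes directly under <a>, so a link like <a href="/">go <b>home</b>, dude!</a> comes back as the fragments "go " and ", dude!" instead of "go home, dude!".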

Fred Foo Over a year ago

I'd do //a/string(). Is your version equivalent?

unutbu Over a year ago

I just tried that; for some reason lxml raises lxml.etree.XPathEvalError: Invalid expression.

unutbu Over a year ago

@larsmans: But to answer your question, yes, text_content() will return all the text between <a> and </a> with no markup.
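
The difference between /text() and text_content() discussed in these comments can be seen on the nested link itself. The snippet below is a small sketch; the wrapping <div> is an assumption added here so that the fragment parses to a single root element.

```python
import lxml.html as LH

# Hypothetical wrapper <div> so the fragment has one root element.
doc = LH.fromstring('<div><a href="/">go <b>home</b>, dude!</a></div>')

# /text() selects only text nodes that are direct children of <a>,
# so the text inside <b> is dropped and the rest arrives in pieces.
pieces = doc.xpath('//a/text()')
print(pieces)  # ['go ', ', dude!']

# text_content() concatenates the text of the element and all its descendants.
whole = [elt.text_content() for elt in doc.xpath('//a')]
print(whole)   # ['go home, dude!']
```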

Fred Foo Over a year ago

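Using string() as a path step, as in //a/string(), is XPath 2.0 syntax, and lxml only supports XPath 1.0. In 1.0, string() can only wrap an expression, e.g. string(//a). +1 for a clean solution.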
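
For reference, string() itself does exist in XPath 1.0; what 1.0 lacks is the ability to use a function call as a path step. The following sketch shows how that plays out in lxml, using made-up sample markup. The error for //a/string() is the one reported in the comments above.

```python
import lxml.html as LH
from lxml.etree import XPathEvalError

doc = LH.fromstring(
    '<div><a href="/a">Hello</a> <a href="/b">go <b>home</b>, dude!</a></div>'
)

# string() as a path step is not valid XPath 1.0, so lxml rejects it.
try:
    doc.xpath('//a/string()')
except XPathEvalError as exc:
    print('rejected:', exc)

# Wrapping an expression is valid XPath 1.0, but a node-set argument is
# reduced to its first node, so only the first link's text comes back.
first = doc.xpath('string(//a)')
print(first)  # Hello

# Evaluating string(.) relative to each <a> gives one string per link.
texts = [a.xpath('string(.)') for a in doc.xpath('//a')]
print(texts)  # ['Hello', 'go home, dude!']
```

In practice the per-element string(.) call gives the same result as text_content(), which is why the answer's version is the simpler choice.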
