Parsing XML in Python

Question

I want to parse text from an XML file. Suppose I have some lines like the following in file.xml:

<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>

How can I extract the following text from the above line:

Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).

After making some changes to the text, I want to return the changed text wrapped in the same tag, as below:

<s id="1792387-2"> Changed Text </s>

Any suggestions, please? Thanks!

What exactly is your question?

– Daniel Roseman, Commented Aug 1, 2011 at 15:20
Do you want to parse the text, the XML or both?

– Legolas, Commented Aug 1, 2011 at 15:22

Fred Foo · Accepted Answer · 2011-08-01 15:26:30Z

5

LXML makes this particularly easy.

>>> from lxml import etree
>>> text = '''<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>'''
>>> def edit(s):
...     return 'Changed Text'
... 
>>> t = etree.fromstring(text)
>>> t.text = edit(t.text)
>>> etree.tostring(t, encoding='unicode')  # without encoding='unicode', Python 3 returns bytes
'<s id="1792387-2">Changed Text</s>'
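The constant-returning `edit` above can be swapped for any function of the old text. The following is a minimal sketch, assuming each line of the file holds one complete `<s>` element as in the question. It uses the standard library's `xml.etree.ElementTree` as a stand-in, since it offers the same `fromstring` and `tostring` calls used here. The helper name `rewrite_line`, the shortened sample sentence, and the upper-casing edit are hypothetical.

```python
# Stdlib stand-in; lxml.etree provides the same fromstring/tostring calls used below.
from xml.etree import ElementTree as etree

def rewrite_line(line, edit):
    """Parse one <s> line, apply edit() to its text, and return the element as a string."""
    elem = etree.fromstring(line)
    elem.text = edit(elem.text or "")  # .text is None when the element is empty
    # encoding="unicode" yields a str rather than bytes on Python 3
    return etree.tostring(elem, encoding="unicode")

line = '<s id="1792387-2">Castro Verde is situated in the Baixo Alentejo Subregion.</s>'
result = rewrite_line(line, str.upper)  # str.upper is only a placeholder edit
print(result)
```

The tag name and the `id` attribute come back unchanged because only the element's `.text` is reassigned.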

answered Aug 1, 2011 at 15:26

Fred Foo


4 Comments

Blue Ice Over a year ago

I'm getting a traceback: Traceback (most recent call last): File "<string>", line 1, in <fragment> builtins.ImportError: No module named lxml

Fred Foo Over a year ago

@Blue Ice: LXML is not a Python built-in module; you have to install it separately. See lxml.de

David Wolever Over a year ago

If you'd like to just use the standard library (python 2.5+) you can use the ElementTree module (see my answer).

Blue Ice Over a year ago

But I am working on a server, and for the moment it is not possible to install it, since I have no administrator access. Any other alternatives, please?

David Wolever · Accepted Answer · 2011-08-01 15:38:29Z

4

There are a couple of stdlib modules for parsing XML, but in general ElementTree is the simplest:

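from xml.etree import ElementTree
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

doc = ElementTree.parse(StringIO("""<doc><s id="1792387-2">Castro…</s><s id="1792387-3">Other stuff</s></doc>"""))
for elem in doc.findall("s"):
    print("Text:", elem.text)
    elem.text = "new text"
    # tostring() returns the serialized element; dump() only prints it and returns None
    print("New:", ElementTree.tostring(elem, encoding="unicode"))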
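If `StringIO` cannot be imported, the string can also be parsed directly with `ElementTree.fromstring`, which returns the root element itself rather than an `ElementTree` wrapper. This is a sketch written for Python 3; the sample text and the `edit` function are placeholders, not part of the original answer.

```python
from xml.etree import ElementTree

xml_text = '<doc><s id="1792387-2">Castro Verde</s><s id="1792387-3">Other stuff</s></doc>'

def edit(text):
    # Placeholder transformation; replace with the real change.
    return text.replace("Other", "Some other")

root = ElementTree.fromstring(xml_text)  # root is the <doc> element
for elem in root.findall("s"):           # direct <s> children of <doc>
    elem.text = edit(elem.text)

# Serialize the whole document back to a str (not bytes)
output = ElementTree.tostring(root, encoding="unicode")
print(output)
```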

And if your XML is coming from a file, you can use:

with open("path/to/foo.xml") as f:  # the file is closed automatically
    doc = ElementTree.parse(f)
… use `doc` …
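To get the changed text back into a file with the same tags, the parsed tree can be written out again with `ElementTree.write`. The sketch below assumes Python 3 and creates its own sample file in a temporary directory so that it runs on its own. The file names and the replacement text are illustrative.

```python
import os
import tempfile
from xml.etree import ElementTree

# Create a small sample file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "foo.xml")
dst = os.path.join(tmp_dir, "foo_changed.xml")
with open(src, "w", encoding="utf-8") as f:
    f.write('<doc><s id="1792387-2">Castro Verde</s></doc>')

doc = ElementTree.parse(src)          # parse() also accepts a file name
for elem in doc.getroot().iter("s"):  # every <s> element, at any depth
    elem.text = "Changed Text"
doc.write(dst, encoding="utf-8")      # tags and attributes are written back unchanged

with open(dst, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Writing to a second file rather than overwriting the source keeps the original intact while the edit function is still being tested.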

edited Aug 1, 2011 at 15:38

answered Aug 1, 2011 at 15:33

David Wolever


4 Comments

Blue Ice Over a year ago

Could you please have a look at the following traceback:

Traceback (most recent call last):   File "<string>", line 1, in <fragment> builtins.ImportError: No module named StringIO

David Wolever Over a year ago

What version of Python are you using? (python --version)

David Wolever Over a year ago

Hrm… Is it a custom or restricted installation? Because StringIO should exist. Anyway, you can try loading it from a file (as per the second portion of my answer).

Legolas Over a year ago

Just to be sure: is it possible you have multiple versions of Python installed? In Python 3 the StringIO module was removed and the import changed to from io import StringIO. I have two Pythons on my system (2.6 and 3.2) and get into such a situation from time to time.

Legolas · Accepted Answer · 2011-08-01 15:44:38Z

1

My favorite is the xml.dom.minidom module, which is part of Python's standard library (http://docs.python.org/py3k/library/xml.dom.minidom.html):

import xml.dom.minidom
d = xml.dom.minidom.parseString("<s id=\"1792387-2\">Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).</s>")
text_node = d.childNodes[0].childNodes[0]  # the text node inside <s>
oldText = text_node.data                   # the original text
text_node.data = "Changed text"            # replace it in place
print(d.toxml())                           # serialized document, including the XML declaration

But this does not help you parse the text itself, so I am not sure exactly what you want there.
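When the document holds several `<s>` elements, the same idea can be applied with `getElementsByTagName`. This is a sketch only: it assumes each `<s>` contains exactly one text node, and the sample document and the upper-casing edit are placeholders.

```python
import xml.dom.minidom

xml_text = '<doc><s id="1792387-2">Castro Verde</s><s id="1792387-3">Other stuff</s></doc>'
dom = xml.dom.minidom.parseString(xml_text)

old_texts = []
for node in dom.getElementsByTagName("s"):
    text_node = node.firstChild              # assumes a single text node inside each <s>
    old_texts.append(text_node.data)         # keep the extracted text
    text_node.data = text_node.data.upper()  # placeholder edit

# Serializing the root element (rather than the document) omits the XML declaration
result = dom.documentElement.toxml()
print(old_texts)
print(result)
```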

edited Aug 1, 2011 at 15:44

answered Aug 1, 2011 at 15:32

Legolas


1 Comment

Blue Ice Over a year ago

I want to extract the following text from the above line:

Castro Verde is situated in the Baixo Alentejo Subregion within a territory known locally as the Campo Branco (English: White Plains).
