<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
        </back>

I have an XML file of scientific journal metadata and am trying to extract just the funding information for each article. I need the text contained within the p tag. While the "sec id" varies between articles, the "sec-type" is always "funding".

I have been trying to do this in Python 3 using ElementTree.

import xml.etree.ElementTree as ET

tree = ET.parse('journals.xml')
root = tree.getroot()
for title in root.iter("title"):
    ET.dump(title)

Any help would be greatly appreciated!

  • Can you give an example of full valid XML? Commented Jan 15, 2019 at 14:58

1 Answer


You can use findall with an XPath expression to extract the values you want. I extrapolated from your example data a little bit in order to complete the document and have two p elements:

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>

The following extracts the text content of every p node directly under a sec node where sec-type="funding":

import xml.etree.ElementTree as ET

doc = ET.parse('journals.xml')
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])

Result:

['This work was supported by the NIH', "I'm a little teapot"]
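One caveat with p.text: it returns only the text before the first child element. If a funding paragraph contains nested markup, itertext() collects the full string. The following is a minimal sketch of that variation. The italic child in the sample and the helper name funding_texts are assumptions for illustration, not part of your data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <italic> child is assumed, purely to show mixed content.
SAMPLE = """
<root>
  <article>
    <back>
      <sec id="sec7" sec-type="funding">
        <title>Funding</title>
        <p>This work was supported by the <italic>NIH</italic> under a grant.</p>
      </sec>
    </back>
  </article>
</root>
"""


def funding_texts(root):
    """Return the full text of each funding paragraph, nested elements included."""
    paragraphs = root.findall('.//sec[@sec-type="funding"]/p')
    # itertext() walks the paragraph and yields every text fragment in document order.
    return ["".join(p.itertext()).strip() for p in paragraphs]


root = ET.fromstring(SAMPLE)
print(funding_texts(root))
```

With plain p.text the same paragraph would yield only "This work was supported by the ", so this matters if your funding statements contain inline tags.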

1 Comment

Thanks for your answer. Is there a way of combining this XPath expression with a simple search for a specific element's text, so that for each article I get the title along with the corresponding funding info? The following gives me the article IDs and the funding info separately, but ideally I want them matched up:

for elem in tree.iter(tag='article-id'):
    print(elem.text)
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])
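One way to keep the IDs and funding text matched is to loop over each article element and run both searches relative to that element rather than the whole document. This is a rough sketch and not tested against real journal data. It assumes each article contains an article-id element somewhere beneath it; the sample layout, the IDs A1 and A2, and the helper name funding_by_article are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the position of <article-id> and the IDs are assumed.
SAMPLE = """
<root>
  <article>
    <front><article-id>A1</article-id></front>
    <back>
      <sec id="sec7" sec-type="funding">
        <title>Funding</title>
        <p>This work was supported by the NIH</p>
      </sec>
    </back>
  </article>
  <article>
    <front><article-id>A2</article-id></front>
    <back>
      <sec id="sec3" sec-type="funding">
        <title>Funding</title>
        <p>I'm a little teapot</p>
      </sec>
    </back>
  </article>
</root>
"""


def funding_by_article(root):
    """Map each article's ID to the list of its funding paragraph texts."""
    results = {}
    for article in root.iter("article"):
        # Both searches start at this article, so matches stay paired.
        article_id = article.findtext(".//article-id")  # None if missing
        funding = [p.text for p in article.findall('.//sec[@sec-type="funding"]/p')]
        results[article_id] = funding
    return results


root = ET.fromstring(SAMPLE)
for article_id, funding in funding_by_article(root).items():
    print(article_id, funding)
```

For a file on disk, ET.parse('journals.xml').getroot() can replace ET.fromstring(SAMPLE). Articles without an article-id would all share the key None here, so a list of (id, funding) tuples may suit your data better.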
