I am trying to parse an XML file using the following code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = Options()
# options.add_argument('--headless=new')  
options.add_argument("start-maximized")
options.add_argument('--log-level=3')  
options.add_experimental_option("prefs", {"profile.default_content_setting_values.notifications": 1})    
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled') 
srv=Service()
driver = webdriver.Chrome (service=srv, options=options)    
# driver.minimize_window()
waitWD = WebDriverWait (driver, 10)  

wLink = "https://projects.propublica.org/nonprofits/organizations/830370609"
driver.get(wLink) 
driver.execute_script("arguments[0].click();", waitWD.until(EC.element_to_be_clickable((By.XPATH, '(//a[text()="XML"])[1]'))))  
driver.switch_to.window(driver.window_handles[1])    
time.sleep(3) 
print(driver.current_url)
soup = BeautifulSoup (driver.page_source, 'lxml')   
worker = soup.find("PhoneNum")
print(worker)

But as you can see in the result, I am, for example, not able to find the element "PhoneNum":

(selenium) C:\DEV\Fiverr2025\TRY\austibn>python test.py
https://pp-990-xml.s3.us-east-1.amazonaws.com/202403189349311780_public.xml?response-content-disposition=inline&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA266MJEJYTM5WAG5Y%2F20250423%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250423T152903Z&X-Amz-Expires=1800&X-Amz-SignedHeaders=host&X-Amz-Signature=9743a63b41a906fac65c397a2bba7208938ca5b865f1e5a33c4f711769c815a4
None

How can I parse the XML file from this site?

2
  • How do you know that it wasn't parsed correctly? Maybe there is no PhoneNum in it. Commented Apr 23 at 15:42
  • The problem is that all HTML parsers convert tags to lowercase. Only XML parsers like lxml-xml or xml keep the original names - try print(BeautifulSoup('<PhoneNum/>', 'lxml-xml').prettify()) and do the same with lxml, html5lib, html.parser, and xml
    – furas
    Commented 2 days ago

3 Answers

2

Fixes:

  1. Use requests.get() to fetch the XML directly (faster and more reliable than Selenium for raw XML).

  2. Parse with BeautifulSoup(..., 'xml') (not 'lxml', which selects lxml's HTML parser and lowercases tag names).

  3. Close Selenium after getting the URL (since it's no longer needed).

  4. Check if the tag exists before accessing .text.
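
Point 4 can be illustrated without any network access. The snippet below is only a minimal sketch, not the code from this answer: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and the sample document and the helper name first_text are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real filing is much larger.
SAMPLE = """<Return>
  <Filer>
    <PhoneNum>6023146022</PhoneNum>
  </Filer>
</Return>"""


def first_text(root, tag):
    """Return the stripped text of the first <tag> descendant, or None if absent."""
    node = root.find(f".//{tag}")
    if node is None or node.text is None:
        return None
    return node.text.strip()


root = ET.fromstring(SAMPLE)
print(first_text(root, "PhoneNum"))  # 6023146022
print(first_text(root, "FaxNum"))    # None
```

With BeautifulSoup the same guard applies: soup.find() returns None for a missing tag, so test the result before reading .text.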

soup.find("PhoneNum") returns only the first phone number, so I use find_all() to return all matching elements.

The following code saves the XML data to a file. If you don't need that, you can delete this part:

with open("propublica_data.xml", "wb") as f:
    f.write(response.content)
print("XML saved to 'propublica_data.xml'")

You also used time.sleep(3), which is generally not recommended for production code. A more robust approach would be an explicit wait such as the line below (please note, I have not changed the time.sleep in your original code):

waitWD.until(EC.presence_of_element_located((By.XPATH, '//*')))
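
To show why an explicit wait is preferable to a fixed sleep, here is a simplified stand-in for the polling that WebDriverWait performs. It is a sketch only: wait_until and ready are hypothetical names, and no browser is involved. With Selenium itself, a condition such as EC.number_of_windows_to_be(2) could, for example, be passed to waitWD.until() before switching to the new tab.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "the new tab has appeared": becomes true on the third check.
calls = {"count": 0}


def ready():
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until(ready, timeout=2.0, poll=0.01))  # True
```

The wait returns as soon as the condition holds, so it costs only as much time as needed, and it fails loudly when the condition never becomes true.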

The full code with corrections:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("start-maximized")
options.add_argument('--log-level=3')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument('--disable-blink-features=AutomationControlled')

srv = Service()
driver = webdriver.Chrome(service=srv, options=options)
waitWD = WebDriverWait(driver, 10)

url = "https://projects.propublica.org/nonprofits/organizations/830370609"
driver.get(url)

xml_button = waitWD.until(EC.element_to_be_clickable((By.XPATH, '(//a[text()="XML"])[1]')))
driver.execute_script("arguments[0].click();", xml_button)

driver.switch_to.window(driver.window_handles[1])
time.sleep(3)
xml_url = driver.current_url
driver.quit()

response = requests.get(xml_url)
if response.status_code != 200:
    print("Failed to download XML")
    exit()

soup = BeautifulSoup(response.content, 'xml')
phone_numbers = soup.find_all('PhoneNum')

if phone_numbers:
    print(f"Found {len(phone_numbers)} phone numbers:")
    for idx, phone in enumerate(phone_numbers, start=1):
        print(f"{idx}. {phone.text.strip()}")
else:
    print("No <PhoneNum> tags found in the XML.")

with open("propublica_data.xml", "wb") as f:
    f.write(response.content)
print("XML saved to 'propublica_data.xml'")

Output:

Found 4 phone numbers:
1. 6023146022
2. 6022687502
3. 6028812483
4. 6023146022
XML saved to 'propublica_data.xml'
2
  • Thanks a lot for your detailed solution! But in the end the main issue was that I used "lxml" instead of "xml" in the BeautifulSoup statement.
    – Rapid1898
    Commented 2 days ago
  • I agree with you. I am pleased to hear that it is functioning properly for you. Thank you. @Rapid1898 Commented 2 days ago
2

When you print the soup object you will see the below:

<phonenum>
  6023146022
</phonenum>

So it's not PhoneNum, but phonenum, because the 'lxml' HTML parser converts tag names to lowercase.

Change the code to:

worker = soup.find("phonenum")
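
The lowercasing is not specific to BeautifulSoup. As a rough illustration using only the standard library (so html.parser and xml.etree.ElementTree here, not the lxml parser from the question), the same made-up snippet yields different tag names depending on whether it is read as HTML or as XML. TagCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SNIPPET = "<Filer><PhoneNum>6023146022</PhoneNum></Filer>"


class TagCollector(HTMLParser):
    """Record tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SNIPPET)
collector.close()
html_tags = collector.tags

# An XML parser keeps the original case of every element name.
xml_tags = [element.tag for element in ET.fromstring(SNIPPET).iter()]

print(html_tags)  # ['filer', 'phonenum']
print(xml_tags)   # ['Filer', 'PhoneNum']
```

Searching for the lowercase name works with an HTML parser, while switching to an XML parser lets you keep the names as they appear in the file.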
1

You don't need selenium for this because the XML buttons are static.

You could just do this:

import requests
from bs4 import BeautifulSoup as BS
from urllib.parse import urlparse

URL = "https://projects.propublica.org/nonprofits/organizations/830370609"
PU = urlparse(URL)

with requests.Session() as session:
    with session.get(URL) as response:
        response.raise_for_status()
        soup = BS(response.text, "html.parser")
        for b in soup.select("a.btn[href]:-soup-contains(XML)"):
            with session.get(f"{PU.scheme}://{PU.netloc}{b.attrs['href']}") as response:
                response.raise_for_status()
                soup = BS(response.text, features="xml")
                for phone in soup.find_all("PhoneNum"):
                    print(phone.text)

Output:

6022687502
6022687502
6027766300
6022687502
6022687502
6022687502
6027766300
6022687502
6022687502
4806637867
6022687502

You may want to adjust the code to get just one of the XML files. This gets all of them.
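
One possible way to restrict this to a single file is to take only the first matching link. The sketch below shows just that selection step with placeholder data: the href values are hypothetical and stand in for what soup.select(...) would yield, and urljoin is used here in place of the manual scheme/netloc concatenation.

```python
from urllib.parse import urljoin

PAGE_URL = "https://projects.propublica.org/nonprofits/organizations/830370609"

# Hypothetical href values; the real ones come from the page's XML buttons.
hrefs = ["/example/filing-a.xml", "/example/filing-b.xml"]

# Take the first link only, or None if the page had no XML buttons.
first_href = next(iter(hrefs), None)

if first_href is None:
    xml_url = None
    print("No XML links found")
else:
    # Resolve the site-relative href against the page URL.
    xml_url = urljoin(PAGE_URL, first_href)
    print(xml_url)
```

In the BeautifulSoup code above, soup.select_one() with the same selector, or a break after the first iteration of the loop, would have a similar effect.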
