I'm trying to figure out how to scrape the data from the following url: https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx
Here is the type of data:
It appears that everything is populated from a database and loaded into the webpage via javascript.
I've done something similar in the past using selenium and PhantomJS, but I can't figure out how to get these data fields in Python.
As expected, I can't use pd.read_html for this type of problem.
Is it possible to parse the results from:
from selenium import webdriver
url="https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"
browser = webdriver.PhantomJS()
browser.get(url)
content = browser.page_source
Or maybe to access the actual underlying data?
If not, what are other approaches, short of copying and pasting for hours?
EDIT:
Building on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to go across all of the pages (recommendations on how to parse the results properly are also welcome)? My end goal is to have this in a pandas DataFrame.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# POST the search criteria to the search endpoint
r = requests.post(
    url='https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},
)
html = r.text

# Parse only the markup that follows the last </script> tag
soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
div = soup.find_all("div", {"class": "blue-border panel list-group"})

def f(x):
    # Split a panel's text into stripped, non-empty lines and drop the UI labels
    ignore_fields = ['Collapse all', 'Expand all']
    output = list(filter(bool, map(str.strip, x.text.split("\n"))))
    output = list(filter(lambda s: s not in ignore_fields, output))
    return output

# Fields of the first panel on page 1
results = pd.Series(list(map(f, div))[0])
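One possible shape for the "all pages" loop, as a rough sketch only: keep requesting page 1, 2, 3, ... and stop at the first page that yields no records. The helper name collect_all_pages and the two callables are made up for illustration, and the fake fetcher below just stands in for the real request so the snippet runs offline. The name of the site's actual page parameter is not known here and would have to be read from the request shown in the browser's Network tab.

```python
def collect_all_pages(fetch_page, parse_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no records.

    fetch_page(n) returns the raw text of page n; parse_page(text) returns a
    list of records. Both are supplied by the caller (hypothetical helpers).
    """
    records = []
    for page in range(1, max_pages + 1):
        page_records = parse_page(fetch_page(page))
        if not page_records:  # empty page: assume we ran past the last one
            break
        records.extend(page_records)
    return records


# Stand-in for the real request so the sketch runs offline. In practice this
# would wrap the requests.post call and add whatever page parameter the site
# expects (an assumption: that parameter's name still has to be looked up).
FAKE_PAGES = {1: "a\nb", 2: "c"}


def fake_fetch(page):
    return FAKE_PAGES.get(page, "")


def parse_lines(text):
    # Keep only the non-empty lines of a page
    return [line for line in text.split("\n") if line.strip()]


rows = collect_all_pages(fake_fetch, parse_lines)
print(rows)
```

If each record is a dict of field names to values, the combined list can be passed straight to pd.DataFrame. The max_pages cap is there because some sites repeat the last page instead of returning an empty one, which would otherwise loop forever.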
You can use browser.find_elements_by_class_name('yourClassName'), and if you need to wait for those elements to load you can do so with WebDriverWait(browser, secondsToWaitFloat). When you find elements by class name you get a list of elements on which you can call get_attribute('innerText') to get the innerText.
[...] XHR objects. Here is the view: i.imgur.com/BddXtgA.png How can I access these? What would I look for in particular (sorry, I've never used those objects before)?
I inspected individual elements using Google Chrome. I found <div class="col-md-7"><label>Email address: </label>emailaddresshere</div>, for example, to get the e-mail address. Would the class I'm looking for be called col-md-7? Can you show an example of how I would extract that info? Thank you.
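For the col-md-7 question, here is a minimal sketch of pulling label/value pairs out of markup shaped like the snippet quoted above. It uses only the standard library's html.parser so it runs without extra installs; BeautifulSoup's find_all on the same class should work just as well. The names LabelValueParser and extract_fields are invented for this example, and it assumes each matching div holds one label followed by plain text, with no nested divs.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect (label, value) pairs from <div class="col-md-7"> blocks."""

    def __init__(self, target_class="col-md-7"):
        super().__init__()
        self.target_class = target_class
        self.pairs = []         # finished (label, value) tuples
        self._in_div = False    # currently inside a matching <div>
        self._in_label = False  # currently inside that div's <label>
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.target_class in classes:
            self._in_div, self._label, self._value = True, [], []
        elif tag == "label" and self._in_div:
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False
        elif tag == "div" and self._in_div:
            # Assumes no nested <div> inside the matching one
            label = "".join(self._label).strip().rstrip(":").strip()
            value = "".join(self._value).strip()
            self.pairs.append((label, value))
            self._in_div = False

    def handle_data(self, data):
        if self._in_label:
            self._label.append(data)
        elif self._in_div:
            self._value.append(data)


def extract_fields(html):
    """Return a dict mapping each label to the text that follows it."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    return dict(parser.pairs)


sample = '<div class="col-md-7"><label>Email address: </label>emailaddresshere</div>'
print(extract_fields(sample))
```

The html argument could be browser.page_source from selenium or r.text from requests. One dict per facility would then give one row each in a pandas DataFrame.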