
I'm trying to figure out how to scrape the data from the following url: https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx

Here is the type of data:

[screenshot of the NICU search results table]

It appears that everything is populated from a database and loaded into the webpage via JavaScript.

I've done something similar in the past using Selenium and PhantomJS, but I can't figure out how to get these data fields in Python.

As expected, pd.read_html doesn't work for this type of problem.

Is it possible to parse the results from:

from selenium import webdriver

url = "https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"
browser = webdriver.PhantomJS()
browser.get(url)
content = browser.page_source

Or maybe to access the actual underlying data?

If not, what are other approaches short of copy and pasting for hours?

EDIT:

Building on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to go across all of the pages (and any recommendations on how to parse the results properly)? My end goal is to have this in a pandas DataFrame.

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.post(
    url='https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},
)  # key:value
html = r.text

# Parse everything after the last </script> tag
soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
divs = soup.findAll("div", {"class": "blue-border panel list-group"})

def f(x):
    ignore_fields = ['Collapse all', 'Expand all']
    output = list(filter(bool, map(str.strip, x.text.split("\n"))))
    output = list(filter(lambda field: field not in ignore_fields, output))
    return output

results = pd.Series(list(map(f, divs))[0])
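One way to cover every page is to keep re-posting the search with an incrementing page parameter until a response comes back with no result panels. This is a sketch, not a confirmed recipe: the paging field name (`Page` below) is a guess, since the real key has to be read out of the XHR payload in the browser's network tab.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://search.aap.org/nicu/"
IGNORE_FIELDS = {"Collapse all", "Expand all"}

def parse_panels(html):
    """Pure parsing step: return one list of text fields per results panel."""
    soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
    panels = soup.find_all("div", {"class": "blue-border panel list-group"})
    return [
        [t for t in (s.strip() for s in p.get_text("\n").split("\n"))
         if t and t not in IGNORE_FIELDS]
        for p in panels
    ]

def fetch_all_pages(level="1", max_pages=100):
    """Fetch results page by page until a page comes back empty.
    NOTE: 'Page' is a hypothetical paging key -- confirm the real field
    name in the POST payload shown in the browser's dev tools."""
    rows = []
    for page in range(1, max_pages + 1):
        r = requests.post(URL, data={
            "SearchCriteria.Level": level,
            "Page": str(page),  # hypothetical paging field
            "X-Requested-With": "XMLHttpRequest",
        })
        parsed = parse_panels(r.text)
        if not parsed:
            break  # no panels -> past the last page
        rows.extend(parsed)
    return rows
```

Splitting the parsing out into `parse_panels` keeps the network call separate from the (testable) HTML handling; the collected lists can then be fed into `pd.DataFrame`.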
  • If you're using a webdriver you can take advantage of browser.find_elements_by_class_name('yourClassName'), and if you need to wait for those elements to load you can do so with WebDriverWait(browser, secondsToWaitFloat). When you find elements by class name you get a list of elements, on which you can call get_attribute('innerText') to get the inner text.
    – libby
    Commented Oct 20, 2020 at 20:01
  • Try using the network tab on your browser dev tools to view what XHR calls are being made. There is probably a public API that the page is getting the information from which you can scrape.
    – thenullptr
    Commented Oct 20, 2020 at 20:04
  • @thenullptr there are a few XHR objects. Here is the view: i.imgur.com/BddXtgA.png How can I access these? What would I look for in particular (sorry, I've never used those objects before).
    – O.rka
    Commented Oct 20, 2020 at 20:13
  • @libby How would you recommend finding the classes? I've inspected individual elements using Google Chrome. For example, I found <div class="col-md-7"><label>Email address: </label>emailaddresshere</div> for the e-mail address. Would the class I'm looking for be called col-md-7? Can you show an example of how I would extract that info? Thank you.
    – O.rka
    Commented Oct 20, 2020 at 20:17
  • @O.rka I think this may help: imgur.com/a/C4fJQhn As far as I can see, the page (search.aap.org/nicu) accepts a POST request to search, then as you can see in the screenshot, the page replies with the HTML which you can see in the response tab.
    – thenullptr
    Commented Oct 20, 2020 at 20:25

1 Answer

To follow on from my last comment, the below should give you a good starting point. When looking through the XHR calls, you just want to see what data is being sent and received by each one to pinpoint the one you need. The below is the raw POST data sent to the API when doing a search; it looks like you need to supply at least one search criterion and always include the last key.

{
    "SearchCriteria.Name": "smith",
    "SearchCriteria.City": "",
    "SearchCriteria.State": "",
    "SearchCriteria.Zip": "",
    "SearchCriteria.Level": "",
    "SearchCriteria.LevelAssigner": "",
    "SearchCriteria.BedNumberRange": "",
    "X-Requested-With": "XMLHttpRequest"
}

Here is a simple example of how you can send a POST request using the requests library. The web page will reply with the raw HTML, so you can use BeautifulSoup or similar to parse it and extract the information you need.

import requests

r = requests.post(
    'https://search.aap.org/nicu/',
    data={'SearchCriteria.Name': 'smith', 'X-Requested-With': 'XMLHttpRequest'},
)  # key:value
print(r.text)

prints <strong class="col-md-8 white-text">JOHN PETER SMITH HOSPITAL</strong>...
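To pull a specific field out of that reply, a minimal parsing sketch follows. It assumes, based only on the sample output above, that facility names sit in <strong> tags carrying the white-text class; other fields (addresses, levels, bed counts) will need their own selectors, found by inspecting the real markup.

```python
from bs4 import BeautifulSoup

def facility_names(html):
    """Extract facility names from the search response. The white-text
    class on <strong> is inferred from the sample output above; inspect
    the actual markup for other fields."""
    soup = BeautifulSoup(html, "html.parser")
    return [s.get_text(strip=True)
            for s in soup.find_all("strong", class_="white-text")]

# e.g. facility_names(r.text) after the POST above
```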

https://requests.readthedocs.io/en/master/user/quickstart/

  • Thanks for this. Which part are you actually using data dictionary?
    – O.rka
    Commented Oct 20, 2020 at 21:48
  • data is just a parameter of the post function, e.g. r = requests.post('https://httpbin.org/post', data = {'key':'value'}), if that's what you mean
    – thenullptr
    Commented Oct 20, 2020 at 21:58
  • Oh, I don't know how I didn't catch that. Do you recommend any way to parse the r.text? I thought the most effective way to do this was to start with BedNumberRange as 1 and then move upwards. Is the output HTML or JavaScript? At first I thought it was XML and tried using XML trees, but the format was off.
    – O.rka
    Commented Oct 21, 2020 at 2:41
  • I'm trying to use BeautifulSoup: soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html") div = soup.find("div", {"id": "main"}) Is this the best way to parse the results? I'm also not sure how to use the Name search, as I found a name and then searched but got no results, although your smith example works perfectly.
    – O.rka
    Commented Oct 21, 2020 at 2:51
  • I'm also not clear how to access different page numbers
    – O.rka
    Commented Oct 21, 2020 at 5:11
