Selenium PhantomJS webdriver failing to grab ajax content

Question

I am trying to scrape a page that loads most of its content via ajax.

I am trying to grab all li nodes with a data-section attribute from this webpage, for example. The response html has six required nodes that I need, but the majority of the rest are loaded via an ajax request which returns html containing the remaining li nodes.

So I switched from using requests to using Selenium with the PhantomJS driver, as it's supposed to be XHR-friendly, but I am not getting the extra ajax-loaded content.

Runnable:

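from selenium import webdriver
from lxml import html

# url is the address of the page being scraped
br = webdriver.PhantomJS()
br.get(url)
# Parse whatever HTML the browser holds right after the page load returns
tree = html.fromstring(br.page_source)
print(tree.xpath('//li[@data-section]/a/text()'))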

In brief, the above code does not get the HTML that is injected into the page via XHR. How can I make it do so? If that is not possible, what are my other headless options?

Possible duplicate of Waiting for a table to load completely using selenium with python
– Artjom B., Commented Nov 15, 2014 at 17:41
@ArtjomB. Thanks, although in that question there is a unique table that the expected condition can check for; here there seems to be an arbitrary number of identical li elements being loaded. Would you have any hints on how to check for that? An EC solution sounds better than an implicit wait, which will slow down the crawl.
– pad, Commented Nov 15, 2014 at 17:48
Since the number of elements is not known in advance, you should try it with an implicit wait. I don't know your site; if there is nothing that can be used as a condition, then you need to use an implicit wait.
– Artjom B., Commented Nov 15, 2014 at 17:51
@ArtjomB. I just edited that into my earlier comment. I am going to crawl tens of thousands of pages, and an implicit wait doesn't sound very attractive. The network is erratic, so I would have to set a crippling implicit-wait value to account for slow periods, which would also drag down the crawl when the network is good.
– pad, Commented Nov 15, 2014 at 17:54

Community · Accepted Answer · 2017-05-23 12:25:33Z

8

The linked page prominently displays a loading spinner (.archive_loading_bar) which vanishes as soon as the data is loaded. You can use an explicit wait with the expected condition of invisibility_of_element_located.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from lxml import html

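# url is the page being scraped, as in the question
driver = webdriver.PhantomJS()
driver.get(url)
# Allow up to 10 seconds for the loading spinner to disappear
wait = WebDriverWait(driver, 10)
# The locator must be passed as a single (By, selector) tuple
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.archive_loading_bar')))
# The ajax-loaded li nodes should now be part of the page source
tree = html.fromstring(driver.page_source)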

This is adapted from this answer. It waits until the spinner disappears, which signals that the data has loaded, and raises a TimeoutException if that has not happened within 10 seconds.
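
If a page has no spinner to key on, the case raised in the comments (an unknown number of identical li elements) could be handled by polling the element count instead. The following is only a rough sketch, not part of Selenium: wait_for_more_elements is a hypothetical helper, baseline is assumed to be the number of nodes present in the initial HTML (six on the page in the question), and driver is assumed to be any object with Selenium's find_elements(by, value) method.

```python
import time


def wait_for_more_elements(driver, css, baseline, timeout=10.0, poll=0.5):
    """Poll until more than `baseline` elements match `css`.

    Hypothetical helper, not a Selenium API. Returns the number of
    matching elements, or raises TimeoutError if the count has not
    exceeded `baseline` once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR
        count = len(driver.find_elements("css selector", css))
        if count > baseline:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "still %d elements matching %r after %.1fs" % (count, css, timeout)
            )
        time.sleep(poll)
```

With the question's driver this would be called as wait_for_more_elements(br, 'li[data-section]', baseline=6) before reading br.page_source. Note that it returns as soon as the first extra node appears, so if the site inserts the ajax response in several chunks the result may be incomplete; where a spinner exists, waiting for it to vanish is probably the more reliable signal.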

edited May 23, 2017 at 12:25

answered Nov 15, 2014 at 18:08

Artjom B.

1 Comment

pad Over a year ago

Thanks a lot. For any future viewers: the driver on line 6 should be replaced by br, and the arguments to invisibility_of_element_located should be a tuple (it only accepts one argument), so an extra pair of brackets must be added.
