I am trying to scrape a page that loads most of its content via ajax.
For example, I am trying to grab all li nodes with a data-section attribute from this webpage. The initial response HTML contains six of the nodes I need, but most of the rest are loaded by an AJAX request that returns HTML containing the remaining li nodes.
So I switched from requests to Selenium with the PhantomJS driver, as it is supposed to be XHR-friendly, but I am still not getting the extra AJAX-loaded content.
Runnable:
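from selenium import webdriver
from lxml import html
# url is the address of the page being scraped (the one linked above)
br = webdriver.PhantomJS()
br.get(url)
# page_source is read right after get(), without waiting for any XHR responses
tree = html.fromstring(br.page_source)
print(tree.xpath('//li[@data-section]/a/text()'))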
In brief, the code above does not pick up HTML injected into the page via XHR. How can I make it do so? If that is not possible, what other headless options do I have?
li elements being loaded. Would you have any hints on how to check for that? An EC (expected_conditions) solution sounds better than an implicit wait, which would slow down the crawl.
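One possible way to check that the li elements have finished loading is to poll the element count until it rises above the number present in the initial HTML and then stops changing. The sketch below is only an illustration under that assumption: the helper name wait_for_stable_count and its parameters (baseline, timeout, poll, settle) are hypothetical and not part of Selenium, and it takes any zero-argument callable so it can be tried without a browser.

```python
import time


def wait_for_stable_count(count_fn, baseline=0, timeout=10.0, poll=0.5, settle=3):
    """Poll count_fn() until it returns the same value, greater than
    `baseline`, `settle` times in a row, or until `timeout` seconds pass.

    Returns the last count seen (which may still equal the baseline if
    the timeout was reached first).
    """
    deadline = time.time() + timeout
    last = count_fn()
    # A reading only counts towards the streak once it exceeds the baseline.
    streak = 1 if last > baseline else 0
    while streak < settle and time.time() < deadline:
        time.sleep(poll)
        current = count_fn()
        if current > baseline and current == last:
            streak += 1  # same value again: one step closer to "stable"
        else:
            streak = 1 if current > baseline else 0  # value changed: restart
        last = current
    return last


# Hypothetical usage with the driver from the question (assumes `br` and
# `url` exist, and that six nodes are present before the XHR completes):
#
#   br.get(url)
#   wait_for_stable_count(
#       lambda: len(br.find_elements("xpath", "//li[@data-section]")),
#       baseline=6,
#   )
#   tree = html.fromstring(br.page_source)
```

Unlike a fixed sleep, this returns as soon as the count settles, so fast pages do not pay the full timeout. If the final number of nodes is known in advance, a plain WebDriverWait(br, timeout).until(...) with a condition on that number would likely be simpler.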