Code Flow:
- you are redefining links on every iteration of the while loop - you only need to do it once
- as a while loop exit condition, we can use the fact that there are line numbers in the company list grid - we can simply wait for the number 1000 to show up while scrolling
- I would also create a class to have the driver and WebDriverWait instance shared across the class instance methods
- instead of a hardcoded 3-second delay, use an Explicit Wait with a condition of the last line number changing - this would be much faster and more reliable overall
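To make the Explicit Wait point concrete, here is a minimal sketch of the polling idea behind WebDriverWait.until(): it repeatedly calls a condition until the condition returns a truthy value or a timeout expires. The StubWait and FakeDriver classes are hypothetical stand-ins so the behaviour is visible without a browser; the real code passes the same kind of lambda to an actual WebDriverWait instance.

```python
import time


class StubWait:
    """Simplified stand-in for selenium's WebDriverWait polling loop."""
    def __init__(self, driver, timeout, poll_frequency=0.1):
        self.driver = driver
        self.timeout = timeout
        self.poll_frequency = poll_frequency

    def until(self, condition):
        end = time.time() + self.timeout
        while time.time() < end:
            value = condition(self.driver)  # re-evaluate the condition each poll
            if value:
                return value
            time.sleep(self.poll_frequency)
        raise TimeoutError("condition not met within timeout")


class FakeDriver:
    """Pretends the last visible line number grows as the page scrolls."""
    def __init__(self):
        self.last_line_number = 0

    def scroll(self):
        self.last_line_number += 20


driver = FakeDriver()
previous = driver.last_line_number
driver.scroll()

# Same shape as the real call:
#   self.wait.until(lambda driver: self.get_last_line_number() != last_line_number)
result = StubWait(driver, timeout=2).until(
    lambda d: d.last_line_number != previous)
```

Unlike a fixed time.sleep(3), this returns as soon as the condition holds, and fails loudly with a timeout if the page stops loading.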
Code Style:
- the posts variable name does not correspond to what it holds - name it company_link instead
- Name and Web violate the PEP 8 Python naming guidelines
- process_links should be process_link, since you are processing a single link at a time. Better still, name it get_company_data and have it return the data instead of printing it
Locating Elements:
- don't use XPaths to locate elements - they are generally the slowest and the least readable locators
- for the company links, I would use the more readable and concise ul.company-list > li > a CSS selector
- in the process_links method you don't actually need a loop, since a single company is being processed. And, I think, you can generalize and build the dictionary dynamically from the company web page data - mapping data labels to data values
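The label-to-value dictionary idea can be sketched with stub objects in place of real WebElements (StubElement and the rows list below are hypothetical stand-ins; the real version gets its rows from find_elements() on the company page):

```python
class StubElement:
    """Stand-in for a selenium WebElement exposing only .text."""
    def __init__(self, text):
        self.text = text


# Each row on the company page pairs a label element with a data element.
rows = [
    (StubElement("CEO"), StubElement("C. Douglas McMillon")),
    (StubElement("Sector"), StubElement("Retailing")),
]

# Dict comprehension mapping label text to data text - the same pattern
# the scraper uses, so new fields on the page are picked up automatically.
company_data = {label.text: value.text for label, value in rows}
```

Because nothing is hardcoded, adding or removing a field on the page changes the resulting dictionary without any code changes.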
Here is the modified working code:
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Fortune500Scraper:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def get_last_line_number(self):
        """Get the line number of the last company loaded into the list of companies."""
        return int(self.driver.find_element(
            By.CSS_SELECTOR,
            "ul.company-list > li:last-child > a > span:first-child").text)

    def get_links(self, max_company_count=1000):
        """Extract and return company links (up to max_company_count of them)."""
        self.driver.get('http://fortune.com/fortune500/list/')
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.company-list")))

        last_line_number = 0
        while last_line_number < max_company_count:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait.until(lambda driver: self.get_last_line_number() != last_line_number)
            last_line_number = self.get_last_line_number()

        return [company_link.get_attribute("href")
                for company_link in self.driver.find_elements(By.CSS_SELECTOR, "ul.company-list > li > a")]

    def get_company_data(self, company_link):
        """Extract and return company-specific information as a dictionary."""
        self.driver.get(company_link)
        return {
            row.find_element(By.CSS_SELECTOR, ".company-info-card-label").text:
                row.find_element(By.CSS_SELECTOR, ".company-info-card-data").text
            for row in self.driver.find_elements(By.CSS_SELECTOR, ".company-info-card-table > .columns > .row")
        }


if __name__ == '__main__':
    scraper = Fortune500Scraper()

    company_links = scraper.get_links(max_company_count=100)
    for company_link in company_links:
        company_data = scraper.get_company_data(company_link)
        pprint(company_data)
        print("------")
Prints:
{'CEO': 'C. Douglas McMillon',
'CEO Title': 'President, Chief Executive Officer & Director',
'Employees': '2,300,000',
'HQ Location': 'Bentonville, AR',
'Industry': 'General Merchandisers',
'Sector': 'Retailing',
'Website': 'www.walmart.com',
'Years on Fortune 500 List': '23'}
------
{'CEO': 'Warren E. Buffett',
'CEO Title': 'Chairman & Chief Executive Officer',
'Employees': '367,700',
'HQ Location': 'Omaha, NE',
'Industry': 'Insurance: Property and Casualty (Stock)',
'Sector': 'Financials',
'Website': 'www.berkshirehathaway.com',
'Years on Fortune 500 List': '23'}
------
...