Code Flow:
- you are redefining links on every iteration of the while loop - you only need to do it once
- as a while loop exit condition, we can use the fact that there are line numbers in the company list grid - we can simply wait for the number 1000 to show up while scrolling
- I would also create a class to have the driver and WebDriverWait instance shared across the class instance methods
- instead of a hardcoded 3-second delay, use an Explicit Wait with a condition of the last line number changing - this would be much faster and more reliable overall
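To make the Explicit Wait point concrete, here is a minimal sketch of the polling idea behind WebDriverWait.until(): it repeatedly calls a condition until the condition returns a truthy value or a timeout expires. The StubWait and FakeDriver classes are hypothetical stand-ins so the behaviour is visible without a browser; the real code passes the same kind of lambda to an actual WebDriverWait instance.

```python
import time


class StubWait:
    """Simplified stand-in for selenium's WebDriverWait polling loop."""
    def __init__(self, driver, timeout, poll_frequency=0.1):
        self.driver = driver
        self.timeout = timeout
        self.poll_frequency = poll_frequency

    def until(self, condition):
        end = time.time() + self.timeout
        while time.time() < end:
            value = condition(self.driver)  # re-evaluate the condition each poll
            if value:
                return value
            time.sleep(self.poll_frequency)
        raise TimeoutError("condition not met within timeout")


class FakeDriver:
    """Pretends the last visible line number grows as the page scrolls."""
    def __init__(self):
        self.last_line_number = 0

    def scroll(self):
        self.last_line_number += 20


driver = FakeDriver()
previous = driver.last_line_number
driver.scroll()

# Same shape as the real call:
#   self.wait.until(lambda driver: self.get_last_line_number() != last_line_number)
result = StubWait(driver, timeout=2).until(
    lambda d: d.last_line_number != previous)
```

Unlike a fixed time.sleep(3), this returns as soon as the condition holds, and fails loudly with a timeout if the page stops loading.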
Code Style:
- the posts variable name does not correspond to what it holds - name it company_link instead
- Name and Web violate the PEP 8 Python naming guidelines
- process_links should be process_link, since you are processing a single link at a time. Better still, name it get_company_data and have it return the data instead of printing it
Locating Elements:
- don't use XPaths to locate elements - they are generally the slowest and the least readable locators
- for the company links, I would use the more readable and concise ul.company-list > li > a CSS selector
- in the process_links method you don't actually need a loop, since a single company is being processed. And, I think, you can generalize and build the dictionary dynamically from the company web page data - mapping data labels to data values
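The label-to-value dictionary idea can be sketched with stub objects in place of real WebElements (StubElement and the rows list below are hypothetical stand-ins; the real version gets its rows from find_elements() on the company page):

```python
class StubElement:
    """Stand-in for a selenium WebElement exposing only .text."""
    def __init__(self, text):
        self.text = text


# Each row on the company page pairs a label element with a data element.
rows = [
    (StubElement("CEO"), StubElement("C. Douglas McMillon")),
    (StubElement("Sector"), StubElement("Retailing")),
]

# Dict comprehension mapping label text to data text - the same pattern
# the scraper uses, so new fields on the page are picked up automatically.
company_data = {label.text: value.text for label, value in rows}
```

Because nothing is hardcoded, adding or removing a field on the page changes the resulting dictionary without any code changes.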
Here is the modified working code:
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Fortune500Scraper:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def get_last_line_number(self):
        """Get the line number of the last company loaded into the list of companies."""
        return int(self.driver.find_element(
            By.CSS_SELECTOR,
            "ul.company-list > li:last-child > a > span:first-child").text)

    def get_links(self, max_company_count=1000):
        """Extract and return company links (up to max_company_count of them)."""
        self.driver.get('http://fortune.com/fortune500/list/')
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.company-list")))

        last_line_number = 0
        while last_line_number < max_company_count:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait.until(lambda driver: self.get_last_line_number() != last_line_number)
            last_line_number = self.get_last_line_number()

        return [company_link.get_attribute("href")
                for company_link in self.driver.find_elements(By.CSS_SELECTOR, "ul.company-list > li > a")]

    def get_company_data(self, company_link):
        """Extract and return company-specific information as a dictionary."""
        self.driver.get(company_link)
        return {
            row.find_element(By.CSS_SELECTOR, ".company-info-card-label").text:
                row.find_element(By.CSS_SELECTOR, ".company-info-card-data").text
            for row in self.driver.find_elements(By.CSS_SELECTOR, ".company-info-card-table > .columns > .row")
        }


if __name__ == '__main__':
    scraper = Fortune500Scraper()

    company_links = scraper.get_links(max_company_count=100)
    for company_link in company_links:
        company_data = scraper.get_company_data(company_link)
        pprint(company_data)
        print("------")
Prints:
{'CEO': 'C. Douglas McMillon',
'CEO Title': 'President, Chief Executive Officer & Director',
'Employees': '2,300,000',
'HQ Location': 'Bentonville, AR',
'Industry': 'General Merchandisers',
'Sector': 'Retailing',
'Website': 'www.walmart.com',
'Years on Fortune 500 List': '23'}
------
{'CEO': 'Warren E. Buffett',
'CEO Title': 'Chairman & Chief Executive Officer',
'Employees': '367,700',
'HQ Location': 'Omaha, NE',
'Industry': 'Insurance: Property and Casualty (Stock)',
'Sector': 'Financials',
'Website': 'www.berkshirehathaway.com',
'Years on Fortune 500 List': '23'}
------
...