I just wrote a short function that parses the SO jobs XML feed and returns dicts containing information about each job entry.
Each job entry page is then visited to grab info not present in the XML feed (company logo URL, salary info, etc.), using bs4 via a couple of functions in the get_so_extras module.
parse_feed.py:
And here is the helper function, from the get_so_extras module, used to grab company logo URLs.
get_so_extras.py:
It'd be great to understand whether I made any obvious mistakes or bad choices along the way.
Obviously, calling get_company_logo within the XML parsing loop slows things down. What would be the best approach to get those fields?
Checking the db to see whether we're looking at a new company/listing before visiting the page to scrape the logo would probably be more efficient.
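To make that last idea concrete, here is a rough sketch of what I have in mind, not the code I'm actually running. The names `add_logos`, `fetch_logo` and `known_logos` are hypothetical, as are the `"company"` and `"link"` keys on each job dict. `fetch_logo` stands in for get_company_logo, assuming it takes the job page URL, and `known_logos` stands in for whatever the db already knows.

```python
from typing import Callable, Dict, Iterable, List, Optional


def add_logos(
    jobs: Iterable[dict],
    fetch_logo: Callable[[str], Optional[str]],
    known_logos: Optional[Dict[str, Optional[str]]] = None,
) -> List[dict]:
    """Attach a logo URL to each job, scraping each company at most once.

    jobs        -- job dicts with hypothetical "company" and "link" keys
    fetch_logo  -- callable that scrapes a job page and returns a logo URL
                   (stand-in for get_company_logo; signature is assumed)
    known_logos -- company -> logo URL mapping already stored, e.g. loaded
                   from the db; it is updated in place with new companies
    """
    cache = {} if known_logos is None else known_logos
    enriched = []
    for job in jobs:
        company = job["company"]
        # Only visit the job page when this company has not been seen yet.
        if company not in cache:
            cache[company] = fetch_logo(job["link"])
        # Copy the job dict so the parsed feed entries are left untouched.
        enriched.append({**job, "logo_url": cache[company]})
    return enriched
```

This keeps the feed parsing and the scraping as two separate passes, so the slow network calls happen once per unseen company rather than once per listing.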