Added question about method logic
anddt

edit: also, given that job listings all have pretty much the same attributes, would it make sense to create a Listing class with those attributes and a method to push a listing to the db (perhaps with de-duplication logic)?
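To make the idea concrete, here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: the field names, the `listings` table, the use of sqlite3, and the choice of the feed entry's guid as the de-duplication key are not taken from the actual code.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    # Hypothetical fields; the real ones depend on what the feed provides.
    guid: str
    title: str
    company: str
    logo_url: Optional[str] = None

    def save(self, conn: sqlite3.Connection) -> bool:
        """Insert the listing; return False if its guid is already stored."""
        cur = conn.execute(
            "INSERT OR IGNORE INTO listings (guid, title, company, logo_url) "
            "VALUES (?, ?, ?, ?)",
            (self.guid, self.title, self.company, self.logo_url),
        )
        conn.commit()
        # rowcount is 0 when the PRIMARY KEY conflict made SQLite skip the row.
        return cur.rowcount == 1


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) listings table, keyed on guid."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "guid TEXT PRIMARY KEY, title TEXT, company TEXT, logo_url TEXT)"
    )
```

With this shape the de-duplication lives in the database (the primary key plus `INSERT OR IGNORE`), so the parsing loop only has to call `save` and can use the return value to decide whether the listing is new.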

Be ruthless, thanks in advance.


added 2 characters in body
AlexV

I just wrote a short function to parse the SO jobs XML feed and return dicts containing information about each job entry.
Each job entry page is then visited to grab info not present in the XML feed (company logo URL, salary info, etc.) through bs4, with a couple of functions in the get_so_extras module.

parse_feed.py:

And here is the helper function used to grab company logo URLs, from the get_so_extras module.

get_so_extras.py:

It'd be great to understand whether I made any obvious mistakes or bad choices along the way. Obviously, calling get_company_logo within the XML parsing loop slows things down. What would be the best approach to get those fields?
Checking the db to see whether we're looking at a new company/listing before visiting the page to scrape the logo would probably be more efficient.
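The "check before scraping" idea could be sketched roughly as follows. This is only an illustration under assumptions: an in-memory dict stands in for the db lookup, `LogoCache` and `logo_for` are made-up names, and `fetch_logo` is assumed to behave like get_company_logo (job page URL in, logo URL or None out).

```python
from typing import Callable, Dict, Optional


class LogoCache:
    """Remembers logo URLs per company so each company is scraped at most once."""

    def __init__(self, fetch_logo: Callable[[str], Optional[str]]) -> None:
        # fetch_logo is a stand-in for get_company_logo (assumed signature).
        self._fetch_logo = fetch_logo
        # Company name -> logo URL; a db table could play the same role.
        self._known: Dict[str, Optional[str]] = {}

    def logo_for(self, company: str, job_url: str) -> Optional[str]:
        """Return the cached logo URL, scraping only for unseen companies."""
        if company not in self._known:
            self._known[company] = self._fetch_logo(job_url)
        return self._known[company]
```

Inside the parsing loop, the direct call to get_company_logo would then become something like `cache.logo_for(company_name, job_url)`, so the page request happens once per company rather than once per listing.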

Added xml feed url
anddt

I just wrote a short function to parse the SO jobs XML feed and return dicts containing information about each job entry.
Each job entry page is then visited to grab info not present in the XML feed (company logo URL, salary info, etc.) through bs4, with a couple of functions in the get_so_extras module.
parse_feed.py:

added 8 characters in body
anddt