Revisions to XML parser + bs4 combo - Code Review Stack Exchange

Tweeted twitter.com/StackCodeReview/status/1205185731154632704

occurred Dec 12, 2019 at 18:00

Added question about method logic

edited Dec 11, 2019 at 13:36

anddt

Edit: also, given that job listings share much the same attributes, would it make sense to create a Listing class with those attributes and a method to push it to the db (perhaps with de-duplication logic)?
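
A minimal sketch of that idea, assuming SQLite as the db and hypothetical field names (job_id, title, company, url) that are not taken from the original code. De-duplication is left to a primary key plus INSERT OR IGNORE:

```python
import sqlite3
from dataclasses import dataclass


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table layout; job_id as primary key gives de-duplication.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "job_id TEXT PRIMARY KEY, title TEXT, company TEXT, url TEXT)"
    )


@dataclass(frozen=True)
class Listing:
    # Field names are assumptions; the real feed attributes may differ.
    job_id: str
    title: str
    company: str
    url: str

    def save(self, conn: sqlite3.Connection) -> bool:
        """Insert the listing; return False if job_id was already stored."""
        cur = conn.execute(
            "INSERT OR IGNORE INTO listings (job_id, title, company, url) "
            "VALUES (?, ?, ?, ?)",
            (self.job_id, self.title, self.company, self.url),
        )
        conn.commit()
        # rowcount is 0 when the row was ignored as a duplicate.
        return cur.rowcount == 1
```

Calling save() twice with the same job_id stores one row and returns False the second time, so the caller can tell new listings from known ones.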

Be ruthless, thanks in advance.

added 2 characters in body

edited Dec 11, 2019 at 12:10

AlexV

I just wrote a short function to parse the SO jobs XML feed and return dicts containing information about each job entry.
Now, each job entry page is visited to grab info not present in the XML feed (company logo URL, salary info, etc.) through bs4, using a couple of functions in the get_so_extras module.

parse_feed.py:

And here is the helper function used to grab company logo URLs, from the get_so_extras module.

get_so_extras.py:

It'd be great to understand whether I made any obvious mistakes or bad choices along the way. Obviously, calling get_company_logo within the XML parsing loop slows things down; what would be the best approach to get those fields?
Checking whether we're looking at a new company/listing in the db before visiting the page to scrape the logo would probably be more efficient.
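
The "check before scraping" idea could be sketched roughly as below. An in-memory dict stands in for the db lookup, and make_cached_logo_lookup, fetch_logo, company and job_url are hypothetical names; the real signature of get_company_logo is not shown here, so the scraper is passed in as a plain callable:

```python
from typing import Callable, Dict


def make_cached_logo_lookup(
    fetch_logo: Callable[[str], str],
) -> Callable[[str, str], str]:
    """Wrap a slow scraper so each company's page is visited at most once."""
    cache: Dict[str, str] = {}  # stand-in for a db table keyed by company

    def lookup(company: str, job_url: str) -> str:
        # Only hit the network when the company has not been seen before.
        if company not in cache:
            cache[company] = fetch_logo(job_url)
        return cache[company]

    return lookup
```

Inside the parsing loop one would then call lookup(company, job_url) instead of the scraper directly, so repeated listings from the same company cost a dict lookup rather than an HTTP request.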

Added xml feed url

edited Dec 11, 2019 at 11:59

anddt

I just wrote a short function to parse the SO jobs XML feed and return dicts containing information about each job entry.
Now, each job entry page is visited to grab info not present in the XML feed (company logo URL, salary info, etc.) through bs4, using a couple of functions in the get_so_extras module.
parse_feed.py:

added 8 characters in body

edited Dec 11, 2019 at 10:52

anddt

asked Dec 11, 2019 at 10:41

anddt
