2
votes
Accepted
How to reverse engineer URL routes from a bulk of HTTP requests/responses
So in generic terms you are looking for a fitness function to determine the probability that a web request will be handled by a code path that has not already been probed, based on the URL and the set ...
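A minimal sketch of what such a fitness function could look like, assuming routes are approximated by collapsing ID-like path segments into templates; the `route_template` and `NoveltyScorer` names and the scoring formula are illustrative, not taken from the original answer:

```python
import re
from urllib.parse import urlparse

def route_template(url):
    """Collapse a URL path into a coarse route template by replacing
    segments that look like IDs (numbers, long hex strings) with a placeholder."""
    path = urlparse(url).path
    segments = []
    for seg in path.strip("/").split("/"):
        if re.fullmatch(r"\d+|[0-9a-f]{8,}", seg, re.IGNORECASE):
            segments.append("{id}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)

class NoveltyScorer:
    """Scores candidate URLs by how likely they are to hit a route
    (code path) that has not already been probed."""
    def __init__(self):
        self.seen_templates = {}

    def record(self, url):
        t = route_template(url)
        self.seen_templates[t] = self.seen_templates.get(t, 0) + 1

    def score(self, url):
        # Unseen templates score 1.0; heavily probed ones approach 0.
        t = route_template(url)
        return 1.0 / (1 + self.seen_templates.get(t, 0))
```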
1
vote
How to Build a Polite, Per-Domain Rate-Limited Web Crawler with Airflow and Celery?
This design question is generic across languages and technologies;
the fact that you're using specific Python libs isn't relevant.
A work table contains (domain, url) pairs,
and at any instant only a ...
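A minimal sketch of the per-domain politeness idea, assuming an in-memory work queue rather than the database-backed work table the answer describes; the `PoliteScheduler` name and the fixed-delay policy are illustrative assumptions:

```python
import time
from collections import deque, defaultdict

class PoliteScheduler:
    """Hands out (domain, url) work items while enforcing a per-domain delay,
    so each domain sees at most one request per `delay_seconds`."""
    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)        # domain -> pending URLs
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch time

    def add(self, domain, url):
        self.queues[domain].append(url)

    def next_item(self):
        """Return a (domain, url) pair whose domain is currently eligible,
        or None if every domain is still inside its politeness window."""
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and now >= self.next_allowed[domain]:
                self.next_allowed[domain] = now + self.delay
                return domain, queue.popleft()
        return None
```

A worker loop would call `next_item()` repeatedly, sleep briefly when it returns None, and feed newly discovered URLs back in via `add()`.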
1
vote
How to check if user is logged in after logging in using HTTP POST?
Instead of the pessimistic approach (check before every access) you might want to use an optimistic approach: just access the URLs you want, and if you get an HTTP 403 error you know that your ...
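A small sketch of the optimistic approach, assuming a `requests` session that already carries the login cookies; the helper name and the choice to treat 401 as well as 403 as "not logged in" are assumptions:

```python
import requests

def fetch_if_authorized(session, url):
    """Optimistic access: request the resource directly and treat an
    HTTP 401/403 as 'not logged in' instead of checking beforehand."""
    response = session.get(url)
    if response.status_code in (401, 403):
        return None  # caller decides whether to (re)authenticate and retry
    response.raise_for_status()
    return response
```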
1
vote
How large-scale production web crawlers improve performance when encountering already-visited links
If you are hitting the DB multiple times per page, you are doing it wrong. Try sending the list of links to the DB in one query.
This will be a MERGE ... WHEN NOT MATCHED operation.
At some point ...
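A hedged sketch of batching the discovered links into a single MERGE, assuming a DB-API cursor and a `visited_urls` table; the table and column names are illustrative, the MERGE syntax shown is SQL-Server-style, and the `%s` placeholder style depends on your driver:

```python
def record_links(cursor, links):
    """Send the whole batch of newly discovered links in one statement;
    the MERGE inserts only URLs not already present in visited_urls."""
    placeholders = ", ".join(["(%s)"] * len(links))
    cursor.execute(
        f"""
        MERGE INTO visited_urls AS target
        USING (VALUES {placeholders}) AS source(url)
        ON target.url = source.url
        WHEN NOT MATCHED THEN
            INSERT (url) VALUES (source.url);
        """,
        links,
    )
```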
Related Tags
web-crawler × 23
web-scraping × 7
architecture × 4
java × 3
crawlers × 3
algorithms × 2
php × 2
http × 2
concurrency × 2
design × 1
javascript × 1
python × 1
programming-practices × 1
.net × 1
programming-languages × 1
security × 1
data-structures × 1
performance × 1
asp.net-mvc × 1
web × 1
efficiency × 1
search × 1
artificial-intelligence × 1
automation × 1
problem-solving × 1