2
votes
Accepted
How to reverse engineer URL routes from a bulk of HTTP requests/responses
So in generic terms you are looking for a fitness function to determine the probability that a web request will be handled by a code path that has not already been probed, based on the URL and the set ...
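A minimal sketch of what such a fitness function could look like, assuming routes are approximated by collapsing ID-like path segments into templates; the `route_template` and `NoveltyScorer` names and the scoring formula are illustrative, not taken from the original answer:

```python
import re
from urllib.parse import urlparse

def route_template(url):
    """Collapse a URL path into a coarse route template by replacing
    segments that look like IDs (numbers, long hex strings) with a placeholder."""
    path = urlparse(url).path
    segments = []
    for seg in path.strip("/").split("/"):
        if re.fullmatch(r"\d+|[0-9a-f]{8,}", seg, re.IGNORECASE):
            segments.append("{id}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)

class NoveltyScorer:
    """Scores candidate URLs by how likely they are to hit a route
    (code path) that has not already been probed."""
    def __init__(self):
        self.seen_templates = {}

    def record(self, url):
        t = route_template(url)
        self.seen_templates[t] = self.seen_templates.get(t, 0) + 1

    def score(self, url):
        # Unseen templates score 1.0; heavily probed ones approach 0.
        t = route_template(url)
        return 1.0 / (1 + self.seen_templates.get(t, 0))
```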
1
vote
How to Build a Polite, Per-Domain Rate-Limited Web Crawler with Airflow and Celery?
This design question is generic across languages and technologies;
the fact that you're using specific Python libs isn't relevant.
A work table contains (domain, url) pairs,
and at any instant only a ...
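A minimal sketch of the per-domain politeness idea, assuming an in-memory work queue rather than the database-backed work table the answer describes; the `PoliteScheduler` name and the fixed-delay policy are illustrative assumptions:

```python
import time
from collections import deque, defaultdict

class PoliteScheduler:
    """Hands out (domain, url) work items while enforcing a per-domain delay,
    so each domain sees at most one request per `delay_seconds`."""
    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)        # domain -> pending URLs
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch time

    def add(self, domain, url):
        self.queues[domain].append(url)

    def next_item(self):
        """Return a (domain, url) pair whose domain is currently eligible,
        or None if every domain is still inside its politeness window."""
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and now >= self.next_allowed[domain]:
                self.next_allowed[domain] = now + self.delay
                return domain, queue.popleft()
        return None
```

A worker loop would call `next_item()` repeatedly, sleep briefly when it returns None, and feed newly discovered URLs back in via `add()`.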
1
vote
How to check if user is logged in after logging in using HTTP POST?
Instead of the pessimistic approach (check before every access) you might want to use an optimistic approach: just access the URLs you want, and if you get an HTTP 403 error you know that your ...
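A small sketch of the optimistic approach, assuming a `requests` session that already carries the login cookies; the helper name and the choice to treat 401 as well as 403 as "not logged in" are assumptions:

```python
import requests

def fetch_if_authorized(session, url):
    """Optimistic access: request the resource directly and treat an
    HTTP 401/403 as 'not logged in' instead of checking beforehand."""
    response = session.get(url)
    if response.status_code in (401, 403):
        return None  # caller decides whether to (re)authenticate and retry
    response.raise_for_status()
    return response
```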
1
vote
How large-scale production web crawlers improve performance when encountering already-visited links
If you are hitting the DB multiple times per page, you are doing it wrong. Try sending the list of links to the DB in one query.
This will be a MERGE ... WHEN NOT MATCHED operation.
At some point ...
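A hedged sketch of batching the discovered links into a single MERGE, assuming a DB-API cursor and a `visited_urls` table; the table and column names are illustrative, the MERGE syntax shown is SQL-Server-style, and the `%s` placeholder style depends on your driver:

```python
def record_links(cursor, links):
    """Send the whole batch of newly discovered links in one statement;
    the MERGE inserts only URLs not already present in visited_urls."""
    placeholders = ", ".join(["(%s)"] * len(links))
    cursor.execute(
        f"""
        MERGE INTO visited_urls AS target
        USING (VALUES {placeholders}) AS source(url)
        ON target.url = source.url
        WHEN NOT MATCHED THEN
            INSERT (url) VALUES (source.url);
        """,
        links,
    )
```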
Related Tags
web-crawler × 23
web-scraping × 7
architecture × 4
java × 3
crawlers × 3
algorithms × 2
php × 2
http × 2
concurrency × 2
design × 1
javascript × 1
python × 1
programming-practices × 1
.net × 1
programming-languages × 1
security × 1
data-structures × 1
performance × 1
asp.net-mvc × 1
web × 1
efficiency × 1
search × 1
artificial-intelligence × 1
automation × 1
problem-solving × 1