
Questions tagged [web-crawler]

2 votes
2 answers
208 views

I'm designing a "polite" web crawler using Airflow with the Celery Executor, PostgreSQL for metadata and actual content used by the crawler, and Redis as the Celery broker. My goal is to ...
asked by sebap123
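
For a setup like the one described above, a minimal sketch of an Airflow DAG with a per-domain politeness delay might look like the following (assuming Airflow 2.4+ with the CeleryExecutor already configured; the DAG id, domain list, and delay are illustrative):

```python
# Minimal sketch: one task per domain, each sleeping between fetches to stay
# "polite". Assumes Airflow 2.4+ with the CeleryExecutor set up in airflow.cfg.
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def crawl_domain(domain, crawl_delay=5.0, **_):
    # Placeholder fetch loop: keep a fixed delay between requests to one host.
    for url in [f"https://{domain}/", f"https://{domain}/about"]:
        print(f"fetching {url}")
        time.sleep(crawl_delay)


with DAG(
    dag_id="polite_crawler",
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=6),
    catchup=False,
    max_active_tasks=4,  # cap concurrency across the Celery workers
) as dag:
    for domain in ["example.com", "example.org"]:  # illustrative domains
        PythonOperator(
            task_id=f"crawl_{domain.replace('.', '_')}",
            python_callable=crawl_domain,
            op_kwargs={"domain": domain},
        )
```
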
0 votes
1 answer
917 views

I'm developing a scraping app to extract some information from a site. To get that information I have to be logged in to that site, so I use an HTTP POST and pass the data needed for login using FormData ...
asked by alexpfx
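
One common pattern for this kind of login-then-scrape flow (shown here in Python with requests purely as an illustration; the URLs and form field names are placeholders for whatever the target site uses) is to POST the form fields inside a session so the auth cookie is reused:

```python
# Illustrative only: log in by POSTing form fields, then reuse the session
# cookie for the pages that need authentication. URLs and field names are
# placeholders.
import requests

LOGIN_URL = "https://example.com/login"      # placeholder
PROTECTED_URL = "https://example.com/data"   # placeholder

with requests.Session() as session:
    resp = session.post(
        LOGIN_URL,
        data={"username": "user", "password": "secret"},  # form-encoded body
        timeout=10,
    )
    resp.raise_for_status()

    # The session now carries the cookie set by the login response.
    page = session.get(PROTECTED_URL, timeout=10)
    print(page.status_code, len(page.text))
```
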
0 votes
1 answer
197 views

I have built a very basic web crawler that runs on my laptop, so it has limited memory and limited hard drive space. Right now I'm using MongoDB to store the links I find on pages. I make ...
asked by Lokasa Mawati
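
For a link frontier kept in MongoDB on a memory-constrained machine, one common trick is to let the database deduplicate URLs with a unique index instead of holding a seen-set in RAM. A sketch with pymongo (the database, collection, and field names are made up):

```python
# Sketch: store discovered links in MongoDB and let a unique index do the
# deduplication, so the crawler needs no in-memory "seen" set.
from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017")
links = client["crawler"]["links"]

# One-time setup: URLs must be unique; "crawled" lets us pull pending work.
links.create_index([("url", ASCENDING)], unique=True)
links.create_index([("crawled", ASCENDING)])

def enqueue(url):
    try:
        links.insert_one({"url": url, "crawled": False})
    except DuplicateKeyError:
        pass  # already known, nothing to do

def next_batch(n=100):
    # Pull a small batch so memory use stays bounded.
    return list(links.find({"crawled": False}, limit=n))
```
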
2 votes
1 answer
1k views

I am building a web application crawler that crawls for HTTP requests (GET, PUT, POST, ...). It is designed for one specific purpose: bug bounty hunting. It enables pentesters to insert exploit ...
asked by Tijme
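
One way to enumerate the requests a page can trigger is to collect its links and forms and record the method, URL, and parameter names for later payload insertion. A rough sketch with requests and BeautifulSoup (this is illustrative, not the asker's actual design):

```python
# Rough sketch: collect candidate HTTP requests (method, URL, parameter names)
# from a page's links and forms, as a starting point for payload insertion.
from urllib.parse import urljoin, urlsplit, parse_qs

import requests
from bs4 import BeautifulSoup

def extract_requests(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    found = []

    # Plain links become GET requests whose parameters sit in the query string.
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])
        params = list(parse_qs(urlsplit(url).query))
        found.append({"method": "GET", "url": url, "params": params})

    # Forms carry their own method, action, and named fields.
    for form in soup.find_all("form"):
        action = urljoin(page_url, form.get("action") or page_url)
        method = (form.get("method") or "GET").upper()
        params = [i.get("name") for i in form.find_all(["input", "textarea", "select"])
                  if i.get("name")]
        found.append({"method": method, "url": action, "params": params})

    return found

if __name__ == "__main__":
    for req in extract_requests("https://example.com"):
        print(req)
```
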
1 vote
0 answers
1k views

The following is an example using https://github.com/GoogleChrome/puppeteer: 'use strict'; const puppeteer = require('puppeteer'); (async () => { // const browser = await puppeteer.launch(); // ...
asked by alex
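
An equivalent headless-browser fetch can be sketched in Python with pyppeteer, a port of Puppeteer (used here only to keep the examples in one language); the URL is a placeholder:

```python
# Sketch using pyppeteer (a Python port of Puppeteer): open a headless
# browser, load a page, and collect its links.
import asyncio

from pyppeteer import launch

async def crawl(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, {"waitUntil": "networkidle2"})
    links = await page.evaluate(
        "() => Array.from(document.querySelectorAll('a[href]'), a => a.href)"
    )
    await browser.close()
    return links

if __name__ == "__main__":
    print(asyncio.get_event_loop().run_until_complete(crawl("https://example.com")))
```
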
1 vote
1 answer
1k views

So, as part of my final year project, I'm writing a web crawler in Java to gather website data that I will then process. One of the attributes I need to gather is "number of popups". I know a pop-up ...
asked by Sophie Brown
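
A crude heuristic for "number of popups" is to count calls such as window.open() or alert() in the page's inline and linked scripts. The sketch below (in Python for brevity, not the asker's Java crawler) does that with a regular expression; it is only an approximation, since scripts can build such calls dynamically:

```python
# Crude heuristic: count window.open()/alert()-style calls in a page's inline
# and external scripts as a proxy for "number of popups".
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

POPUP_CALL = re.compile(r"\b(?:window\.open|alert|confirm|prompt)\s*\(")

def count_popups(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    scripts = []

    for tag in soup.find_all("script"):
        if tag.get("src"):  # external script: fetch it too
            scripts.append(requests.get(urljoin(page_url, tag["src"]), timeout=10).text)
        else:               # inline script
            scripts.append(tag.get_text())

    return sum(len(POPUP_CALL.findall(s)) for s in scripts)

if __name__ == "__main__":
    print(count_popups("https://example.com"))
```
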
3 votes
1 answer
221 views

I am building a web spider to crawl through several different sites, but one of them uses JavaScript buttons instead of links for several functions. While I could learn to follow them, it adds an ...
asked by Devon M
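
Buttons wired up with JavaScript usually cannot be followed from the raw HTML, so a common workaround is to drive a real browser and click them, for example with Selenium. A small sketch (the URL and CSS selector are placeholders):

```python
# Sketch: use Selenium to click JavaScript-driven buttons and record where
# they lead. The URL and the CSS selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    driver.get("https://example.com")
    count = len(driver.find_elements(By.CSS_SELECTOR, "button[onclick]"))

    for i in range(count):
        # Re-locate the buttons each time, since clicking may navigate away.
        button = driver.find_elements(By.CSS_SELECTOR, "button[onclick]")[i]
        button.click()
        print("button", i, "led to", driver.current_url)
        driver.back()
finally:
    driver.quit()
```
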
4 votes
1 answer
338 views

I am currently working on a pet project in Python with Scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites do seem to provide more ...
asked by nikitautiu
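
When several similar sites expose the same kind of listing, one common Scrapy pattern is a single spider with a per-domain selector table. A minimal sketch (the domains and selectors are invented):

```python
# Minimal Scrapy sketch: one spider, a per-domain table of CSS selectors.
# Domains and selectors are invented placeholders.
from urllib.parse import urlsplit

import scrapy

SELECTORS = {
    "site-a.example": {"offer": "div.listing", "price": "span.price::text"},
    "site-b.example": {"offer": "li.offer", "price": ".amount::text"},
}

class RealEstateSpider(scrapy.Spider):
    name = "real_estate"
    start_urls = ["https://site-a.example/offers", "https://site-b.example/search"]

    def parse(self, response):
        # Pick the selector set that matches the responding host.
        rules = SELECTORS[urlsplit(response.url).hostname]
        for offer in response.css(rules["offer"]):
            yield {
                "url": response.url,
                "price": offer.css(rules["price"]).get(),
            }
```
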
5 votes
1 answer
752 views

I'm building a SPA (single-page application), so when a browser requests a page from my server, it only receives a small HTML file and a big JavaScript app that then fetches the appropriate data from the ...
asked by Pablo Fernandez
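
One way to keep such a SPA crawlable is dynamic rendering: serve prerendered HTML to known crawler user agents and the normal JS shell to everyone else. A toy Flask sketch of the idea (the bot list and file paths are placeholders, and the prerendered snapshot is assumed to exist):

```python
# Toy sketch of "dynamic rendering": crawlers get a prerendered snapshot,
# normal browsers get the SPA shell. Bot list and file names are placeholders.
from flask import Flask, request, send_file

app = Flask(__name__)
BOT_MARKERS = ("googlebot", "bingbot", "duckduckbot")

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def spa(path):
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(bot in ua for bot in BOT_MARKERS):
        # Snapshot produced ahead of time (e.g. by a headless browser).
        return send_file("prerendered/index.html")
    return send_file("static/index.html")  # the small HTML + big JS bundle

if __name__ == "__main__":
    app.run()
```
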
-2 votes
1 answer
2k views

I'm currently developing a web crawler. The first version was developed in Node.js and runs pretty well. The issues I encountered with Node.js are, in no particular order: slow URL and query-...
asked by m_vdbeek
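
For comparison, the URL and query-string handling a crawler needs is cheap to express with Python's standard library; the small sketch below shows the usual pieces (splitting a URL, parsing its query, resolving relative links):

```python
# Standard-library URL handling a crawler typically needs.
from urllib.parse import urlsplit, parse_qs, urljoin

url = "https://example.com/search?q=web+crawler&page=2"

parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)   # https example.com /search
print(parse_qs(parts.query))                    # {'q': ['web crawler'], 'page': ['2']}
print(urljoin(url, "../about"))                 # https://example.com/about
```
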
4 votes
0 answers
119 views

Reposted from here, as I think it may be better suited to this exchange. I'm trying to implement DRUM (Disk Repository with Update Management) as described in the IRLBot paper (the relevant pages start at page 4), but as ...
asked by Isaac
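
The core of DRUM is batching keys in memory buckets and merging them against a sorted disk file to decide which are new, instead of doing one disk access per URL. A much-simplified, single-bucket sketch of that check-and-update merge (a toy, not the IRLBot implementation; a real DRUM streams the merge bucket by bucket rather than loading the file):

```python
# Much-simplified DRUM-style idea: buffer URL hashes in memory and merge them
# against a sorted on-disk file in batches, so duplicate checks become
# sequential I/O. Single bucket, whole file loaded: a toy only.
import hashlib
import os

SEEN_FILE = "seen_hashes.txt"   # one hex hash per line
BATCH_SIZE = 4                  # tiny, for illustration

buffer = []

def key(url):
    return hashlib.sha1(url.encode()).hexdigest()

def check_update(url):
    """Buffer the URL's key; when the batch is full, merge and return new keys."""
    buffer.append(key(url))
    if len(buffer) >= BATCH_SIZE:
        return merge()
    return []

def merge():
    """Merge the in-memory batch against the disk file; return unseen keys."""
    on_disk = set()
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            on_disk = {line.strip() for line in f}

    batch = sorted(set(buffer))
    buffer.clear()
    new_keys = [k for k in batch if k not in on_disk]

    with open(SEEN_FILE, "w") as f:
        f.write("\n".join(sorted(on_disk | set(new_keys))) + "\n")
    return new_keys
```
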
2 votes
2 answers
2k views

I'm running a service that crawls many websites daily. The crawlers run as jobs processed by a bunch of independent background worker processes that pick up the jobs as they get enqueued. Now ...
asked by Niels Kristian
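
With many independent workers, per-host politeness is usually coordinated through shared state. One common sketch is a short-lived Redis lock per domain, so only one worker hits a host at a time (the key prefix and delay are illustrative):

```python
# Sketch: coordinate per-domain politeness across independent workers with a
# short-lived Redis lock. Key prefix and delay are illustrative.
import time
from urllib.parse import urlsplit

import redis

r = redis.Redis()

def fetch_politely(url, delay=5):
    host = urlsplit(url).hostname
    lock_key = f"crawl-lock:{host}"

    # SET NX EX: only one worker can hold the per-host lock at a time, and it
    # expires on its own so a crashed worker cannot block the host forever.
    while not r.set(lock_key, "1", nx=True, ex=delay):
        time.sleep(0.5)

    print(f"fetching {url}")  # placeholder for the real download
    # The lock is left to expire after `delay` seconds, which doubles as the
    # minimum gap between two requests to the same host.
```
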
1 vote
2 answers
384 views

I wasn't sure how to phrase the question, but it's basically a textbook scenario. I'm working on a site that's article-based, but the article information is stored in a database. Then the page is ...
asked by Sinaesthetic
-1 votes
1 answer
3k views

I just subscribed to a Facebook page that posts links to different open source projects or code archives. I'd like to save those links and descriptions to a local DB. How can I do that? I heard ...
asked by dole doug
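
Whatever is used to fetch the posts, the "save links and descriptions to a local DB" half is straightforward with SQLite. A small sketch (the table name and sample rows are made up):

```python
# Sketch of the storage half: keep each link and its description in SQLite.
import sqlite3

conn = sqlite3.connect("links.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS links (
           url TEXT PRIMARY KEY,
           description TEXT
       )"""
)

def save_link(url, description):
    # INSERT OR IGNORE: the primary key on url deduplicates reposted links.
    conn.execute("INSERT OR IGNORE INTO links (url, description) VALUES (?, ?)",
                 (url, description))
    conn.commit()

save_link("https://github.com/example/project", "An example open source project")
for row in conn.execute("SELECT url, description FROM links"):
    print(row)
```
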
4 votes
1 answer
13k views

I've been thinking about a side project that involves web data scraping. OK, I read the "Getting data from a webpage in a stable and efficient way" question and the discussion gave me some insights. ...
asked by salaniojr
