Newest 'web-scraping' Questions - Software Engineering Stack Exchange

1 vote

1 answer

249 views

Can I get Open Graph Protocol data without behaving as a web scraper?

In general, to get Open Graph protocol (OGP) data for a given web page, one would need to retrieve the actual HTML, and then extract the meta tags from it. However, this has two problems: Instead of ...

Lamron

119

asked Sep 8 at 21:39
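
The extraction step this question describes (retrieve the HTML, then pull out the meta tags) might look roughly like the sketch below. It uses only Python's standard `html.parser` and assumes the HTML has already been fetched. The names `OpenGraphParser`, `extract_open_graph`, and the sample markup are illustrative and do not come from the question.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Assumption: keep only the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html):
    """Return a dict of Open Graph properties found in the given HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical sample document, standing in for a fetched page.
sample_html = """
<html><head>
<meta property="og:title" content="Example Title">
<meta property="og:type" content="website">
<meta name="description" content="Not an OGP tag">
</head><body></body></html>
"""

og_data = extract_open_graph(sample_html)
```

This only covers parsing; it does not address the question's concern about having to download the whole page first.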

1 vote

2 answers

804 views

Async scraper in Python

I've written a web scraper and would like it to run as quickly as possible. The scraping isn't trivial; I scrape a few web-pages, gather links from them, scrape those, then gather links from those, ...

Pavlin

159

asked Oct 21, 2022 at 9:18
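
One possible shape for the level-by-level crawl described above (scrape pages, gather links, scrape those) is sketched here with `asyncio`. The network is replaced by a hypothetical in-memory `FAKE_SITE` mapping so the example runs offline; `fetch_links`, `crawl`, and the concurrency limit are assumptions for illustration, not the asker's code.

```python
import asyncio

# Hypothetical link graph standing in for real pages: url -> links found on it.
FAKE_SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}


async def fetch_links(url):
    """Stand-in for an HTTP request plus link extraction."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return FAKE_SITE.get(url, [])


async def crawl(seeds, max_depth, concurrency=5):
    """Crawl one level at a time, fetching each level's pages concurrently."""
    semaphore = asyncio.Semaphore(concurrency)
    visited = set(seeds)
    frontier = list(seeds)

    async def bounded_fetch(url):
        # Cap the number of fetches in flight at once.
        async with semaphore:
            return await fetch_links(url)

    for _ in range(max_depth):
        if not frontier:
            break
        results = await asyncio.gather(*(bounded_fetch(u) for u in frontier))
        next_frontier = []
        for links in results:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited


pages_found = asyncio.run(crawl(["a"], max_depth=2))
```

Each level waits for all of its fetches before starting the next, which is simple but not the fastest option; a queue with worker tasks would avoid that barrier.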

-4 votes

3 answers

246 views

How to identify whether or not 2 pieces of text are identical? [closed]

Let's say I were to create a scraper. At some point I'll need to come up with an algorithm for identifying whether or not a piece of newly scraped text matches one that's already in the DB. How would ...

Nicholas E. Harding

31

asked Jun 29, 2022 at 2:07
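
A common way to approach the exact-match case raised in this question is to store a hash of a normalized form of each text and compare hashes. The sketch below assumes that "identical" means equal after Unicode normalization, whitespace collapsing, and case folding, which is a guess at the intent. A Python set stands in for the database, and `fingerprint` and `is_new` are hypothetical names.

```python
import hashlib
import unicodedata


def fingerprint(text):
    """Hash a normalized form of the text (assumed normalization rules)."""
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and ignore case differences.
    normalized = " ".join(normalized.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_new(text, seen_fingerprints):
    """Return True and record the text if its fingerprint was not seen before."""
    digest = fingerprint(text)
    if digest in seen_fingerprints:
        return False
    seen_fingerprints.add(digest)
    return True


seen = set()  # stands in for an indexed fingerprint column in the DB
first = is_new("Hello   world", seen)
second = is_new("hello world\n", seen)
third = is_new("Hello there", seen)
```

This detects only texts that are equal after normalization; near-duplicates would need a similarity measure instead of a hash.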

-1 votes

1 answer

101 views

Is Web Scraping Viable When Information Needs to Be Formatted and Displayed

I'm working on a project where I want to display a list of wedding venues within X miles of a user's location. My first thought is that I will use some type of web scraper to pull in a list of venues. ...

tdammon

121

asked Aug 17, 2020 at 13:58

-2 votes

3 answers

92 views

Is there a secure way to ensure a data in an API endpoint of mine came from an Instagram endpoint?

Is there a way through encryption/keys/jwt or anything else to ensure that the data being sent through a POST request is only data coming from another request I made on the client to a 3rd party ...

David

219

asked May 5, 2020 at 10:51

-3 votes

1 answer

100 views

How to manage source data for movies and series?

I am trying to build a system that tells the user on which platforms (like Netflix, Prime, etc.) a movie or series is available. What is the best way to go about it? I have considered the following: ...

Jacob Antony

1

asked Apr 29, 2020 at 14:42

1 vote

1 answer

110 views

Scraper in separate repo from visualization component?

Let me explain my thoughts about the architecture of the project I'm working on. The project code repository consists of: Scrapy component - of course it serves to scrape data, process it and calculate ...

Bob

13

asked Apr 24, 2020 at 9:58

3 votes

1 answer

11k views

How to approach a large number of multiple, parallel HttpClient requests?

I have a website which offers pages in the format of https://www.example.com/X where X is a sequential, unique number increasing by one every time a page is created by the users and never reused even ...

nicktheone

41

asked Feb 23, 2020 at 17:33

0 votes

2 answers

271 views

Normalising pagination regardless of the target source

I have a service that fetches data from a target source (not through an API but via scraping) which can change. I want to do pagination so that I return 35 items per page but the target source is 25 ...

Alexander Hunt

349

asked Nov 29, 2019 at 22:04
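
The page-size mismatch in this question comes down to index arithmetic. The sketch below assumes the source serves 25 items per page (the excerpt is cut off at that point) and that source pages are numbered from 1. `source_pages_for`, `get_page`, and `fake_source` are hypothetical names used only for illustration.

```python
def source_pages_for(page, page_size=35, source_page_size=25):
    """Return the source pages covering an output page, plus the slice offset."""
    start = (page - 1) * page_size          # index of first wanted item
    end = start + page_size                 # one past the last wanted item
    first = start // source_page_size + 1
    last = (end - 1) // source_page_size + 1
    offset = start - (first - 1) * source_page_size
    return list(range(first, last + 1)), offset


def get_page(page, fetch_source_page, page_size=35, source_page_size=25):
    """Assemble one output page from however many source pages it spans."""
    pages, offset = source_pages_for(page, page_size, source_page_size)
    items = []
    for source_page in pages:
        items.extend(fetch_source_page(source_page))
    return items[offset:offset + page_size]


def fake_source(source_page, total=100, size=25):
    """Hypothetical source holding items 0..99, 25 per page."""
    return list(range((source_page - 1) * size, min(source_page * size, total)))


page_two = get_page(2, fake_source)
```

Because the sizes are parameters, the same arithmetic still applies if the target source changes its page size.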

1 vote

1 answer

563 views

Advice on designing a scraper DSL

I am creating a DSL for a scraping library I am writing. I would like advice on how to design a DSL, and if the designs I have below are good ones. Apologies if this is an open-ended question, but it ...

andykais

111

asked Sep 8, 2019 at 15:41

0 votes

1 answer

917 views

How to check if user is logged in after logging using http post?

I'm developing a scraper app to extract some information from a site. To get that information I have to be logged in to that site. So I use an HTTP POST and pass the data needed for login using FormData ...

alexpfx

313

asked Aug 3, 2019 at 20:44

0 votes

1 answer

197 views

How large-scale production web crawlers improve performance when it comes to running into already visited links

I have built a very basic web crawler running off my laptop, so it has limited memory and limited hard drive space. The way I have it now is I'm using MongoDB to store the links I find on pages. I make ...

Lokasa Mawati

131

asked May 8, 2019 at 8:48

0 votes

1 answer

114 views

Lightweight data mining + organization & visualization

I'm looking to do some simple data mining that consists of going once per day to a single page and collecting the following information: list of movie theaters; movies today at each theater; session times ...

dR_

11

asked Jan 22, 2019 at 11:45

2 votes

1 answer

255 views

Is it possible to layer an API (REST, GraphQL, etc.) in front of data that is currently only accessible via an enterprise desktop GUI?

Currently, my thoughts are that GET requests would be feasible by using the concept of screen scraping combined with a cron job that runs at a set interval to scrape data from the GUI and sync to my ...

J. Munson

137

asked Jan 4, 2019 at 1:23

0 votes

1 answer

189 views

Should I opt to not use someone's API?

At a job that I recently started, I inherited some of the projects from the guy who previously held this position. One of the projects was a program that used a Website Platform's public API to get ...

SH7890

277

asked Sep 12, 2018 at 20:39
