The robots.txt for all SE sites explicitly blocks GPTBot and Amazonbot.
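For reference, a per-agent block in robots.txt typically looks like the following. This is an illustrative sketch of the relevant directives, not a verbatim copy of the SE file:

```
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /
```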
GPTBot is OpenAI's web crawler, used both to train their large language models and to let their applications retrieve content from live web pages in response to user requests. Amazonbot is Amazon's web crawler, used to improve Amazon services (which may or may not include services such as Comprehend and CodeWhisperer) and to give Alexa access to live web pages.
I'm focusing this question on these two bots specifically. I was able to find information and discussions about voltron/008 and Bytespider: site owners and administrators have reported that those bots crawl excessively and strain resources. Yahoo Pipes (although long defunct) is addressed in an early Meta question. However, I am unable to find similar reports for GPTBot and Amazonbot, so I'd like to understand the rationale behind blocking these crawlers.
The file has also changed over time. For example, in the 27 March 2024 capture of Stack Overflow's robots.txt, GPTBot is no longer blocked, although it was still blocked as late as 26 March 2024. PerplexityBot was first blocked sometime around 12 March 2024.
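For anyone who wants to check a given copy of the file themselves, a small Python sketch like the one below works; the URL and user-agent strings are just examples, and you can swap in a Wayback Machine capture URL to inspect a historical version:

```python
from urllib.robotparser import RobotFileParser

# Parse Stack Overflow's current robots.txt (replace the URL with a
# Wayback Machine capture to check a historical version instead).
rp = RobotFileParser()
rp.set_url("https://stackoverflow.com/robots.txt")
rp.read()

# can_fetch() applies the parsed rules for the given user-agent to the given URL.
for bot in ("GPTBot", "Amazonbot", "PerplexityBot"):
    allowed = rp.can_fetch(bot, "https://stackoverflow.com/questions")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```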
What is the rationale for blocking these specific bots from crawling any SE pages? And, more generally, what is the process used to determine if a bot should be blocked from crawling SE sites?