The robots.txt for all SE sites explicitly blocks GPTBot and Amazonbot.
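For reference, a per-agent block in robots.txt typically looks like the following. This is an illustrative sketch of the relevant directives, not a verbatim copy of the SE file:

```
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /
```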
GPTBot is OpenAI's web crawler, used both to train their large language models and to let their applications retrieve content from live web pages in response to user requests. Amazonbot is Amazon's web crawler, used to improve Amazon services (which may or may not include services such as Comprehend and CodeWhisperer) and to give Alexa access to live web pages.
I'm focusing this question on these two bots specifically. I was able to find information and discussions about voltron/008 and Bytespider: site owners and administrators have reported that those bots crawl excessively and strain resources. Yahoo Pipes (although long defunct) is addressed in an early Meta question. However, I am unable to find similar reports for GPTBot and Amazonbot, so I'd like to understand the rationale behind blocking these crawlers.
The file has also changed over time. For example, in the 27 March 2024 capture of Stack Overflow's robots.txt, GPTBot is no longer blocked, although it was still blocked as late as 26 March 2024. PerplexityBot was first blocked sometime around 12 March 2024.
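For anyone who wants to check a given copy of the file themselves, a small Python sketch like the one below works; the URL and user-agent strings are just examples, and you can swap in a Wayback Machine capture URL to inspect a historical version:

```python
from urllib.robotparser import RobotFileParser

# Parse Stack Overflow's current robots.txt (replace the URL with a
# Wayback Machine capture to check a historical version instead).
rp = RobotFileParser()
rp.set_url("https://stackoverflow.com/robots.txt")
rp.read()

# can_fetch() applies the parsed rules for the given user-agent to the given URL.
for bot in ("GPTBot", "Amazonbot", "PerplexityBot"):
    allowed = rp.can_fetch(bot, "https://stackoverflow.com/questions")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```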
What is the rationale for blocking these specific bots from crawling any SE pages? And, more generally, what is the process used to determine if a bot should be blocked from crawling SE sites?