Thanks a lot for publishing this; it's much appreciated. While reading, I wondered why you didn't take different possible thresholds into account. GPT detectors are surely tunable to some extent, so this conclusion:

> This survey concluded that GPT detectors misclassify 32% (+/- 6%) of non-GPT posts on Stack Exchange sites as having been written by GPT.

only makes sense for one specific detector setting. Or was the threshold variation included in the +/- 6%?
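To illustrate the point, here is a minimal sketch in Python with entirely made-up detector scores (the `scores_human` distribution and the thresholds are my assumptions, not anything from the survey): the false-positive rate is not a fixed property of a detector but a function of the chosen threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical detector scores for posts known to be human-written
# (higher score = "more likely GPT"). Purely illustrative data.
scores_human = rng.beta(2, 5, size=10_000)

# The false-positive rate is the fraction of human posts flagged as GPT,
# and it depends entirely on where the decision threshold is set.
for threshold in (0.3, 0.4, 0.5, 0.6):
    fpr = np.mean(scores_human >= threshold)
    print(f"threshold={threshold:.1f}  false-positive rate={fpr:.1%}")
```

With synthetic scores like these, the printed false-positive rate drops as the threshold rises, which is exactly why a single 32% figure needs an accompanying threshold to be interpretable.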
The country-dependent suspension rate might not be a bias. Is the underlying assumption that, absent detector bias, users in all countries behave the same with regard to GPT? It would be good to state all such assumptions explicitly.
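One way to make such an assumption checkable is to attach uncertainty to the per-country rates before interpreting any gap as bias. A minimal sketch with entirely made-up counts (the countries and numbers below are placeholders, not data from the post):

```python
import math

# Hypothetical per-country counts (users suspended, users total);
# these figures are invented to illustrate the comparison only.
counts = {
    "Country A": (120, 4_000),
    "Country B": (45, 3_000),
}

# 95% normal-approximation confidence interval for each suspension rate.
for country, (suspended, total) in counts.items():
    rate = suspended / total
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / total)
    print(f"{country}: {rate:.2%} +/- {half_width:.2%}")
```

If the intervals overlap substantially, the apparent "bias" may just be sampling noise; if they don't, one still has to decide between detector bias and a genuine behavioral difference, which is precisely the assumption worth stating.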
The biggest mistake, however, seems to be not asking for feedback before making the decision. Just imagine you had presented the results and discussed them with the mods before deciding. There might simply be a discussion about false-positive rates now, and the strike might never have happened. If anything, I think this should be the take-home message: getting feedback before acting reduces risk and is often enormously helpful.