More than a year ago, during the initial discussions about the new per-site data dump system, I asked if I could get a full copy of the data dumps. I followed up during the public release, and on Jul 12th I was told this was possible.

For reference, this is the conversation that resulted in me making the request.

A screenshot of a section of the comments linked above, where Philippe, the VP of Community, says “If you request the backup for that purpose, I'll see that we get it to you.” in response to “the simplest way would be to have someone request a full copy, for the explicit goal of having a backup of the entire database”

I lodged a ticket on August 8th 2024 (134427). I've been waiting for a rather long time, but so far I've had no response beyond "we're checking".

Now, my intention was to help keep a trusted archive copy of these data dumps, using my own resources. I'd agreed not to use it to train LLMs, and basically went through the process Stack Exchange had suggested. I gave very early notice that I was going to make the request, and practically, if they'd said no, I'd probably be thinking of alternative ways to do this.

I'd also reminded the staff involved multiple times over the past year, though at one point I gave up on getting an answer and asked them to let me know when it was sorted. We're a full year of dumps on with no real answer. Practically, our best option ended up being public third-party archives online, which ironically was what the company was trying to prevent.

Now, if I didn't want to go through the process the company suggested - in theory, works for downloading, or I can go after the fact and download it from the Internet Archive. However, as with many things, I was hoping the company would give a straight answer and not make promises it didn't keep.

As such, I'd like to ask: was a process for obtaining a full/complete data dump ever planned, and a year on, are there any plans to actually make these available for legitimate uses?

I believe I've been incredibly patient over the issue, especially considering one of my main goals was to show that the company would accede to reasonable requests. So, are there any plans to actually make the full data dump available with reasonable cause, or did the company mislead us? If the process I followed was incorrect, can someone advise me what the company expects of someone requesting a full copy of the data dump for legitimate reasons, and whether I've missed any required criteria?

  • 17
    The original already promised this: "You may download the dumps, free of charge, as you always have, just in a different and we hope more convenient location. The CC BY-SA license is unchanged." (emphasis mine). I read this as promising that the way we consume the data dump won't change. As in, we could download the entire thing, and the plan was to give us the entire thing. I, too, am disappointed that once again we were lied to. Commented Aug 22 at 12:44
  • 9
    A global access point also resolves the issue of needing a profile on every account to download the data dump. Previously, the official download location was the Internet Archive. Now, it's per-site, so you need an SE account on each site you want to download from, which can leak PII to mods. There are some controls in place about mods accessing PII, but it's still a change to the original ways. It seems like a global download location that lets you download multiple (or all) sites would also resolve this privacy concern. Commented Aug 22 at 12:48
  • 6
    Thanks for not giving up and following up on that topic. A pity (and a shame) that the company cannot deliver on this. Commented Aug 22 at 12:55
  • 2
    @ThomasOwens practically, there's very little stopping someone from using a spare/throwaway account and a VPN to download everything. I'd say the feature request was made - it's trusting SE to keep its word that's the primary/core issue here. Commented Aug 22 at 13:02
  • 8
    @JourneymanGeek Yes, but that's still not, as VLAZ says, offering the ability to download the dumps as I always have. I don't have a profile on every site nor do I have a one-stop-shop to obtain what I want. On top of that, privacy-preserving mechanisms like throwaway accounts and VPNs may not come without a cost (either monetary costs or the time to make sure the use is effective to not leak data). Commented Aug 22 at 13:09
  • 8
    @NoDataDumpNoContribution There's a reasonable chance we'll either not get an answer, or that the effort isn't worth the payoff, since there are no other requests for the full dumps. My trust in the company to do the right thing without pressure is very low at the moment. And even if I do get it, we had to chase them to do this. Commented Aug 22 at 14:36
  • 3
    @JourneymanGeek Actually, making accounts on a VPN appears to be blocked now, at least on my VPN provider. I checked a few months ago for science reasons. Commented Aug 22 at 18:19
  • 32
    @ThomasOwens A global access point does not resolve how flaky the downloads are. It took me around 6 or 7 attempts to download the 2025-06-30 v1 SO data dump, where the majority of the failures were Cloudflare interruption-related. The majority of the rest of the data dump failures were also caused by various Cloudflare failure modes. I wasted around 3-4 complete SO data dumps' worth of download (approximately 240GB) because Cloudflare was flaky and the downloads aren't resumable. This would not have been an issue with the torrents. Commented Aug 22 at 18:24
  • 34
    The reason I bring this up is that this intentional design of the new data dump process automatically excludes anyone who might actually need the data dump due to locally bad internet. 60+GB is a lot to download already, and more than tripling the combined download even on stable internet connections is horrible UX. Anyone on an unstable internet connection functionally has to wait for the unofficial torrents, because the bigger dumps cannot be downloaded on slow and unstable connections due to the lack of resumable downloads. Commented Aug 22 at 18:27
  • It feels like it's time to escalate this and increase the visibility outside of Meta @Zoe-Savethedatadump The company is just going to keep ignoring the post. Commented Nov 3 at 14:52
  • 1
    Not sure who outside the community would care - or how we'd escalate this. Commented Nov 3 at 15:58
  • 1
    @JourneymanGeek Who covered the moderator strike? I think SE breaking promises and the AI tie-in might merit a story somewhere about attempting to silo knowledge in proprietary LLM models. I doubt the company will ever respond unless lawyers get involved though, so it's probably not worth the effort. I got asked the other day by another user if I would consider asking to be reinstated as mod. Nope. Not another minute of my time gets donated to the company. It's completely reasonable for SE to prioritize profit but don't expect me to be a chump giving my time to curate data for a closed AI. Commented Nov 3 at 16:58
  • 2
    @ColleenV I have many other things I'd rather spend my time on. This will be the last bounty though, I don't want to drop under 10k. Commented Nov 3 at 18:14
  • 2
    That would be The Register. But personally, since my initial goal was a trust-building opportunity as much as getting the dumps, it wouldn't quite fit my agenda. If I just wanted the dumps, there are 'easier' ways. My problem is really about SE making commitments they won't keep - and forcing them to keep those commitments through bad publicity means they're not really sincere even if they do. Commented Nov 4 at 0:33
