
My Internet connection has been unreliable for some weeks, so I'd like to host something like a Stack Overflow data dump offline before the connection stops working entirely.

I don't think I'll have time to code something myself, so I've tried this, which imports the smaller Stack Exchange sites quickly.

But that's not the case for Stack Overflow, where it becomes impractical (it needs too much time and too many resources) because of the sheer number of posts, comments, and so on.

My questions are:

  1. What use cases is the Stack Overflow data dump intended for, if it's almost unmanageable?
  2. Is there an existing solution or alternative for what I want?
  • Maybe better ask this question at Meta SO. Commented May 25, 2020 at 20:36
  • @πάνταῥεῖ nah, it's not. I'm basically asking for a solution for offline "hosting" of any Stack Exchange dump, so I'm talking about SE in general too. I don't think it's practical to run Stackdump alongside another hypothetical solution for the SO dump. Commented May 25, 2020 at 21:33
  • What is the actual size these days? (E.g. only the posts without the revision history.) 50 GB compressed? Commented May 27, 2020 at 11:52
  • @P.Mort.-forgotClayShirky_q all SO dumps together are about 50 GB: i.sstatic.net/y9K0X.png Commented May 27, 2020 at 12:37
  • It is all convoluted. To start a few steps ahead, Where are the Stack Exchange data dumps? and then https://archive.org/details/stackexchange. Only the BitTorrent option seems viable ("total size of requested files (63 GB) is too large for zip-on-the-fly"). Commented May 31, 2020 at 18:30
  • PCMag recommends qBittorrent (open-source, for FreeBSD, Linux, macOS, and Windows). Commented May 31, 2020 at 18:35
  • It works! - (at least on Ubuntu 18.04 (Bionic Beaver)): install with sudo add-apt-repository ppa:qbittorrent-team/qbittorrent-stable, then sudo apt-get update && sudo apt-get install qbittorrent Commented May 31, 2020 at 19:46
  • Though there seems to be no way to select another subset for download. The workaround is to delete the line with the (completed) transfer and start over. Commented May 31, 2020 at 20:56

1 Answer

  1. Universities spring to mind, as they like to tap into that huge data set for all kinds of purposes: quality of texts, interactions, social effects, artificial intelligence, you name it.

    Those kinds of use cases don't mind the size. The more data they have, the more reliable their models can get, and they have plenty of time. Note that the Stack Overflow dump is already split into several files (Posts, Comments, PostHistory, and Votes), so you don't have to load all the data if you're only interested in, say, Comments.

  2. What you can do is create a couple of SEDE queries that fetch just the data you're interested in. For example, you can download only the posts from the Haskell tag in CSV format and import that into your data store of choice. Keep in mind that a result set can be at most 50,000 rows, so you might need some trickery/filtering in several steps if you need more data. An additional benefit is that the SEDE database is refreshed weekly, so you get more recent content than the data dump, which is still on a quarterly refresh scheme.
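The "trickery" for getting past the 50,000-row cap usually amounts to paging the same SEDE query by Id and exporting one CSV per page. The pages then need to be stitched back together, deduplicating on Id in case the ranges overlap. A minimal sketch of that merge step (the file contents here are made up; a real run would use the CSVs exported from SEDE):

```python
import csv
import io

def merge_sede_pages(pages):
    """Merge several SEDE CSV exports (each capped at 50,000 rows),
    deduplicating on the Id column in case adjacent pages overlap."""
    seen, merged = set(), []
    for page in pages:
        for row in csv.DictReader(page):
            if row["Id"] not in seen:
                seen.add(row["Id"])
                merged.append(row)
    return merged

# two hypothetical exports whose Id ranges overlap at the boundary
page1 = io.StringIO("Id,Title\n1,First\n2,Second\n")
page2 = io.StringIO("Id,Title\n2,Second\n3,Third\n")
rows = merge_sede_pages([page1, page2])
print([r["Id"] for r in rows])  # → ['1', '2', '3']
```

The same pattern works with files on disk: open each export with `open(path, newline="")` and pass the handles in a list.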
  • Thanks for your reply. Could you show how you practically do this "trickery/filtering" you're talking about with SEDE? Commented May 26, 2020 at 22:34
  • @SOL0v3r the 50,000 rows is linked to an example of doing that trickery, but for your convenience here is the link again: meta.stackexchange.com/a/234774/158100 — there are other answers on that Q&A that offer other options. Commented May 27, 2020 at 5:56
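Since the answer notes that the dump is split into per-table XML files, one way to keep the full Stack Overflow dump manageable is to stream-filter a single file (e.g. Posts.xml) instead of importing everything. A minimal sketch, assuming the dump's `<row .../>` format with `Id`, `PostTypeId`, `Tags`, and `Title` attributes (the sample data below is made up):

```python
import io
import xml.etree.ElementTree as ET

def filter_posts(xml_file, tag):
    """Stream a Stack Exchange Posts.xml dump and yield question rows
    carrying the given tag, without loading the whole file into memory."""
    # iterparse emits elements as they close, so even a very large
    # Posts.xml can be scanned with roughly constant memory.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            # PostTypeId 1 = question; tags are stored like "<haskell><monads>"
            if elem.get("PostTypeId") == "1" and f"<{tag}>" in (elem.get("Tags") or ""):
                yield {"Id": elem.get("Id"), "Title": elem.get("Title")}
            elem.clear()  # free the parsed element to keep memory bounded

# tiny in-memory sample in the dump's row format (hypothetical data)
sample = io.BytesIO(
    b'<?xml version="1.0"?><posts>'
    b'<row Id="1" PostTypeId="1" Tags="&lt;haskell&gt;" Title="Monads?" />'
    b'<row Id="2" PostTypeId="1" Tags="&lt;python&gt;" Title="GIL?" />'
    b'<row Id="3" PostTypeId="2" ParentId="1" />'
    b'</posts>'
)
print(list(filter_posts(sample, "haskell")))  # → [{'Id': '1', 'Title': 'Monads?'}]
```

For a real dump you would pass an open file handle (or decompress the 7z archive first); the filtered rows can then go into whatever local store you prefer.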
