
My Internet connection has been unreliable for some weeks, so I'd like to host something like a Stack Overflow data dump offline before the connection stops working entirely.

I don't think I'll have time to code something myself, so I've tried this, which imports the smaller Stack Exchange sites quickly.

But that's not the case for Stack Overflow, where it becomes impractical (it needs too much time and too many resources) because of the sheer number of posts, comments, and so on.

My questions are:

  1. What use cases is the Stack Overflow data dump intended for, if it's almost unmanageable?
  2. Is there an existing solution or alternative for what I want?
  • Maybe better ask this question at Meta SO. Commented May 25, 2020 at 20:36
  • @πάνταῥεῖ nah, it's not. I'm basically asking for a solution for offline "hosting" of any Stack Exchange dump, so I'm talking about SE in general too. I don't think it's practical to run Stackdump alongside another hypothetical solution for the SO dump. Commented May 25, 2020 at 21:33
  • What is the actual size these days? (E.g. only the posts without the revision history.) 50 GB compressed? Commented May 27, 2020 at 11:52
  • @P.Mort.-forgotClayShirky_q all SO dumps together are about 50 GB: i.sstatic.net/y9K0X.png Commented May 27, 2020 at 12:37
  • It is all convoluted. To start a few steps ahead, Where are the Stack Exchange data dumps? and then https://archive.org/details/stackexchange. Only the BitTorrent option seems viable ("total size of requested files (63 GB) is too large for zip-on-the-fly"). Commented May 31, 2020 at 18:30
  • PCMag recommends qBittorrent (open-source, for FreeBSD, Linux, macOS, and Windows). Commented May 31, 2020 at 18:35
  • It works! - (at least on Ubuntu 18.04 (Bionic Beaver)): install with sudo add-apt-repository ppa:qbittorrent-team/qbittorrent-stable, then sudo apt-get update && sudo apt-get install qbittorrent Commented May 31, 2020 at 19:46
  • Though there seems to be no way to select another subset for download. The workaround is to delete the line with the (completed) transfer and start over. Commented May 31, 2020 at 20:56

1 Answer

  1. Universities spring to mind, as they like to tap into that huge data set for all kinds of purposes: quality of texts, interactions, social effects, artificial intelligence, you name it.

    Those kinds of use cases don't mind the size. The more data they have, the more reliable their models can get, and they have plenty of time. Note that the Stack Overflow dump is already split into several files (Posts, Comments, PostHistory, and Votes), so you don't have to load all the data if you're only interested in, say, Comments.

  2. What you can do is create a couple of SEDE queries that fetch just the data you're interested in. For example, you can download only the posts from the Haskell tag in CSV format and import that into your data store of choice. Keep in mind that a result set can be at most 50,000 rows, so you might need some trickery/filtering in several steps if you need more data. An additional benefit is that the SEDE database is refreshed weekly, so you get more recent content than the data dump, which is still on a quarterly refresh scheme.
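The "trickery" for getting past the 50,000-row cap usually amounts to paging the same SEDE query by Id and exporting one CSV per page. The pages then need to be stitched back together, deduplicating on Id in case the ranges overlap. A minimal sketch of that merge step (the file contents here are made up; a real run would use the CSVs exported from SEDE):

```python
import csv
import io

def merge_sede_pages(pages):
    """Merge several SEDE CSV exports (each capped at 50,000 rows),
    deduplicating on the Id column in case adjacent pages overlap."""
    seen, merged = set(), []
    for page in pages:
        for row in csv.DictReader(page):
            if row["Id"] not in seen:
                seen.add(row["Id"])
                merged.append(row)
    return merged

# two hypothetical exports whose Id ranges overlap at the boundary
page1 = io.StringIO("Id,Title\n1,First\n2,Second\n")
page2 = io.StringIO("Id,Title\n2,Second\n3,Third\n")
rows = merge_sede_pages([page1, page2])
print([r["Id"] for r in rows])  # → ['1', '2', '3']
```

The same pattern works with files on disk: open each export with `open(path, newline="")` and pass the handles in a list.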
  • Thanks for your reply. Could you show how you practically do this "trickery/filtering" you're talking about with SEDE? Commented May 26, 2020 at 22:34
  • @SOL0v3r the 50,000 rows is linked to an example of doing that trickery, but for your convenience here is the link again: meta.stackexchange.com/a/234774/158100 — there are other answers on that Q&A that offer other options. Commented May 27, 2020 at 5:56
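Since the answer notes that the dump is split into per-table XML files, one way to keep the full Stack Overflow dump manageable is to stream-filter a single file (e.g. Posts.xml) instead of importing everything. A minimal sketch, assuming the dump's `<row .../>` format with `Id`, `PostTypeId`, `Tags`, and `Title` attributes (the sample data below is made up):

```python
import io
import xml.etree.ElementTree as ET

def filter_posts(xml_file, tag):
    """Stream a Stack Exchange Posts.xml dump and yield question rows
    carrying the given tag, without loading the whole file into memory."""
    # iterparse emits elements as they close, so even a very large
    # Posts.xml can be scanned with roughly constant memory.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            # PostTypeId 1 = question; tags are stored like "<haskell><monads>"
            if elem.get("PostTypeId") == "1" and f"<{tag}>" in (elem.get("Tags") or ""):
                yield {"Id": elem.get("Id"), "Title": elem.get("Title")}
            elem.clear()  # free the parsed element to keep memory bounded

# tiny in-memory sample in the dump's row format (hypothetical data)
sample = io.BytesIO(
    b'<?xml version="1.0"?><posts>'
    b'<row Id="1" PostTypeId="1" Tags="&lt;haskell&gt;" Title="Monads?" />'
    b'<row Id="2" PostTypeId="1" Tags="&lt;python&gt;" Title="GIL?" />'
    b'<row Id="3" PostTypeId="2" ParentId="1" />'
    b'</posts>'
)
print(list(filter_posts(sample, "haskell")))  # → [{'Id': '1', 'Title': 'Monads?'}]
```

For a real dump you would pass an open file handle (or decompress the 7z archive first); the filtered rows can then go into whatever local store you prefer.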
