Choosing best resource for crawling data from Stack Exchange sites

Question

For an academic research project, I would like to obtain detailed data on questions, answers, tags, users, etc. That is, I am looking for historical data that is as detailed as possible. I have seen that there are three resources, as listed in this answer: the API, the data dump, and the Stack Exchange Data Explorer (SEDE).

From what I understand, the API is more suitable for obtaining live data. Looking at the two other alternatives, the dump and SEDE, it is not clear which one is more suitable. With the dump, one simply downloads zipped XML files, whereas with SEDE one can send customized queries.

Is it the case that the dump includes everything that can be obtained through SEDE? Or does SEDE provide richer data in some sense? Can someone explain the differences between these two and advise which one is more suitable for my purpose?

rene · Accepted Answer · 2017-09-22 13:41:59Z

It depends a bit on which data you need.

The Stack Exchange API does have up-to-date data, but it isn't well suited to downloading large amounts of data. It is throttled to a maximum of 10,000 calls per day, and it allows only a limited number of calls within a given timeframe: expect bursts of at most 60 calls per 90 seconds to be allowed, but the permitted rate may drop quickly if you do this repeatedly, ending in a full IP ban if you don't respect the backoff field in the responses.
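To illustrate what respecting the throttle might look like, here is a minimal sketch in Python. It assumes the response wrapper fields `items`, `has_more`, `backoff`, `quota_remaining` and `error_id`; the function names `iter_items` and `fetch_questions_page`, the chosen endpoint and the query parameters are only illustrative, so check them against the API documentation before relying on them.

```python
import gzip
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int], dict],
               sleep: Callable[[float], None] = time.sleep,
               max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page, pausing whenever a response asks for it."""
    page = 1
    while page <= max_pages:
        response = fetch_page(page)
        if "error_id" in response:
            # Stop immediately instead of retrying against a throttled API.
            raise RuntimeError(response.get("error_message", "API error"))
        yield from response.get("items", [])
        # Stop when there are no more pages or the daily quota is used up.
        if not response.get("has_more") or response.get("quota_remaining", 1) <= 0:
            break
        # `backoff` (when present) is the number of seconds to wait
        # before calling the same method again.
        backoff = response.get("backoff")
        if backoff:
            sleep(backoff)
        page += 1


def fetch_questions_page(page: int, site: str = "stackoverflow") -> dict:
    """Hypothetical page fetcher; endpoint and parameters are assumptions."""
    query = urllib.parse.urlencode(
        {"site": site, "page": page, "pagesize": 100,
         "order": "desc", "sort": "creation"})
    url = f"https://api.stackexchange.com/2.3/questions?{query}"
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
        # Responses are normally compressed; decompress when flagged as gzip.
        if resp.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw)
```

Something like `for question in iter_items(fetch_questions_page): ...` would then walk the pages while waiting out every requested pause. Even so, the daily quota makes this impractical for a full historical download.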

The data dump is refreshed quarterly and consists of a dump of certain tables in XML format. You can find the available tables and columns in the Read Me. The only limits here are your own network and storage capacity. You have to provide your own database.
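As a rough sketch of what "provide your own database" can mean in practice, the following Python streams a dump file into SQLite. It assumes the usual dump layout of one `<row .../>` element per record with the columns as attributes; the function names, the SQLite table and the three columns picked here are only illustrative choices, so consult the Read Me for the full column list.

```python
import sqlite3
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield each <row> element's attributes as a dict, streaming the file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            root.clear()  # drop processed rows so memory use stays flat


def load_posts(source, conn):
    """Load a few Posts columns into a (hypothetical) SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, post_type_id INTEGER, score INTEGER)")
    rows = ((int(r["Id"]), int(r["PostTypeId"]), int(r.get("Score", 0)))
            for r in iter_rows(source))
    conn.executemany("INSERT OR REPLACE INTO posts VALUES (?, ?, ?)", rows)
    conn.commit()
```

Usage would be along the lines of `load_posts("Posts.xml", sqlite3.connect("dump.db"))`. Streaming matters because the files for the larger sites are far too big to parse into memory in one go.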

The Stack Exchange Data Explorer (SEDE) is refreshed weekly, on Sunday, and has the most extensive dataset available. You can find a description of its content in "Database schema documentation for the public data dump and SEDE".
You'll notice that some tables match the data dump, but on top of that it offers a few more tables and columns. SEDE lets you run queries directly on the SQL Server instance, so you can shape and filter data on the server to get result sets of a reasonable size. You are limited to a maximum of 50,000 rows, and your queries need to run to completion within 2 minutes.
SEDE has an option to export the results to CSV, but it doesn't come with a convenient API for this. You'll need to click the download link yourself.
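Because of the row limit, a larger result usually has to be exported in several chunks (for example, one query per Id range) and stitched together afterwards. The following Python is a minimal sketch of that stitching step; the function name `merge_exports` and the assumption that every export has an `Id` column are mine, not something SEDE prescribes.

```python
import csv


def merge_exports(files, key="Id"):
    """Merge several exported CSV chunks, keeping the first row seen per key."""
    seen = set()
    merged = []
    for handle in files:
        for row in csv.DictReader(handle):
            if row[key] not in seen:  # skip rows duplicated across chunks
                seen.add(row[key])
                merged.append(row)
    return merged
```

The function takes open file objects, so each downloaded file would be opened first, e.g. with `open(path, newline="", encoding="utf-8-sig")`; that encoding is a precaution that also tolerates a byte order mark if the export contains one.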

Keep in mind that although some history is recorded and available, for example in the PostHistory table for Posts, you can't reliably reconstruct reputation history for users. The reputation events are not public.

Thank you @rene. So just to be clear, none of the three methods can reconstruct reputation history for users, correct?
– splinter, Commented Sep 22, 2017 at 14:44
@splinter that is correct. You can get a rough estimate by combining several pieces of data, but you'll miss, for example, the downvotes a user cast on answers. See meta.stackexchange.com/questions/144716/… and meta.stackexchange.com/questions/280569/…
– rene Mod, Commented Sep 22, 2017 at 15:20
