-27

Sometimes we find a post on Stack Overflow that explains the very thing we are trying to achieve in our code. When we commit the respective code into our VCS, we do not write another lengthy explanation, but instead just add the URL of the Stack Overflow question or answer, available under its "share" link.

In the past, this has seemed practical, specifically since SO explains in its legal terms that we give up all our rights to our writings here, in order for SO to protect the documentation.

But when reading some old commit logs, we now find that, ten years later, these links to SO return a "page not found" reply or the message "removed for reasons of moderation". The topic that concerns us is specifically SQL query semantics.

Sadly, we learn about such consequences only when it is too late. Now, about solving the problem:

  • getting back the contents may or may not be possible, as internet archives may or may not have archived them. Are there any other services that document what gets deleted over time by Stack Overflow?
  • then fixing a dead link in a ten-year-old VCS appears to be a difficult task. What are the best strategies to achieve that?
  • and finally, how can we in the future effectively warn people to protect themselves from such dangers, for example by not using Stack Overflow, specifically for documentation purposes?

We are aware that this issue is part of a larger ongoing phenomenon where ever more of our undertakings are moved to the online world, while at the same time, in the online world, practically everything that doesn't bring in big advertisement revenues gets deleted. Sadly, this is inherently incompatible with professional work.

33
  • 22
    I'm confused how we got from "removed for reasons of moderation" (which I guess is the text for any deleted content), to harassment, stolen content or political correctness. I'd have to see the specific post you encountered (users > 10k are able to access deleted content), but chances are very high that they were off-topic or closed for some other reason and were automatically deleted after some time. No content has ever been deleted on SO because "it doesn't bring big advertisement revenues". Commented Oct 30 at 13:51
  • 27
    If your way to preserve important information is by saving links to websites, you are not doing it right. Commented Oct 30 at 14:02
  • 12
    But you don't have any contract or anything with SO that states they will preserve any and all content posted here. As I said, I can't say anything about your specific case, but it would be rather strange if off-topic stuff or spam would be left online. Commented Oct 30 at 14:03
  • 12
    Storing a link to any website in commit logs without any further explanation doesn't sound like a good way to preserve information anyway. Even if the content is still hosted, there's usually no guarantee that the links don't change (microsoft docs have broken their links several times over the last decade). Commented Oct 30 at 14:05
  • 15
    This sounds like a mistaken assumption made by your organization that the content on SO is guaranteed to be preserved indefinitely. SO has never made that guarantee, so it sounds like there was some misunderstanding at some point. I suggest cutting your losses; track down all the SO links in your repo/commitlogs/etc, and replace them, if possible, with content that you can guarantee will be available in the future. Commented Oct 30 at 14:24
  • 10
    Without providing a link, I'm not sure what you aim to achieve with this post? Do you want moderation to just stop, because that obviously cannot happen... Commented Oct 30 at 14:26
  • 14
    Also, when we built the internet, we were taught not to treat ephemeral references, such as web pages, as permanent. I wonder when that changed Commented Oct 30 at 14:37
  • 6
@PMc: I think you still don't understand when content is deleted on SO. Only a very tiny fraction of deletions are done by the company. A larger amount is done by moderators according to the community guidelines (spam, offensive, ...). Then there are regular users with more than 10k reputation who can vote to delete (3+ votes needed to actually delete). And then there are automatic processes that delete posts under certain conditions, as described in The Community user deleted my question! What gives?. Commented Oct 30 at 14:39
  • 14
    "I am mainly concerned about the fact THAT it disappeared, and the consequential remedies necessary to avoid such in the future." OK, very easy to solve then - don't rely entirely on references to information. At least summarise it and add the reference for as further (but non-essential) reading. Commented Oct 30 at 14:41
  • 17
    Old questions getting deleted being equated to a credit card data breach is a new one here that's for sure Commented Oct 30 at 14:42
  • 7
    That's an unanswered question, by yourself. It was automatically deleted. cln.sh/pk8TzrxlG4mBqWH355g0 You should be able to find it in your deleted posts, in your own profile. If you really had the link, then you shouldn't get a 404, since you own the post. So either you weren't logged in, or you just posted the wrong URL. Commented Oct 30 at 15:08
  • 9
    I thought all users could see their deleted posts? Is all of this just because you forgot to login?? Commented Oct 30 at 15:10
  • 18
You have a number of assertions, most or all of which are based on faulty premises about Stack Overflow and the internet in general. Yes, it's annoying when links go dead. No, SO never promised to keep your content around forever; just the opposite. Our promise is to provide content curated for enduring value, which means deleting some content that doesn't meet that criterion. Commented Oct 30 at 15:34
  • 6
    I thought Wikipedia has a very similar deletion model to Stack Overflow. en.wikipedia.org/wiki/Wikipedia:Viewing_deleted_content Isn't this correct? Commented Oct 30 at 15:50
  • 17
    This can all be true of every single link you ever put in your code/comments/source control, and certainly isn't isolated to the subset of Stack Overflow posts that eventually get deleted. And while most of those will still be accessible to 10K users, there's also the Wayback Machine for any post up long enough to be archived there. In fact using that link might be better because it preserves the post at that time. If you want to magically preserve the content found on someone else's servers that you can't control, your best bet is to save off an HTML doc. Commented Oct 30 at 16:29

4 Answers

24

URIs aren't permanent. They have never been reliable because content on the internet goes missing all the time. Stack Overflow is one of the best sites in this regard, as the content behind the links is rarely removed and the links haven't changed their structure in 15+ years.

To protect yourself from losing the information behind a link, employ a policy similar to what Stack Overflow does: the relevant content must be available on the site even if the link goes offline. The link only serves as a reference to the original material. Link-only answers are unacceptable.

17

It's not unreasonable to link to a source when you include code from another source, but this is the internet: things come and go, and any important information from the source that you need to preserve should be documented as well, in the event the link goes away. That's the same reason we have a policy against link-only answers. Links are welcome, even required when taking information from elsewhere, but they must be accompanied by the relevant information.

Deletion on SO happens through four processes: Roomba, spam/moderator flags, self-deletion, and votes from 10k+ rep users. Roomba, spam/moderator flags, and self-deletion account for the overwhelming majority, primarily because no real consensus is necessary in most of those cases. Deletion by 10k+ users, on the other hand, is exceedingly rare: not only must the post be eligible for deletion, but for every 10 upvotes it has, an extra delete vote is required, all the way up to 10 votes. On top of that, active 10k+ users aren't common, and it's very unlikely they'll ever end up on the same question without some kind of coordination, such as review queues or a chat room.
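As a rough illustration of that vote arithmetic (a base of three delete votes, as mentioned in the comments, plus one extra per ten upvotes, capped at ten; this sketches the scheme as described here, not an official formula):

```python
def delete_votes_needed(upvotes: int) -> int:
    """Delete votes required under the scheme described above:
    a base of three votes, plus one extra per ten upvotes the
    post has, capped at ten votes in total. Illustrative only."""
    return min(3 + upvotes // 10, 10)

# delete_votes_needed(0)   -> 3   (the base case)
# delete_votes_needed(25)  -> 5   (two full blocks of ten upvotes)
# delete_votes_needed(200) -> 10  (capped)
```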

In this case, the post was deleted by Roomba within a year of being published, because it had a score of 0, only one comment, and no answers. In other words, it was an abandoned question.

11

When we commit the respective code into our VCS, we do not write another lengthy explanation, but instead just add the URL of the Stack Overflow question or answer, available under its "share" link.

If you have an interest in preserving source information, this is a mistake, or rather a risk that you are taking. If information that you didn't copy into the codebase itself is truly that important, you should at least provide links to multiple sources, or preferably back them up yourselves (e.g., link to a Stack Overflow page and an Archive.org version of said page, made at the time you used/copied the code).
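For the "back them up yourselves" option, here is a minimal sketch that asks the Wayback Machine to capture a page at the moment you copy code from it. It assumes the public "Save Page Now" endpoint (https://web.archive.org/save/ followed by the page URL), which is rate-limited and best-effort, so treat failures as non-fatal:

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request_url(url: str) -> str:
    """The Wayback Machine 'Save Page Now' URL for a given page."""
    return SAVE_ENDPOINT + url

def archive_now(url: str, timeout: float = 60) -> str:
    """Best-effort: request a fresh capture of `url` and return the
    final (snapshot) URL the service redirects to."""
    req = urllib.request.Request(
        save_request_url(url),
        headers={"User-Agent": "link-archiver-sketch"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.url
```

Storing the returned snapshot URL next to the original link in the commit message means the reference survives even if the live page disappears.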

In the past, this has seemed practical, specifically since SO explains in its legal terms that we give up all our rights to our writings here, in order for SO to protect the documentation.

Not sure exactly what this is trying to say, but it's almost certainly an inaccurate representation. When you post on Stack Overflow, you don't "give up all your rights" to your content; you license it to Stack Overflow for them to use however they like, more or less, under the Creative Commons license or the MIT license (IIRC there's a slightly different license for prose vs. code, or at least I recall reading about that once over on an MSE Q&A...). You still retain your own rights and can license the content under a more permissive license than the one you granted Stack Overflow, if you wish.

But when reading some old commit logs, we now find that, ten years later, these links to SO return a "page not found" reply or the message "removed for reasons of moderation".

Yep, that's unfortunate. But, hopefully this is your wake-up call to stop relying on this unreliable method for documentation. Unless you are paying Stack Overflow directly, you have no guarantee or recourse to follow to ensure information they are providing for you will remain available. See my response above for better methods to follow from here on out.

Sadly, we learn about such consequences only when it is too late. Now, about solving the problem:

  • getting back the contents may or may not be possible, as internet archives may or may not have archived them. Are there any other services that document what gets deleted over time by Stack Overflow?
  • then fixing a dead link in a ten-year-old VCS appears to be a difficult task. What are the best strategies to achieve that?

Well, no, not really. Archive.org versions may well exist of a given page, especially if it is/was popular. If an entire question is deleted, it was probably off-topic here, so if it was popular it probably already exists on a proper sister site on the network like Server Fault or Database Administrators (or it was not really of any value for things like a codebase).

However, anyone can see deleted content if they have 10,000+ reputation on a site. So... get to work! Alternatively, you can download the Stack Overflow data dumps which contain all information, including deleted content as far as I know, on the site. If they don't include deleted content, you can download a few different data dumps from just before the time when the link was added to your source control, and query the information from the data dump.

  • and finally, how can we in the future effectively warn people to protect themselves from such dangers, for example by not using Stack Overflow, specifically for documentation purposes?

I mean, this very Q&A is a good warning, I think. But, there's no need to warn people not to use Stack Overflow, because that wasn't your problem; your problem was thinking that all content on Stack Overflow would remain visible to everyone, forever. That's never been the case, and Stack Overflow has never made that promise anywhere. So really you can 'warn' people by encouraging them to read the site rules and Help Center before they start to use or rely on the site. That way, they don't experience the shock you did.

As for the one link you did provide (in the comments), https://stackoverflow.com/questions/49366079/postgresql-howto-avoid-multiple-execution-of-subquery-containing-stable-functio, that is an odd thing to write this post about. Not only is it a question that doesn't have any answers or comments (so what's the point of including it in your code base or source control?), but it's also your own question, which you can still see, as long as you are logged in, and you can access it from your questions list on your profile: https://stackoverflow.com/users/6201427/pmc?tab=questions (at the bottom) or directly via https://stackoverflow.com/users/deleted-questions/6201427 (again only while you are logged in, since it is your own question).

One thing you can also do is take a full-page screenshot of any content you want to link to, save that as an image, and then upload that image to your source control environment as well. This way you see exactly what was visible, in the same state, when the resource was originally preserved, no matter how far into the future you might need to look at it.

2
  • 2
    re screenshot: or use a browser print / save page feature :P Commented Oct 30 at 20:45
  • @starball Sure, that is more complete and allows interaction, but involves a lot more files to download and manage, and usually there's no need to interact with the page, just see the contents. A team's needs may vary of course, if they need searchable/screen-readable text, etc vs just an image. Commented Oct 30 at 21:27
4

Are there any other services that document what gets deleted over time by Stack Overflow?

There are the data dumps. The differences between consecutive dumps would tell you what was deleted or added, and the dumps themselves should recover any missing information for you.

..fixing a dead link in a ten-year-old VCS appears to be a difficult task. What are the best strategies to achieve that?

For most if not all version control systems one can rewrite history, but that is very tedious; I wouldn't do it. Instead, I would go through each link to Stack Overflow in the current state of the software and either update the link, add the missing information, or, if the link isn't available anymore and isn't archived anywhere, remove it (a truly dead link has no value).
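Going through each link can be partly automated. A sketch that extracts Stack Overflow links from source text so they can then be checked by hand or with an HTTP client; the regex is an assumption about the common link shapes, including the short /q/ and /a/ "share" forms:

```python
import re

# Matches full question URLs and the short "share" forms /q/ID and /a/ID.
SO_LINK = re.compile(
    r"https?://stackoverflow\.com/(?:questions/\d+[^\s)]*|[qa]/\d+[^\s)]*)"
)

def extract_so_links(text: str) -> list[str]:
    """Return every Stack Overflow link found in a blob of text
    (a source file, a commit message, ...)."""
    return SO_LINK.findall(text)
```

Feed it each file's contents (e.g. via pathlib.Path.read_text()), then probe the hits to find the dead ones.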

..how can we in the future effectively warn people to protect themselves from such dangers, such as to not use Stack Overflow, specifically for documentation purposes?

You mean, how not to rely on links as documentation? One option is a commit check that detects links in a commit and warns the author to make sure the links are purely optional, i.e. that there is sufficient information even without them.
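Such a check could live in a pre-commit hook. A sketch of the core, warn-only logic (the URL regex and the hook wiring are assumptions, and nothing here should block a commit):

```python
import re

URL = re.compile(r"https?://\S+")

def added_lines_with_links(diff_text: str) -> list[str]:
    """Return added lines (single '+' prefix, not the '+++' file header)
    from a unified diff that contain a URL, so a hook can ask:
    is the linked content summarised in the commit itself?"""
    hits = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++") and URL.search(line):
            hits.append(line[1:].strip())
    return hits
```

A .git/hooks/pre-commit script would run this over the output of `git diff --cached`, print the hits as warnings, and always exit 0 so commits are never blocked.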

By the way, there is also a slight legal aspect to this. If the use of the code was not fair use or for educational purposes, the links may need to stay (dead or alive), because they constitute the attribution required by the license of content from here. But I'm not a lawyer; I may be totally wrong.

2
  • 3
    I needed to check thousands of diverse links over a couple of years and on average I saw 3% link rot per year. Commented Oct 30 at 21:10
Not relying on external systems (because that's what a link basically represents) is, in my experience, a matter of using tools properly. Most environments use an issue tracker such as Jira, for example, but in the majority of them tickets exist only for Jira-driven development; they're not a knowledge base. They should be. Commented Oct 31 at 11:00
