The marketing blog post series linked in the question starts with
If you’re weary of reading about the latest chatbot innovations and the nine ways AI will change your daily life next year, this series of posts may be for you.
Perhaps if the rest of the series and the company's policy about AI were as self-aware, I'd be a little more hopeful about this.
It carries on to say
Consequently, in the current transformation, human-centered sources of knowledge are obscured. We face a world in which the old paradigms are no longer paramount, and their places in the world are redefined.
Now, this post has my name on it. As it's Creative Commons licensed, someone who wants to reuse this content and wants to do it the right way would attribute it to my username - not just the site it was on. Oftentimes, especially in a smaller internet community like this, the source and provenance matter as much as the content.
Essentially, LLM providers act parasitically, almost voraciously, sucking up resources and content with very little consideration for the commons, the public good, or the communities they take content from, and somehow I don't feel normalising or glorifying this is for the best.
Companies and organizations lucky enough to host these engines of knowledge production are at a decision point; how do they continue to monetize when the technological landscapes have changed?
Perhaps the reason SE's weathered many storms is luck, but this site and the knowledge within exist because of a solid foundation set up years ago, and the significant experience, curiosity and knowledge held by its users.
A blind focus on monetising doesn't always end in success. SE's commercial products are almost always going to rely on the community as their driving force - whether it was Careers 1.0 or Teams. Yet very often it feels like SE's struggling to find its *AAS while the community's needs and desires get pushed aside on the promise of 'we'll look at it when we have the resources'.
If you believe in luck, buy a lottery ticket. A lot of work went into what SE is today.
Knowledge-as-a-service: The future of community business models
I'll zoom in on the 'new' problems you've cited. The beauty here is...
...none of those are our problems.
Let me pick these apart:
"Answers are not knowledge": having a pool of subject matter experts and enthusiasts rather than what's essentially a black box that strings words together is why our answers contain knowledge. It's not just that LLMs lack context; they lack imagination, intellectual curiosity, and the ability to reason and make connections.
"The LLM Brain Drain" assumes that there's any actual intelligence there.
A healthy community challenges itself and learns through reinforcement. We find 'real' problems we face; we share, not take; we have the freedom and intellectual curiosity to try to solve and resolve our own problems. We get nerd sniped.
Rather ironically, the network's strict rules, aggressive meritocracy and quality standards are why we're a good source of information, but also why people complain about us.
Our 'community brain drain' has different causes, and these are probably more critical for the ongoing survival of our communities.
- Developers lack trust in AI tools
Well, yes. So do many members of the community, and AI's a controversial topic here. There's a good reason for this: AI and the organisations that promote it often haven't proven trustworthy. The focus on AI as a substitute for the network also leaves something out: for people who don't want an AI tool - who want trusted, human-driven answers and feedback - we literally have a full suite of tools.
I'd finish my critique of this post with this:
Stack Overflow and the larger Stack Exchange community need to be direct about our needs
We have always been direct about our needs. They're often ignored, or we get promised it'll get taken care of. The day we're not direct about what we need, y'all can probably shut down the network. Personally, I'm not opposed to LLM companies paying SE. I'm opposed to LLMs being the one basket the company puts all its eggs in. I'm convinced the bubble will burst and that the GenAI industry isn't a sustainable, long-term future.
If the company, which we've worked with for over a decade, isn't listening, how can we expect LLM companies to?
Attribution as the foundation of developer trust
For all the criticism SE gets - it is rarely that our knowledge is untrustworthy. That some folks find us intimidating, and on-boarding is hard? Sure. That things get outdated? Absolutely. But our answers are posted, refined, and occasionally questioned.
It seems very strange that while developers - the core audience of this network - distrust GenAI, a handful of organisations has decided this absolutely must be the future and wishes to force-feed it to us. Force-feeding never ends well for the goose.
I'm all for getting a fair deal for the community. It's just the strange love affair with GenAI despite the pushback that gets me. There seem to be lots of attempts to 'sneak' in GenAI 'features' we don't need, or to change the social contract around that sort of thing. And yet your own surveys are telling you developers don't trust GenAI.
Ongoing community data protection
Some of the assertions made here seem... charmingly optimistic. LLM providers have been shown to ignore robots.txt files and to train on pirated material.
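For context on what 'ignoring robots.txt' means in practice: robots.txt is purely an honour system. A compliant crawler is expected to check the rules before fetching anything, as in this minimal Python sketch (the robots.txt content and the browser user-agent name are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt of the kind many sites now publish to
# opt crawlers used for LLM training out of their content. The
# user-agent names here are examples; each provider documents its own.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks before fetching each path.
print(parser.can_fetch("GPTBot", "/questions/123"))          # False
print(parser.can_fetch("ExampleBrowser", "/questions/123"))  # True
```

The catch is that `can_fetch` only tells a crawler what the site asked for; nothing technically stops a scraper from skipping the check entirely, which is exactly the behaviour being complained about.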
The blog post backs some of this up:
In the last year, Stack has seen numerous data points that suggest LLM providers have escalated their methods for procuring community content for commercial use. In the last few months, non-partners have begun posing as cloud providers to hack into community sites, which led us to take steps to block hosted commercial applications that do so without attributing community content.
On the other hand, I'm a little skeptical about the following part of the paragraph:
At the same time, these strategies will help turn potential violators into trusted customers and partners by re-directing them to mutually beneficial pathways for all parties. (This also serves as a reminder for users of all tools and services to pay close attention to terms, conditions, and policies to know what you agree to.)
The tricky part is really not treating your users like they're going to steal their own data.
Many of the choices made to 'safeguard' or add guard-rails to data access have eroded community trust, created friction for community-run tools, or made quality of life worse for users.
For example, academic institutions wishing to use data for research purposes or communities looking to guard their collective work against unexpected systemic failure should not have their legitimate activities restricted.
Practically, and rather ironically, it took the community a matter of days to build a tool that downloaded the individual site data dumps. There's supposed to be a correct way to request a full data dump, but as of the date of posting I'm still waiting for my request to be fulfilled.
I choose to believe the intent is there, but if a community member with somewhat deeper reserves of patience and a direct-ish line to staff can't get a data dump legitimately in a reasonable period of time, one would hope an academic institution requesting data could get it before their undergrads become professors. The tools for access should be built alongside the tools for restriction; as it stands, those legitimate activities pretty much are restricted.
Let's be very honest - this is not about protecting data for the community. It's about protecting potential revenue. I've not seen as much movement in dealing with the community's needs as I'd like. I'm disappointed, but I know others are furious.
"Benefits for all"
Some really neat technology has been a solution looking for a problem: 3D TVs, the 'metaverse', blockchain, and now LLMs.
And yet, here I am, on a 2D screen, on a text-based platform that runs on a traditional web platform.
The thing is, knowledge survives because people want it to. I often use SuperUser as a way to collect/store things I've learnt over the years. If it died, I'd probably grieve a little, and find somewhere else to do these things. Maybe post stuff on my blog, or some other site.
I'm going to follow on with something Jeremy quoted from the question in the comments:
"An existential threat looms over the world of human knowledge preservation"
Life finds a way. There's a certain hubris in assuming humans don't preserve knowledge without the help of large platforms; quite the contrary. Large organisations often lack the focus to preserve knowledge. The BBC destroyed many old tapes, and a lot of old content only survives because an individual taped it. Humans are hoarders. We pass down stories, teach skills, read, and write.
People will write, share knowledge, get excited about the ugly, cable-tied personal project when it first whirs to life. And if LLMs are such a threat, I'm not entirely sure how becoming part of the problem by appeasement helps with long-term community health.
I'm reminded of one of my favourite works of J. Michael Straczynski, a poster he did called Never Surrender Dreams - some excerpts of the full work feel relevant to me personally:
Children sing and dance spontaneously, tell stories without fear, reveal their thoughts without inhibition, and reach for what logic tells us should be unattainable. We do, we explore, we ask questions; we pursue our heart's desires, we dream of achieving greatness.
We're entirely capable of preserving our stories and thoughts.
And, well, this is what makes SE a great place. And yet, we're told that putting all those things into a black box is essential to preserve knowledge.
But as time passes, we learn fear, we learn to second-guess ourselves, and we learn to suspect our abilities and our desires. We are told that some people tell stories, some people dance, and some people sing, but these things are not for everyone.
And well, perhaps these things are for no-one. ChatGPT won't judge you, will it? And yet it feels like the 'story' here is that places like this, communities of people like us, aren't going to survive LLMs.