If we assume that at least some level of experimentation is desired, we should be willing to accept some disturbances during experiments and some additional cleanup effort afterwards; it will not always be avoidable. Disturbances of regular operation should nevertheless be kept to a minimum and communicated in a way that makes clear why they are needed. There should be a plan, made beforehand, for how to recover and clean up afterwards, and that plan should then be executed.
Example: Recent answer bot answers are not included in the data dump (good), have unclear content licenses (bad), and seem to remain on the sites even though the experiment failed (bad). It would have been better if all impacts of this experiment had been cleaned up.
Example: For the 1-rep-vote experiment (which wasn't conducted but could have been), one could have recorded votes from 1-rep users for a short time only (say two weeks), analyzed their impact, and then tried to undo that impact as much as possible (vote reversal is available and is done fairly regularly). If some residual effects (badges, caps) remained, that would be okay in my eyes. For science.
Judging from the past, I would say that the whole community, not only selected members, should be consulted already during the planning phase of an experiment.
Example: Before the beta rollout of Collectives and Articles, selected members of the community were consulted. The feature wasn't very well received, wasn't very successful, and slowly died in the years that followed. It seems as if consulting only a few members doesn't give enough significant input; it's better to gather as much feedback as possible.
What's striking to me, though, is the half-bakedness of it all. Collectives and Discussions received very few changes after the initial rollout. It's almost as if the company didn't want the features to succeed, which looks a bit like a waste of resources.
Example: The Staging Ground worked when introduced, but took multiple rounds (although I just remembered that the company gave up on it in the middle of the first round). The trending sort order also took multiple rounds of back-and-forth with the community (although I just saw that it apparently still isn't available outside of SO; why not?). The unfriendly comments robot had at least two versions/iterations (although it's apparently not in use anymore; why not?).
I think that meta Q&As are already quite effective for communication, but maybe there are better ways to structure discussions about experiments. Maybe dedicated "folders" for experiments, where all Q&As related to a specific experiment are kept together? This could be realized with the tag system or something else. In general, though, I'm happy with the existing framework.
I have a hard time coming up with generally applicable guidelines, but I think I can tell that an experiment might work when I see it. So including the community in all stages might be a good idea, and one should try to be as clear as possible with problem and solution descriptions.
As for size, I guess experiments can come in all sizes (1-rep voting would be a very small change to the software). It makes sense to go in smaller steps and check back often, but of course sometimes you have to make a larger jump if there aren't any reasonable intermediate steps available.
And we should be prepared to see experiments fail frequently. There are probably many more ways to do things wrong than right, but that still doesn't mean that the current state is the best possible, or even close to it. Failed experiments should not cast doubt on whether system-wide change is possible, nor taint the underlying ideas. One can fail, try again, and succeed the next time.
And a final example: the dedicated thank-you feature. Supposedly 1/6 of all comments are thank-yous, and we could get rid of these somehow. The company thought that a dedicated thank-you button, which is otherwise useless, would be a good idea for that; the downside would be competition with the upvote button. After one recent experiment (which did a couple of things at the same time, which would typically be bad practice), the company concluded that there is a 10% reduction in thank-you comments (which isn't much) and will create the button. The metric is fine; the feature set is questionable (a button that otherwise does nothing but still competes with the upvote button); yet the problem is not solved, while there would be much better approaches: e.g. AI-assisted thank-you comment detection with a note to upvote instead, plus automatic scheduled comment deletion after X days, which would surely find more than 10% of thank-you comments and would not compete with upvotes but support them, especially considering the technology is quite simple nowadays. The community advocated for this. If a far suboptimal solution also counts as a failure, then this experiment failed very early, when possible solutions were sought; everything afterwards couldn't correct for the initial conceptual problem. Some might say that this experiment wasn't really necessary in this form.
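To make the alternative concrete, here's a minimal sketch of what such a pipeline could look like. Everything in it is hypothetical: the keyword heuristic merely stands in for a real AI classifier, and `grace_days` corresponds to the "X days" above:

```python
from datetime import datetime, timedelta, timezone

# Crude stand-in for an AI classifier; a real implementation would call a
# trained model or an LLM. This heuristic only flags very short comments
# that consist essentially of a thank-you phrase.
THANK_PHRASES = ("thank you", "thanks", "thx", "this worked")

def looks_like_thank_you(text: str) -> bool:
    text = text.strip().lower()
    return len(text) < 40 and any(p in text for p in THANK_PHRASES)

def process_comment(comment: dict, grace_days: int = 14) -> dict | None:
    """If a comment is detected as a pure thank-you, return an action that
    notes the author should upvote instead and schedules the comment for
    deletion after a grace period."""
    if not looks_like_thank_you(comment["text"]):
        return None
    return {
        "comment_id": comment["id"],
        "note_to_author": "Please upvote instead of commenting thanks.",
        "delete_after": datetime.now(timezone.utc) + timedelta(days=grace_days),
    }

# Example usage: a pure thank-you comment gets flagged and scheduled.
print(process_comment({"id": 42, "text": "Thanks, this worked!"}))
```

The point of this shape is that detection plus a nudge channels existing gratitude into upvotes instead of adding a second button that competes with them.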