2
$\begingroup$

My specific question is about using a Mann-Whitney U test (MWUt), meant for 2 independent samples, on data which is actually paired (i.e. the 2 samples are correlated).

I have searched this site, and the web, and I did not find any definitive answer.
Most of what I found was about the paired t-test vs. the 2-sample t-test. But that case is not of great practical interest: why would one not use the paired t-test (higher power, and no need to be concerned about the non-independence)?
And what I found was generally vague, but along the lines of a) it should be OK, b) you may lose some power, and c) you may get fewer Type I errors (which is not really a bad thing).

But for the MWUt, there is no corresponding paired test which tests the exact same null hypothesis (stochastic superiority) on paired data. The Sign test tests the median of the paired differences, the Wilcoxon Signed Rank test tests their pseudomedian, and neither tests stochastic superiority. So what is one to do when faced with paired data, and interested (regardless of the reason) in stochastic superiority? Will the results of a MWUt on such paired data be incorrect, or will they be usable?

And, more generally, what happens in other scenarios (paired t-test vs. independent t-test, or the Sign test, which is paired, vs. the Median test, which is not)?

$\endgroup$
5
  • $\begingroup$ Depends on what you mean by words like valid and usable (e.g. if usable means publishable, we can't be expected to address the opinions of your colleagues). Significance level and power can both be an issue, but power properties for nonparametric tests are not distribution-free. Some generalities result from the impact on significance level, but not specific amounts in general, and that's not the whole story for power. In general, simulation will be a good approach if you have a good sense of what you want to know. If you have specific classes of alternative in mind, more can be said. $\endgroup$ Commented yesterday
  • $\begingroup$ @Glen_b, by "valid" and "usable" I mean that, if and when I get a significant result (for which I may experience diminished power), I can believe it, at least as much/often as if I had used a paired test (if it exists...). By invalid, I would then mean that my Type I error rate is much larger than my stated significance level. $\endgroup$ Commented yesterday
  • $\begingroup$ "when I get a significant result ... I can believe it" - two issues here: 1. It seems to ignore the possibility of ending up with a very conservative test - if the test could be quite conservative you might be getting almost no rejections at all, perhaps even at large effect sizes (and I am curious why that wouldnt concern you). 2. It depends on whether you mean "Do I get more false rejections than I should?" (a type I effect) or "Are my rejections (regardless of how rarely they occur) going to incur a substantially higher proportion of false rejections" which is not just type I. $\endgroup$ Commented yesterday
  • $\begingroup$ @Glen_b, 1) I am not (too) worried about loss of power, because this is the nature of my data (e.g. I have ordinal data, or I did not collect pairing information), so I will have to compensate with larger sample sizes (just like I would if I measured by attribute (e.g. pass/fail) as opposed to by value (ratio scale)). The goal is to do the best that the data allows. 2) As long as Type I errors remain "close" to the chosen significance...ctd $\endgroup$ Commented yesterday
  • $\begingroup$ ...ctd. And wrt your 3rd, valid, point (substantially higher proportion of false rejections), I am not considering it, because then we get into issues of prevalence, positive predictive value, priors, etc. (à la Ioannidis, "Why Most Published Research Findings Are False"). It is a very valid concern, but here I am staying in the narrow, frequentist, NHST scope, where we really do not worry about it... So I will take the loss in power (because it cannot be avoided), unless it comes at the cost of a substantially larger number of Type I errors. $\endgroup$ Commented yesterday

3 Answers

5
$\begingroup$

There are a few different issues in this question.

First, what is the effect of using an independent test when the 2 samples are not independent (paired)?

When 2 summary statistics (mean, median, pseudomedian, etc.) are independent, the variance of their difference is just the sum of the 2 variances: $Var(A-B) = Var(A) + Var(B)$. This is simplest to show with normally distributed means, but it holds for other summaries as well. When the 2 summaries are not independent, we need to account for the covariance as well: $Var(A-B) = Var(A) + Var(B) - 2Cov(A,B)$. If the covariance is positive (the most common case in paired designs), then the independent test will simply use too large an estimated variance (usually through the standard error in the actual test). This is where the statements about being conservative or lowering the Type I error rate come from (note that it also lowers the power of tests and the precision of confidence intervals, so it may be a bad thing). I see this most often when someone does pre and post surveys on the same sample of subjects but did not include anything in the survey to pair the answers, so we do the independent test/interval with the note that it will be conservative if the covariance is positive.

But one important thing that is often skipped over (because it is rare) is that if the covariance is negative, then things swap: the variance/standard error of the independent test/interval is now too small instead of too big, which will increase the Type I error rate and lower the true confidence level of intervals.
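To see how big these effects can be, a quick simulation helps. Here is a minimal sketch (my own illustration, not from the question: it assumes Python with numpy/scipy, normal marginals, and a chosen within-pair correlation $\rho$) that estimates the Type I error rate of the unpaired MWUt applied to paired data:

```python
# Minimal simulation sketch (hypothetical parameters): estimate the Type I
# error rate of the unpaired Mann-Whitney U test when the data are actually
# paired, for positive, zero, and negative within-pair correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_pairs, n_sims, alpha = 30, 5000, 0.05

for rho in (0.8, 0.0, -0.8):            # within-pair correlation
    cov = [[1.0, rho], [rho, 1.0]]      # equal marginals, so the null is true
    rejections = 0
    for _ in range(n_sims):
        ab = rng.multivariate_normal([0.0, 0.0], cov, size=n_pairs)
        a, b = ab[:, 0], ab[:, 1]
        # unpaired test applied to paired data
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        rejections += (p < alpha)
    print(f"rho = {rho:+.1f}: estimated Type I error = {rejections / n_sims:.3f}")
```

With $\rho > 0$ the estimated rate should come out below the nominal 0.05 (conservative), and with $\rho < 0$ above it, mirroring the variance argument above.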

The second issue is how to do the MWUt while accounting for pairing, when you are not happy with the sign test or signed rank test. Here I would suggest doing your own permutation test.

The MWUt is already a special case of permutation testing. The original tables were created by looking at all the possible permutations of ranks for given (small) sample sizes, where all the permutations were based on independent (non-paired) sampling. You can instead create your own table/critical value/p-value by permuting pairs. For a small number of pairs, you can create all $2^{n}$ permutations (where $n$ is the number of pairs), calculate the MWU statistic for each permutation, then see where the MWU statistic for your original data falls in this null distribution. If that is too many permutations, then you can instead randomly permute within pairs, compute the MWU statistic, and repeat a bunch of times to generate the null distribution. I expect that the result will be pretty similar to the signed rank test results.
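To make this concrete, here is a minimal sketch of the random-permutation version (my own illustration in Python with numpy/scipy; the data and the 5000 resamples are placeholders, not from the question):

```python
# Minimal sketch (hypothetical data): permutation test for the MWU statistic
# that respects pairing, by randomly swapping members within each pair.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = np.array([3.1, 4.0, 2.8, 5.2, 4.4, 3.9])   # placeholder paired samples
b = np.array([2.9, 3.5, 3.0, 4.1, 4.0, 3.2])

observed = stats.mannwhitneyu(a, b, alternative="two-sided").statistic

n_perm = 5000
null_stats = np.empty(n_perm)
for i in range(n_perm):
    swap = rng.random(a.size) < 0.5             # flip a coin for each pair
    pa = np.where(swap, b, a)                   # within-pair swapped copies
    pb = np.where(swap, a, b)
    null_stats[i] = stats.mannwhitneyu(pa, pb, alternative="two-sided").statistic

# two-sided p-value: under within-pair exchangeability, swapping every pair
# maps U to nm - U, so the null distribution is symmetric about nm/2
center = a.size * b.size / 2
p_value = np.mean(np.abs(null_stats - center) >= np.abs(observed - center))
print(f"U = {observed}, permutation p = {p_value:.4f}")
```

The exhaustive $2^{n}$ version just replaces the random coin flips with all possible swap patterns (e.g. via itertools.product).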

$\endgroup$
1
  • $\begingroup$ Thanks, this makes sense wrt a biased estimate of the sd. But the MWUt does not use an estimate of the sd? Yes, the MWUt suffers from the Behrens-Fisher problem. But then one could use Brunner-Munzel. So how could the MWUt/BMt be affected by a nonzero covariance? $\endgroup$ Commented yesterday
4
$\begingroup$

The key fact about the t-test is that it tests the mean, and the mean difference over pairs is the same as the difference in means between groups: $$E[A-B]=E[A]-E[B]$$ This equality is what lets you have paired and unpaired tests that agree. The only problem when using a paired test of means with independent data is that you lose some degrees of freedom.

That's not true for basically any other summary statistic. A test for the median of the paired differences is not the same as a test for the difference in medians, and so on. In general, there is no way to test a hypothesis about equality of some group-level summary statistic using a summary of the paired differences.
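A tiny numeric illustration (made-up numbers, just to show the contrast with the mean):

```python
# Made-up numbers: the mean is linear, so mean(A-B) = mean(A) - mean(B),
# but the median of the differences need not equal the difference of medians.
import numpy as np

a = np.array([1.0, 2.0, 10.0])
b = np.array([0.0, 5.0, 6.0])

print(np.mean(a - b), np.mean(a) - np.mean(b))        # 0.667 vs 0.667 (equal)
print(np.median(a - b), np.median(a) - np.median(b))  # 1.0 vs -3.0 (not equal)
```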

$\endgroup$
3
  • $\begingroup$ Thanks. Out of curiosity, could there be a counterexample of 2 random variables, X and Y, where Y is correlated with X, with X and Y having different medians, but the median of the paired differences ($x_i-y_i$) is 0? I have been thinking about how to construct one, and I cannot figure it out. I can easily conjure up such samples, but then the correlation is dubious at best... $\endgroup$ Commented yesterday
  • $\begingroup$ And I understand your point; there cannot be a paired equivalent of, e.g., the MWUt, because "there's no way to test a hypothesis about equality of some group-level statistic using paired differences". Then my question is: what would happen if I run the test at the group (assumed independent) level, when the data is actually paired? What are the downsides? Are significant results believable, or are they nonsense? $\endgroup$ Commented yesterday
  • $\begingroup$ You'd be testing a different null and alternative. The p-values would be p-values, but it would be hard to get any useful interpretation of anything beyond the p-values. $\endgroup$ Commented yesterday
2
$\begingroup$

First, you will lose a lot of power. You are losing all the things that make your subjects match. If your subjects are people, then there are a bazillion traits that will make people similar to themselves.

Second, you are violating the assumption of independent errors. I don't have a specific answer for how that will affect the Mann Whitney test, but, in general, this is a very important assumption.

Third, I'm not sure you are right about the tests you mentioned. See this thread.

Finally, if that doesn't suit your needs, you can always come up with some measure of dominance and then do a randomization test of some sort.
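For example, a minimal sketch (my own illustration with placeholder data; the choice of statistic and resample count are mine, not the answer's) using the estimated probability of superiority, $\hat{P}(X>Y)$ with ties counted half, as the measure in a within-pair randomization test:

```python
# Minimal sketch (hypothetical data): randomization test on a superiority
# measure, the estimated P(X > Y), randomizing within pairs to respect pairing.
import numpy as np

rng = np.random.default_rng(1)
x = np.array([3.1, 4.0, 2.8, 5.2, 4.4, 3.9])   # placeholder paired samples
y = np.array([2.9, 3.5, 3.0, 4.1, 4.0, 3.2])

def prob_superiority(u, v):
    # fraction of all (u_i, v_j) comparisons where u wins; ties count half
    diff = u[:, None] - v[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

observed = prob_superiority(x, y)

n_perm = 5000
null = np.empty(n_perm)
for i in range(n_perm):
    swap = rng.random(x.size) < 0.5             # randomly swap within pairs
    null[i] = prob_superiority(np.where(swap, y, x), np.where(swap, x, y))

# two-sided p-value: distance from the null value 1/2
p_value = np.mean(np.abs(null - 0.5) >= np.abs(observed - 0.5))
print(f"P(X>Y) estimate = {observed:.3f}, randomization p = {p_value:.4f}")
```

This is the same within-pair permutation idea as in the other answer, since $\hat{P}(X>Y)$ is just $U/nm$; the only difference is that the statistic is reported on the superiority scale.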

$\endgroup$
1
  • $\begingroup$ Thanks for the link, but I do not think it is relevant; it is about stochastic dominance ($F(x) \ge G(x)\ \forall x$), and not about stochastic superiority ($P(X>Y) \ne P(Y>X)$ for randomly selected X and Y). Dominance implies that one CDF is always smaller than the other; superiority allows the CDFs to criss-cross multiple times. Mann-Whitney and Kruskal-Wallis are tests of stochastic superiority, not dominance. Yes, dominance implies superiority (I think?), but not vice-versa. And a greater median (or mean, or pseudomedian) does not imply superiority, or dominance. $\endgroup$ Commented yesterday
