6
$\begingroup$

At the request of a reviewer who asked us to test for outliers, and at the suggestion of my research director who mentioned Grubbs’ test, I am trying Grubbs’ test as one approach to detecting outliers in our linear regression model with a small sample size (n = 23). (Our linear regression has 2 independent variables, one continuous and one binary.) On which type of residuals is it best to do Grubbs’ test? Studentized residuals? (I know of raw, standardized, and studentized residuals.) (Please note that my statistical knowledge is very limited.)

$\endgroup$
4
  • 2
    $\begingroup$ Please edit the question to say more about why you want to do this. With so few samples it will be very difficult to distinguish an "outlier" from the variation expected from a normal distribution of residuals. If you then re-do the analysis without "outliers" you could bias the results. If you show a quality-control plot of your regression model you might get suggestions about better ways to proceed. See this page. $\endgroup$ Commented Apr 27 at 19:56
  • 2
    $\begingroup$ (Leaving aside that, given the circumstances in which people are often trying to identify and exclude outliers, it's often not the best approach, and that testing, particularly when not based on its impact on what you're doing, is often solving the wrong problem) ... A big issue with applying any test of "outliers" to residuals is the ability of any sufficiently influential point to make its residual small (self masking), and even if you use a residual that excludes the point in its own fitted value and s.e., with a second such point they can mask each other. It's important to consider influence. $\endgroup$ Commented Apr 28 at 0:04
  • 3
    $\begingroup$ Family names like Grubbs are evidently hard even for some people who have English as their first language to handle. But as said the name here is Grubbs and allowed possessives are Grubbs' and Grubbs's with perhaps a majority preference for the first. Grubb's would only be right if the statistician concerned had the surname Grubb, but he didn't. Mistakes like this can be infectious for the extra reason that people rarely check the original paper. See my answer in Meta for more such examples: stats.meta.stackexchange.com/questions/2810/… $\endgroup$ Commented 2 days ago
  • 1
    $\begingroup$ Please be aware that the conditions required for Grubbs' test to be reliable often do not apply or cannot be verified. These conditions include approximate normality of the data, approximate independence, and the fact that the test (correctly) seeks a single outlier even though multiple outliers might exist. Often a better procedure is to use an outlier-resistant regression method. If the differences in your results are immaterial to your conclusions, you have supplied all the evidence the reviewer seeks. Otherwise, you also know how sensitive your conclusions are to possible outliers. $\endgroup$ Commented 2 days ago

3 Answers

8
$\begingroup$

First, any outlier that is significant on Grubbs' test, or any other test, on any variant of residuals, will, with N = 23, surely pass the IOTT. That's the interocular trauma test. It hits you between the eyes.

Second, in the old days (before powerful and ubiquitous computers) it was hard to do many kinds of regression. So, we tried to make the data fit the model. Today, we can model much larger data sets than yours with all sorts of models. If you suspect outliers (either a priori or by visual inspection) then you can do robust regression or quantile regression, or perhaps (depending on the nature of your variables) transform something.
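For instance, a minimal R sketch of both alternatives (my own illustration, not from the answer; dat, y, x1, and x2 are hypothetical names for the question's data frame, outcome, and two predictors):

    library(MASS)      # rlm(): robust regression by M-estimation
    library(quantreg)  # rq(): quantile regression

    fit_ols <- lm(y ~ x1 + x2, data = dat)
    fit_rob <- rlm(y ~ x1 + x2, data = dat)            # downweights large residuals
    fit_med <- rq(y ~ x1 + x2, tau = 0.5, data = dat)  # median regression

    # If the coefficients barely move, possible outliers are immaterial.
    cbind(OLS = coef(fit_ols), robust = coef(fit_rob), median = coef(fit_med))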

The link EdM gave has a lot of good info on outliers in general and for regression.

EDIT: Given that this is "at the request of an editor" (per your edit), I would still not do it. Rather, I would cite the answers here (and in the link Ed posted).

I often do statistical editing for journal articles and I get to see that other editors (not trained in statistics, or only minimally trained) often make rather unfortunate requests. If you have information on who this editor is (i.e. are they a statistician?) you can pitch your argument differently - even statistical editors can make mistakes, but this sort of mistake is more likely from someone who took stats back in grad school and vaguely remembers "outliers bad".

$\endgroup$
2
  • 2
    $\begingroup$ Many statistical people who are not statisticians in any strict professional sense do give competent advice. So the test of whether an editor is a statistician is not quite right. $\endgroup$ Commented 2 days ago
  • $\begingroup$ @NickCox I guess it depends on what you mean by "statistician" exactly. Certainly you don't need a PhD in stats to give good advice on articles. $\endgroup$ Commented 2 days ago
5
$\begingroup$

I will express, in my own terms, what pretty much all answers in the post linked by @EdM said, namely, "Do not remove outliers unless you are certain they are erroneous."

I have learned to profoundly dislike the term “outlier”. Not because it is an incorrect characterization of some datapoints (after all, it perfectly describes datapoints which lie far out from the bulk of the other datapoints), but because it has taken on the meaning of “bad data which should be removed”.

I will further state that there is no such thing as an “outlier”, at least not in the sense it has acquired now (bad data to remove), and will instead use the original (and unloaded) Tukey term of “far out”.

There can only be two kinds of datapoints: observations coming from the data generating process (DGP), and observations which do not come from the DGP, and were instead obtained by “error”. And by error I do not mean typical measurement uncertainty (within the expected accuracy and precision of the measurement system), or typical variability of units being measured, etc. I mean a gross error: sampling error (measuring a unit which is not part of one’s population of interest), measurement error (uncalibrated measurement instrument, measuring instrument set to the wrong scale, wrong measuring system used, etc.), or human error (error in reading the measured value, or transcription error, etc.).

The first, no matter how “far out” they may be, come from the DGP, and thus rightfully “belong” to the dataset. Removing them amounts to nothing other than falsifying the dataset. The second do not come from the DGP, thus do not “belong” to the dataset, and should be removed, if at all possible.

Note that “far out” values can very well come from the DGP (after all, a normal distribution is defined over $(-\infty,+\infty)$).
And “errors” do not always generate “far out” values; a datapoint could be affected by an “error” but generally fit with the rest of the datapoints.

Now how do you tell “far out” (but DGP-generated) observations from "far out" “errors”? With great difficulty, and great care...

Sometimes, it is easy; if I measure the weight of adult females, and see a measurement of 19 lbs, it is clearly an error; it could perhaps have been 91 lbs, or 109 lbs, or even 190 lbs, but 19 is a mistake, and I can (and should) safely remove it. But a weight of 102 lbs entered as 120 lbs (or vice versa) is not really detectable. And a weight of 190 lbs, or even 290 lbs, while certainly “far out”, could be a correct entry. I may investigate such an entry further (e.g. by talking to the person(s) who collected the data, to see if they remember such a probably memorable subject), but unless I obtain compelling evidence that no such subject could possibly have been measured, I will have to keep that datapoint.

The fact that a datapoint does not fit some preconceived notion of what the data should look like, is not an issue with the data; it is a problem with our preconceived notion.

As statisticians, data analysts, data scientists, etc., our only ground truth is the data; after all, our only justification for reaching our conclusions and recommendations is “that is what the data says”. If we start to remove data, just because we feel it “does not fit”, we have made a big step towards charlatanism.

Note that Grubbs’ test (basically a z-score test; by the way, it does not matter much which form of residuals you run it on) assumes that the underlying distribution is normal; why would a point 5 sd away from the mean be an "outlier", when a normal distribution is defined over $(-\infty,+\infty)$? Such a point is completely plausible. Similar objections could be made for all other outlier detection methods; they are all completely arbitrary, and do not detect "errors", just "far out" points. But not all "far out" points are errors, and not all errors are "far out". The only way to catch "errors" is from the context of your measurements, the very specifics of your data, instruments, etc., and your domain knowledge, not because a point is too "far out".
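To make the z-score point concrete, here is a small R sketch (mine, not part of the original answer), assuming a fitted lm object fit; grubbs.test() comes from the CRAN package outliers:

    library(outliers)                   # provides grubbs.test()
    r <- rstudent(fit)                  # externally studentized residuals
    G <- max(abs(r - mean(r))) / sd(r)  # Grubbs' statistic by hand: the largest z-score
    grubbs.test(r)                      # the same statistic, with its null distribution

Feed it raw, standardized, or studentized residuals and it will flag essentially the same most extreme point; the test is a relabeled maximum z-score.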

So if you can make a solid case that a “far out” point is an error, then by all means remove it (it was not generated by your DGP). But if you cannot confidently ascribe it to an “error”, integrity (of the data, and of your professional reputation) dictates that you keep it. You can find a similar cautious approach to “far out” observations (and maybe a little less extreme than mine?) here.

As an aside, and to confirm that not all “far out” observations are bad, and that it pays to investigate them further, you may want to read “Outliers are pure Gold”. Indeed, such observations are often an invaluable source of knowledge about your DGP.

It is not surprising to hear the remarks from your reviewer and director; they are typical of people who do not have much to say about your work but who, by sending you on a wild goose chase for outliers, get to sound wise and feel they contributed.
So what can you do and tell them? Run a Q-Q plot of your (raw) residuals, to ensure they are not "too non-normal" (that is just a standard diagnostic check for your regression). Such a plot may also let you identify any "far out" points; if you find some, investigate them. Remove them if and only if you are virtually certain they are "errors". You can also run the regression with and without them, as a sensitivity analysis. Typically, high leverage points are points where the x-value is very far from the bulk of the x-values, but the y-value is within the bulk of the y-values.
If you have such a point, it is permissible to remove it, not because it is an outlier, but because it is outside the x-range of your model (the datapoint is isolated in x; you do not have enough data to extend the range of your model that far).
Then report what you did: you looked for errors and high leverage points and did not find any, or found some errors and removed them, or found some potential high leverage points and did a sensitivity analysis and decided to keep them (after all, they did not change the model much), or decided to remove them (x was out of range), etc.
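In R, this workflow might look like the following sketch (mine; dat, y, x1, and x2 are hypothetical names for the question's data):

    fit <- lm(y ~ x1 + x2, data = dat)
    qqnorm(resid(fit)); qqline(resid(fit))  # Q-Q plot of the raw residuals

    h <- hatvalues(fit)                     # leverages; large values = isolated in x
    suspect <- which(h > 2 * mean(h))       # a common rule of thumb, not a law

    # Sensitivity analysis: refit without the suspect point(s) and compare coefficients.
    if (length(suspect)) {
        fit2 <- update(fit, data = dat[-suspect, ])
        print(cbind(full = coef(fit), trimmed = coef(fit2)))
    }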

$\endgroup$
2
  • $\begingroup$ The question What is an outlier? is here answered as: whatever does not come from the DGP -- to which the natural riposte is: What is the DGP? Your analysis is more subtle and statistically informed than many, but it too risks vacuity or does not match many set-ups. For example, some people (not me) deal with mixes of genuine transactions or creations and fraudulent or forged ditto, so then which is the DGP in that case? (I think I've made this point before, but so have you!) $\endgroup$ Commented 2 days ago
  • $\begingroup$ @NickCox, my point is that anything that cannot be positively identified as an error must be assumed to come from the DGP. Yes, data could be coming e.g. from a mixture of DGPs, or a forgery DGP, etc. Then we are back to the basic statistical assumption of i.d., which is near impossible to "prove" and has to be assumed. The whole removal of outliers is based on assuming a single DGP, with a few "far out" points to remove. My point is that "far out" is an invalid criterion; only error is a valid criterion for removal. One does not need to know what the DGP is, just that it exists and is i.d. $\endgroup$ Commented 2 days ago
3
$\begingroup$

I see two issues here: when do tests for outliers of residuals make sense, and which test(s) to perform when they do.

When to evaluate the residual distribution

In fairness to your reviewer, there is a situation in which evaluating the distribution of residuals makes a lot of sense: quality control. Ideally, you should have performed quality control before submitting your regression results for publication. If you didn't (or didn't report your quality-control results), then a reviewer could fairly ask you to evaluate whether "outliers" might affect your results and conclusions.

How to evaluate the residual distribution

Although raw residuals are useful in many respects, they don't necessarily tell you how much an "outlier" is affecting your results. For example, an observation with predictor variables that are outliers can greatly influence (in a technical sense) the model's coefficient estimates. In the extreme, such an influential observation might force the regression estimate essentially through itself. The residual for that observation will then be very small.
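A toy illustration of this self-masking (my own sketch, not from the original answer): a single high-leverage point drags the fitted line toward itself, so its raw residual looks innocent even though it dominates the fit.

    set.seed(1)
    x <- c(rnorm(22), 10)          # one predictor value far from the rest
    y <- c(rnorm(22), 5)           # a response that does not follow the (null) trend
    fit <- lm(y ~ x)

    tail(resid(fit), 1)            # raw residual of the odd point: small
    tail(hatvalues(fit), 1)        # its leverage: large
    tail(cooks.distance(fit), 1)   # its influence on the fit: dominant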

Evaluating residuals from regression models should thus take both the raw residuals and the associated predictor variables into account. That's done by correcting each residual for the leverage of its associated observation. Terminology isn't always consistent (see this page), but the R quality-control plots for Scale-Location, Quantile-Quantile, and Residual-Leverage use "standardized" residuals "which have identical variance (under the hypothesis)." Starting with the raw residuals $R_i$ and the model's estimate of error variance $s^2$, they are defined as:

$R_i / (s \sqrt{1 - h_{ii}})$, where the ‘leverages’ $h_{ii}$ are the diagonal entries of the hat matrix (influence()$hat in R; see also hat), and where the Residual-Leverage plot uses the standardized Pearson residuals (residuals.glm with type = "pearson") for $R_i$.
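One can check this definition directly in R (a sketch, assuming a fitted lm object fit):

    h  <- hatvalues(fit)                  # leverages h_ii
    s  <- summary(fit)$sigma              # estimated error standard deviation
    rs <- resid(fit) / (s * sqrt(1 - h))  # standardized residuals by hand
    all.equal(rs, rstandard(fit))         # TRUE: matches R's built-in rstandard()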

That might argue for evaluating the distribution of "standardized" residuals, but what's really important is how much any "outliers" matter. Grubbs' test is not very useful here even if you use those residuals, as it only tests the overall distribution and is typically designed for one-at-a-time outlier removal. Quality control plots are much more informative than a search for "outliers" in the overall distribution.
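In base R, those plots are a single call on the fitted model (assuming an lm object fit):

    par(mfrow = c(2, 2))  # arrange the four default diagnostic panels
    plot(fit)             # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage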

My suggestion: show and discuss the implications of quality-control plots rather than restricting yourself to evaluating the overall distribution of any type of residuals. Ideally, that should assure you, your research director, and your reviewer that your model is adequate. If it doesn't, then consider types of regression other than ordinary least squares, as Peter Flom suggests.

$\endgroup$
