It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing insights.
Please search our site for examples of boxplots (aka box-and-whisker plots), which graphically display the quartiles. See stats.stackexchange.com/a/262740/919, for instance. – whuber ♦ (Oct 2, 2025)
"I struggle to interpret the values in the context of providing insights." Does it mean that you already have a dataset and its quartiles, and struggle to interpret the values? Can you share the dataset with us? – J-J-J (Oct 2, 2025)
@J-J-J No, I actually do not have a dataset yet. I just do not understand what the quartile values tell us about the data distribution once they are calculated. I cannot seem to wrap my head around it. – Buchi (Oct 3, 2025)
5 Answers
Well, do you think the median is useful? It is a quartile (the middle one). But the median alone only tells you one thing: the point where half the data are below and half are above. That is useful, but it's just one point.
The first and third quartiles can give you an intuitive sense of skew (although not always). They can also give you an intuitive sense of spread: Half of all the data are in between. That's often useful. And it's more intuitive than the median absolute deviation (which is another good measure of spread for skewed distributions).
Of course, deciles give you even more info - but at the cost of having more numbers to look at.
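As a concrete sketch of the point above (numpy is my assumption for tooling, not something the answer specifies), the quartiles of a right-skewed sample show both spread and asymmetry directly: the upper quartile sits farther above the median than the lower quartile sits below it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=10_000)  # a right-skewed sample

q1, med, q3 = np.percentile(x, [25, 50, 75])
print(q1, med, q3)
# Spread: the middle half of the data lies between q1 and q3.
# Skew: for this right-skewed sample, (q3 - med) > (med - q1).
```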
Begin by thinking about what any quantile is - how it's defined. A quartile is a particular kind of quantile, as is a decile or a percentile.
In a population distribution, one way to define a $p$-th quantile is as the lowest value that has at least a proportion $p$ of the population values at or below it. For example, the median is a 0.5-quantile, because at least half the population is at or below it. The first and third quartiles (0.25 and 0.75 quantiles) are defined similarly; they have 1/4 and 3/4 of the population at or below them.
In a continuous distribution you'll get exact fractions below the respective quantiles - that is, exactly a fraction $p$ of the distribution is to the left of the $p$-th quantile - and so in that case the three quartiles would divide the distribution into four equal fractions of one quarter each.
Multiple sample definitions of quantiles exist (check your text for whatever definition it uses) but they are very similar to the typical population definitions.
If the variable is continuous (or discrete but no one value has much of the total probability in it) then about a quarter of the values would fall into each of the four regions delimited by the three quartiles.
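Those multiple sample definitions really do give slightly different answers on small samples. As an illustration (numpy is my assumption for tooling; the `method` argument is a numpy detail, not something from the text above), you can ask for the same sample quartile under several definitions:

```python
import numpy as np

x = np.arange(1, 11)  # the sample 1, 2, ..., 10

# Different textbook definitions of the sample quantile give slightly
# different answers; numpy exposes several via `method`.
for m in ["linear", "lower", "higher", "nearest", "midpoint"]:
    print(m, np.quantile(x, 0.25, method=m))
```

All the definitions agree closely for large samples; the differences only matter in the tails of small ones.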
If you regard a median as in some sense a 'typical' value, in that it generally divides the collection of values into two near equal-sized groups (as long as not many values are exactly at the median), the lower quartile is similarly a 'typical' value in that lower half and the upper quartile is a 'typical' value in the upper half.
They can give a rough sense of 'where' the values are, keeping in mind the limited extent to which a few specific values could summarize a whole distribution - some sense of location and spread.
If the quartiles of adult male height in cm are 170, 175 and 180 respectively, then about a quarter of adult males will be 170 cm or below, about a half at 175 cm or below and about three quarters at 180 cm or below, with the rest being taller than that. About half the population would be above 170 cm but not more than 180 cm. This tells you something about where the values typically tend to be. It doesn't come near to telling you everything, naturally. You don't know how large or small the more extreme values might be and you don't know how the values in between might be distributed.
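To see those "about a quarter" statements numerically, here is a sketch with simulated heights (the normal distribution and its parameters are illustrative assumptions, chosen so the quartiles land near 170, 175 and 180 cm):

```python
import numpy as np

rng = np.random.default_rng(1)
heights = rng.normal(loc=175, scale=7.4, size=100_000)  # simulated heights, cm

q1, med, q3 = np.percentile(heights, [25, 50, 75])
print(np.mean(heights <= q1))   # close to 0.25
print(np.mean(heights <= med))  # close to 0.50
print(np.mean(heights <= q3))   # close to 0.75
```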
You can get a little information about shape from the three quartile values (the quartile skewness and Bowley skewness are defined in terms of the relative distance between the outer quartiles and the median).
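Quartile (Bowley) skewness compares the two halves of the interquartile range. A quick sketch (numpy assumed as tooling):

```python
import numpy as np

def bowley_skewness(x):
    """((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1): positive for right skew."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

rng = np.random.default_rng(2)
print(bowley_skewness(rng.exponential(size=50_000)))  # positive (right-skewed)
print(bowley_skewness(rng.normal(size=50_000)))       # near zero (symmetric)
```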
Mean and standard deviation also tell you something about the location and spread of the distribution, but tell you nothing about shape. If the distribution is very skewed the mean might not seem very 'typical' (e.g. it may be larger than nearly all the values if the distribution is sufficiently right skewed). Similarly, if the distribution is quite skewed, the standard deviation might not seem much like a very typical distance of values from some 'central' value, nor seem much related to your sense of typical distance between pairs of values. They are still a kind of typical value and typical distance respectively, but any such summary of a large set of values down to only a few numbers isn't ideal for every use and every circumstance.
I think you can find a good example here: https://mlwithouttears.com/2024/01/17/conformalized-quantile-regression/comment-page-1/
Short version: estimating a quantile is a (conditional) point-estimation process. It can be useful for understanding potential ranges of outcomes (if you estimate, e.g., an upper and a lower quantile). In the end, if you knew all the quantiles, you could recover the full probability distribution (density) of the outcomes.
In general, one quantile alone (a point estimate) is about as useful as a point estimate of the mean value (which you obtain from mean-squared-error minimization)...
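The link above is about conformalized quantile regression; the idea underlying it is that the $\tau$-quantile is the constant minimizing the pinball (check) loss, just as the mean minimizes squared error. A sketch of that fact (numpy assumed; the grid search is purely for illustration):

```python
import numpy as np

def pinball_loss(y, pred, tau):
    """Average check (pinball) loss of a constant prediction at level tau."""
    d = y - pred
    return np.mean(np.maximum(tau * d, (tau - 1) * d))

rng = np.random.default_rng(3)
y = rng.exponential(scale=1.0, size=100_000)

# The constant minimizing pinball loss at tau = 0.9 is the 0.9-quantile.
grid = np.linspace(0.0, 4.0, 401)
best = grid[np.argmin([pinball_loss(y, c, 0.9) for c in grid])]
print(best, np.quantile(y, 0.9))  # the two values should be close
```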
Well, if you struggle with interpreting quartiles (and quantiles for that matter), do not feel too bad. To be honest, I do find them pretty uninformative, and pretty much never compute/look at them.
The median (50th percentile) may be of some interest, rarely by itself, but sometimes when interpreted together with the mean (it gives a sense of the skewness; but then, I may as well compute the skewness ...).
And with regard to boxplots, they are equally uninformative, if not outright misleading; see e.g.
I’ve Stopped Using Box Plots. Should You? by Nick Desbarats,
Be wary of boxplots, they could be hiding important information! (Reddit).
That last link, towards the end of the post, has graphs of distributions which are widely different, yet all have exactly the same box plots.
And if they have the same boxplots, they have the same quartiles.
My go-to plot for exploratory analysis is the "individual value plot" (a jittered plot of every individual data value),
which you can then decorate with symbols for mean, CI of the mean, and possibly median.
Bottom line: I would not sweat it ...
Your plot has many other names, including dot plot and strip plot. What's tacit is that you're using jittering to shake identical or similar values apart, which in turn doesn't rule out overlap or overplotting; isn't easily reproducible; and obliges readers to average mentally to get an impression of density. You might decide that such problems are trivial or tolerable. – Nick Cox (Oct 3, 2025)
While not wanting to be dogmatic, my own default is some combination of quantile and box plot with extra details allowed or even advised. By convenient coincidence, stats.stackexchange.com/questions/669795/… gives a worked example with four displays side by side. – Nick Cox (Oct 3, 2025)
To understand quartiles, you should look at how they are used.
(1) When thinking about quartiles, we must also consider the interquartile range $IQR = Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles.
Now you can use the IQR to calculate two "fences" for the given data:
$l = Q_1 - 1.5\,IQR$
$u = Q_3 + 1.5\,IQR$
Data points outside the range $(l, u)$ are commonly flagged as potential outliers. Outlier detection with this rule is more robust than rules based on the mean and standard deviation, because the quartiles are not pulled around by the extreme values themselves.
(2) We can turn the coin the other way and use the interquartile range as a measure of the spread / scatter / dispersion / variability of the data points. This measure is more robust than the standard deviation, because extreme values do not enter the calculation and skewness has little effect on the $IQR$.
(3) We can then focus our attention on the middle range $(Q_1, Q_3)$ in cases like advertisement targeting, product marketing, and movie audiences: it covers the central half of the population, giving more "bang for the buck" in general. We might later move beyond that range to target a wider population.
(4) We often want to summarize a large amount of data with a few numbers, hence we calculate summary statistics, among which the quartiles and the interquartile range are well known and widely used.
Made-up example: rather than giving a list of $5,000,000$ numbers (the ages of government employees in a large country), we can more easily say that $50\%$ of the ages are in the range $(21, 57)$.
That is more useful than just saying that the mean age is $36$ or the median age is $40$.
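The fence rule in (1) can be sketched in Python (numpy is an assumed tool here, and the data values are made up for illustration):

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 40])  # 40 looks suspect

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1               # interquartile range
lower = q1 - 1.5 * iqr      # lower fence l
upper = q3 + 1.5 * iqr      # upper fence u

# Candidates to inspect, not automatic deletions
flagged = x[(x < lower) | (x > upper)]
print(flagged)  # [40]
```

As the comments below note, points beyond the fences are candidates for scrutiny, not automatic outliers.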
Using whether data points lie more than 1.5 IQR beyond the quartiles to define outliers has caused more mischief and misunderstanding than it has prevented. While some analysts are obliged to automate outlier detection, ideally that requires statistical and scientific (subject-matter) judgement. – Nick Cox (Oct 3, 2025)
I didn't downvote (although I was tempted, as this answer doesn't add helpfully to others in my view). On Tukey, I am clear, having been an assiduous reader of his work for over 50 years. 1.5 IQR or more beyond the quartiles was a practical (but compromise) rule for which points should be plotted separately on a box plot -- no more than that. It was never for Tukey a way to define outliers. Indeed, by example and by exhortation, he was repeatedly clear that box plots often imply that variables should be analysed on other scales and in conjunction with other variables. – Nick Cox (Oct 3, 2025)
The Wikipedia citation is not reliable or accurate on Tukey's suggestions, let alone his intentions (among other details, talking about tests is woefully inappropriate). Indeed, it comes near exploding itself. How on Earth could 1.5 serve as a universal threshold across different distributions, different sample sizes, and different anything else? The best single source here is Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley. – Nick Cox (Oct 3, 2025)
"Preliminary screening": e.g., these points are candidates to be looked at. Make sure they aren't data entry errors. If they are real data, adjust your analyses accordingly. – Peter Flom (Oct 3, 2025)
@PeterFlom undoubtedly means "before computers became utterly standard". Tukey was a curious combination on computing. He was involved in computers from the 1940s, suggested bit as a technical term, and had a good claim to introducing the term software. Not to mention FFT! But also it seems he did not use computers much from about 1970 on, unless in collaboration. – Nick Cox (Oct 3, 2025)