5
$\begingroup$

The problem is to determine whether a patient referred to the clinic is hypothyroid. Therefore three classes are built: normal (not hypothyroid), hyperfunction, and subnormal functioning. Because 92 percent of the patients are not hyperthyroid, a good classifier must be significantly better than 92%.

This is from a dataset I found on thyroid disease.

$\endgroup$

3 Answers

13
$\begingroup$

Imagine that you are the boss and hire a data scientist at a high salary to build a model that you hope will achieve a high accuracy score.$^{\dagger}$ She comes back to you reporting $92\%$ accuracy. “Wow,” you think. “That sounds like an A in school! Great job! You’re worth every penny of your salary!”

Then you realize that you could have gotten that same $92\%$ accuracy by predicting the dominant outcome every time. At this point, you realize that your data scientist has not accomplished anything beyond what you could have accomplished, and you no longer feel that she is worth the high salary you pay her.

In your case, since you know that $92\%$ of the patients are not hyperthyroid, you could get a classifier with $92\%$ accuracy just by predicting “not hyperthyroid” every time.

You’re always allowed to—and I would argue should be encouraged to—compare to some kind of baseline model.
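For concreteness, here is a minimal sketch of such a baseline comparison in Python with scikit-learn. The simulated data, the logistic regression standing in for the data scientist's model, and all variable names are illustrative assumptions, not part of the original problem:

```python
# Minimal sketch: compare a fitted model against a majority-class baseline.
# The simulated data and LogisticRegression are placeholders.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simulated binary data with a 92/8 class imbalance, mimicking the setting.
X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the majority class ("not hyperthyroid").
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Stand-in for whatever model the data scientist built.
model = LogisticRegression().fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
```

If the model's accuracy does not clear the baseline's, it has added nothing that the trivial predictor did not already provide.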

An issue that might trip you up is that there are three categories, yet the task is to classify patients as hyperthyroid or not, which is binary. Whether these categories should be combined or dealt with separately really warrants a distinct question, however.

$^{\dagger}$A critical aspect of this is the assumption that accuracy is the metric of interest. I am not sold on this. First, any threshold-based metric is known to have issues. Second, even if you are in a position where you must make discrete classifications instead of predicting tendencies as the link discusses, the costs of wrong decisions need not be equal. I find that to be highly plausible here, and you might be willing to sacrifice a bit of accuracy for gains in sensitivity or specificity. The idea of comparing to a baseline still applies to such a performance metric, though. If your data scientist cannot make a model that performs better than your naïve model that classifies everyone the same way, she probably hasn’t helped you that much, no matter how much the accuracy score looks like an A in school.

Related.

$\endgroup$
8
  • $\begingroup$ As always, note the issues with accuracy as a performance metric. The included link contains resources that explain this in more detail. $\endgroup$ Commented Nov 17, 2022 at 0:58
  • 3
    $\begingroup$ +1, although I disagree that a classifier needs to be better than 92% (not that you say it does; you just explained the claim). It also depends on the relative costs of false negatives and false positives. If the boss uses their simple approach of classifying everything as negative, then they will make a 100% error on all the 8% positive cases, which is, I believe, disastrous. $\endgroup$ Commented Nov 17, 2022 at 8:27
  • $\begingroup$ I remember there was a question about this somewhere during the last few months. $\endgroup$ Commented Nov 17, 2022 at 8:29
  • $\begingroup$ The related questions are: stats.stackexchange.com/questions/552420 and stats.stackexchange.com/questions/550362 (I also notice that I forgot to write the answers to those questions that I had in my head) $\endgroup$ Commented Nov 17, 2022 at 8:36
  • 1
    $\begingroup$ I feel the footnote takes care of that. If your only concern is accuracy, which the problem seems to imply, then the modeler must be able to beat $92\%$, though I am with you that accuracy might not be the best measure here. @SextusEmpiricus $\endgroup$ Commented Nov 17, 2022 at 15:02
7
$\begingroup$

a good classifier must be significantly better than 92%.

There are two issues here:

  • Where does this statement come from?

    It is not so clear what is meant by the 92%, but it seems to relate to accuracy: $$ \text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions}} $$ This 92% accuracy can already be achieved with a naive classifier that assigns every patient to the negative category. For the 92% of patients who are negative it will be right, and for the 8% of patients who are positive it will be wrong.

But also

  • Is this statement right?

    I disagree. The classifier does not need to be better than 92%.

    Imagine the following situation:

    Your test could be 100% sensitive: it correctly flags all 8% of people who truly have hyperthyroid disease. But suppose it also accidentally flags another 9% of people as having hyperthyroid disease when they actually do not. Then you make a mistake in 9% of the cases and the accuracy is 91%, which is less than 92%.

    Did you do badly? (The arithmetic is worked out below.)
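To make the arithmetic explicit, here is that scenario per 100 patients, as a short Python sketch; the counts come from the hypothetical above, not from the actual thyroid data:

```python
# Confusion-matrix arithmetic for the scenario above, per 100 patients.
# The counts are from this answer's hypothetical, not the thyroid data.
tp = 8                    # all 8 sick patients are detected (100% sensitivity)
fn = 0                    # no sick patient is missed
fp = 9                    # 9 healthy patients are flagged anyway
tn = 100 - tp - fn - fp   # the remaining 83 healthy patients

accuracy    = (tp + tn) / 100   # 0.91 -- below the naive 0.92
sensitivity = tp / (tp + fn)    # 1.00 -- every sick patient is found
specificity = tn / (tn + fp)    # ~0.90

# The naive "everyone is negative" classifier, for comparison:
naive_accuracy    = 92 / 100    # 0.92 -- higher accuracy...
naive_sensitivity = 0 / 8       # 0.00 -- ...but it finds no sick patients
```

The naive classifier wins on accuracy, yet it finds none of the sick patients; the "worse" 91% classifier finds all of them.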

$\endgroup$
3
  • $\begingroup$ The last paragraph, "Imagine your test could detect all 8% of the people with hyperthyroid disease, but it also accidentally classifies 9% of the people with hyperthyroid disease that actually do not have hyperthyroid disease. Then you make a mistake in 9% of the cases and the accuracy is 91%, which is less than 92%.", is very enlightening and I would have upvoted an answer that contained just that. $\endgroup$ Commented Nov 17, 2022 at 10:32
  • $\begingroup$ I like this answer, and only want to comment that there's still more complexity beyond this. For example, consider the benefits of true positives vs the consequences of false positives. For some hypothetical disease, say a treatment buys 5 years of life on avg, but for healthy people, it costs 10 yrs of life on avg. In that case, you might want the true positive rate to be at least twice the false positive rate. Or you just might want to consider the consequences of potential malpractice lawsuits, in which case you might want a very low false positive rate $\endgroup$ Commented Nov 17, 2022 at 14:47
  • $\begingroup$ @anjama indeed some sort of cost function needs to be involved. I commented about that to the other answer as well. $\endgroup$ Commented Nov 17, 2022 at 14:49
3
$\begingroup$

Since 92% of patients belong to a single class, you can build a classifier that just predicts the majority class for every sample and get 92% accuracy. This is of course not a terribly interesting classifier: it completely ignores the individual patients and makes its prediction from the population as a whole, whereas a classifier is generally expected to make a sample-level prediction from some set of input features. Still, in cases where accuracy is the right measure to optimize, the majority-class classifier can be a useful baseline comparator, since it represents an accuracy benchmark achievable with the simplest possible model.

That said, I disagree that any useful classifier must have greater accuracy than the majority-class prevalence. Accuracy is often not the best measure of classifier performance, particularly in cases where the "cost" of errors is not equal between false positives and false negatives. Accuracy treats all misclassification errors as the same, but there can be very real differences in the consequences of making false positive versus false negative errors.

Imagine we are building a classifier for a medical screening test for a serious, fast-progressing, but relatively rare disease. Here we want a test with very high sensitivity (we cannot afford to miss patients who truly have the disease, since they will die if it goes untreated), and we are willing to sacrifice specificity to achieve it (we accept falsely diagnosing some people and sending them for a follow-up test). That means we want very high accuracy in the small disease-positive population (few false negatives) but can tolerate lower accuracy in the large disease-negative population (somewhat more false positives). The net accuracy may end up below the majority-class prevalence, yet the overall "cost" of misclassification is lower than for classifiers with higher accuracy, because not all errors are equivalent.
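To put rough numbers on that trade-off, here is a small sketch with made-up costs; none of these figures come from the thyroid data, they only illustrate how a lower-accuracy classifier can carry a lower total cost:

```python
# Hypothetical costs: a missed case (false negative) costs 100 units
# (an untreated, fast-progressing disease); a false positive costs 1 unit
# (an unnecessary follow-up test). These numbers are invented.
COST_FN, COST_FP = 100, 1

def total_cost(fn, fp):
    """Total misclassification cost for error counts per 100 patients."""
    return fn * COST_FN + fp * COST_FP

# Majority-class classifier: 92% accuracy, but it misses all 8 sick patients.
print("majority-class cost:  ", total_cost(fn=8, fp=0))   # 800

# High-sensitivity classifier: 91% accuracy, 9 false positives, no misses.
print("high-sensitivity cost:", total_cost(fn=0, fp=9))   # 9
```

Under these (hypothetical) costs, the classifier with the lower accuracy is far cheaper overall, which is exactly the point: accuracy alone cannot capture asymmetric consequences.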

$\endgroup$
