ROC curve manual calculation vs. pROC package R

Question

I want do recreate ROC curve manually on my dataset and compare it to roc function from pROC package in R. I'm using dataset on customer churn telco.csv from Kaggle. Data can be found here: https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download.

I import data, and change column Churn to factor variable with levels 1,0.

telco %>%  mutate(Churn = ifelse(Churn == "Yes",1,0)) -> telco
factor(telco$Churn, levels = c(1,0)) -> telco$Churn

For the sake of simplicity, I run logistic regression with only one explanatory variable MonthlyCharges.

glm(Churn ~ MonthlyCharges, data = telco, family = "binomial") -> model

Now, that I have a model, I'll use roc function from pROC package, and plot the roc curve:

roc(telco$Churn, predict(model,telco, type = "response")) -> roc_curve
par(pty = "s")
plot(roc_curve)

Here is the result:

Now, I want to recreate this result manually. For defined thresholds I'll plot TP rate versus FP rate, and calculate sensitivity and specificity using yardstick package.

model_data <- list()
thrs <- seq(0,1,0.01)
sens <- NULL
FP_rate <- NULL


for (i in seq_along(thrs)) {
  model_data_i <- augment(model, type.predict = "response") %>%
    mutate(pred = ifelse(.fitted > thrs[i], 1, 0)) %>%
    mutate(pred = factor(pred, levels = c(1, 0)))
  
  model_data[[i]] <- model_data_i
  
  sens[i] <- yardstick::sensitivity(model_data_i, Churn, pred) %>% pull(.estimate)
  FP_rate[i] <- 1 - yardstick::specificity(model_data_i, Churn, pred) %>% pull(.estimate)
}

tibble(sens,FP_rate) ->  roc_data

ggplot(roc_data, aes(FP_rate, sens)) + geom_line(size = 1.5, color = "blue") +
  geom_abline(intercept =  0, slope = 1, linetype = "dashed") +
  coord_fixed(ratio = 1) + 
  theme_minimal()

Here is the result:

These two graph look inverse, and they provide significantly different AUC values. When I plot sens on x-axis and FP rate on y-axis, the graphs are the same, but then it doesn't make sense to me.

Also, when I look at ratio of TP and FP rates, there are more times where this ratio is below 1, which give more evidence to my manually calculated ROC curve.

tibble(sens,FP_rate) %>% mutate(ratio = sens/FP_rate) -> roc_data

Even if I use thresholds using roc_curve$thresholds I get the same result.

Where am I making a mistake?

Dave · Accepted Answer · 2024-07-11 15:58:57Z

2

If the pROC function yields a ROC curve with an area under the curve less than $1/2,$ pROC will flip the labels and recalculate the ROC curve and the area underneath it, and this is the curve that is plotted. I disagree with the decision to make this default behavior, but there is an argument to keep the function from doing this.

pROC::roc(…, direction = "<")

There’s nothing wrong with this kind of calibration step, but I disagree with the default behavior of calibrating automatically and without user knowledge, possibly leading the user into believing that a model is good when it is terrible, even if a second stage of the prediction pipeline (calibration) could lead to good predictions.

edited Jul 11, 2024 at 15:58

answered Feb 16, 2024 at 17:18

Dave

5,4051 gold badge13 silver badges41 bronze badges

$\begingroup$ The pROC package developer has addressed this. $\endgroup$

Dave
– Dave

2024-07-11 15:58:39 +00:00
Commented Jul 11, 2024 at 15:58

Add a comment |

Stack Exchange Network

ROC curve manual calculation vs. pROC package R

1 Answer 1

Hot Network Questions

ROC curve manual calculation vs. pROC package R

1 Answer 1

Related

Hot Network Questions