I want do recreate ROC curve manually on my dataset and compare it to roc function from pROC package in R. I'm using dataset on customer churn telco.csv from Kaggle. Data can be found here: https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download.
I import data, and change column Churn to factor variable with levels 1,0.
telco %>% mutate(Churn = ifelse(Churn == "Yes",1,0)) -> telco
factor(telco$Churn, levels = c(1,0)) -> telco$Churn
For the sake of simplicity, I run logistic regression with only one explanatory variable MonthlyCharges.
glm(Churn ~ MonthlyCharges, data = telco, family = "binomial") -> model
Now, that I have a model, I'll use roc function from pROC package, and plot the roc curve:
roc(telco$Churn, predict(model,telco, type = "response")) -> roc_curve
par(pty = "s")
plot(roc_curve)
Here is the result:
Now, I want to recreate this result manually. For defined thresholds I'll plot TP rate versus FP rate, and calculate sensitivity and specificity using yardstick package.
model_data <- list()
thrs <- seq(0,1,0.01)
sens <- NULL
FP_rate <- NULL
for (i in seq_along(thrs)) {
model_data_i <- augment(model, type.predict = "response") %>%
mutate(pred = ifelse(.fitted > thrs[i], 1, 0)) %>%
mutate(pred = factor(pred, levels = c(1, 0)))
model_data[[i]] <- model_data_i
sens[i] <- yardstick::sensitivity(model_data_i, Churn, pred) %>% pull(.estimate)
FP_rate[i] <- 1 - yardstick::specificity(model_data_i, Churn, pred) %>% pull(.estimate)
}
tibble(sens,FP_rate) -> roc_data
ggplot(roc_data, aes(FP_rate, sens)) + geom_line(size = 1.5, color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
coord_fixed(ratio = 1) +
theme_minimal()
Here is the result:
These two graph look inverse, and they provide significantly different AUC values. When I plot sens on x-axis and FP rate on y-axis, the graphs are the same, but then it doesn't make sense to me.
Also, when I look at ratio of TP and FP rates, there are more times where this ratio is below 1, which give more evidence to my manually calculated ROC curve.
tibble(sens,FP_rate) %>% mutate(ratio = sens/FP_rate) -> roc_data
Even if I use thresholds using roc_curve$thresholds I get the same result.
Where am I making a mistake?

