
I am a little bit lost in tidymodels. I have some data from topic modeling:

  • prevalent_topic: a factor variable holding the most prevalent topic, ranging from "Topic_1" to "Topic_5"
  • value1 and value2: two numeric variables used as predictors

I want to predict/classify the prevalent_topic based on value1 and value2:

prevalent_topic ~ value1 + value2

I started with multiclass classification using glmnet and nnet with tidymodels. Now I want to try "one-vs-rest" binary classification and created a recipe to begin with:

dfFT_rec <- recipe( ~ value1 + value2, data = dfFT_train) %>%
  step_dummy(prevalent_topic, one_hot = TRUE) %>%
  step_normalize(c(value1, value2)) 

The second step creates dummy variables that I would like to use as outcomes, e.g. "prevalent_topic_Topic_1", "prevalent_topic_Topic_2", ...

I tried to update the recipe's formula to "prevalent_topic_Topic_1 ~ value1 + value2", but that did not work. I also tried to fit a workflow to my data without specifying the outcome, but only got an error: "logistic_reg() was unable to find an outcome."

Is this possible at all? Or is there a different way to turn an outcome factor variable into dummy-coded outcome variables?

1 Answer


As long as the values in prevalent_topic are mutually exclusive (and are in the normal factor class), you can use multinom_reg() to get a model. Instead of fitting a set of logistic regressions, you can simultaneously model all of your categories.
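For reference, here is a minimal sketch of that multinomial route, assuming dfFT_train holds prevalent_topic, value1, and value2 as described in the question (the object names are just illustrative):

library(tidymodels)

multi_rec <- recipe(prevalent_topic ~ value1 + value2, data = dfFT_train) %>%
  step_normalize(all_numeric_predictors())

multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%   # "glmnet" would also work, but needs a penalty value
  set_mode("classification")

multi_fit <- workflow() %>%
  add_recipe(multi_rec) %>%
  add_model(multi_spec) %>%
  fit(data = dfFT_train)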

If they are not mutually exclusive (like a multiple-choice question), you would probably need to make separate factors and model each one separately. That "multilabel" structure isn't currently supported in tidymodels. You might look at the recipe step step_dummy_multi_choice() (https://recipes.tidymodels.org/reference/step_dummy_multi_choice.html), followed by step_bin2factor() (https://recipes.tidymodels.org/reference/step_bin2factor.html), to make the different outcome columns.
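As a rough illustration of that multilabel route (the data frame df_multilabel and its topic_choice_* columns are hypothetical, not from the question):

library(tidymodels)

# Hypothetical multi-choice data: each row may list several topics,
# spread over factor columns topic_choice_1, topic_choice_2, ...
multilabel_rec <- recipe(~ ., data = df_multilabel) %>%
  # collapse the choice columns into one 0/1 indicator per observed level
  step_dummy_multi_choice(starts_with("topic_choice_")) %>%
  # turn those 0/1 indicators into yes/no factors that could serve as outcomes
  # (assumes the indicator columns keep a "topic_choice" prefix, so the same selector matches)
  step_bin2factor(starts_with("topic_choice_"))

Each resulting yes/no column would then have to be modelled separately, since tidymodels itself has no multilabel mode.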


1 Comment

Thanks! Yes, they are mutually exclusive, and I have already tried multinom_reg(). The reason I want to try one-vs-rest binary classification is that the topics are far from equally distributed, with counts ranging from 4 to 5,000. Trying to even that out with the themis package is not necessarily the better approach. I will check out `step_dummy_multi_choice()`, but perhaps dummy-coding the outcome in advance of the ML workflow might be another approach.
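For what it's worth, a sketch of that "dummy-code the outcome in advance" idea, here for a single one-vs-rest model (Topic_1 vs. the rest); is_topic_1 is a made-up column name, and the same pattern would be repeated for the other topics:

library(tidymodels)

# Binary outcome: is the prevalent topic "Topic_1" or not?
df_ovr <- dfFT_train %>%
  mutate(is_topic_1 = factor(ifelse(prevalent_topic == "Topic_1", "yes", "no"),
                             levels = c("yes", "no")))

ovr_rec <- recipe(is_topic_1 ~ value1 + value2, data = df_ovr) %>%
  step_normalize(all_numeric_predictors())

ovr_fit <- workflow() %>%
  add_recipe(ovr_rec) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = df_ovr)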
