$\begingroup$

I'm trying to do one-hot encoding (OHC) in R to convert categorical data into numerical data. However, R's caret package requires factors to have more than 2 levels. Any idea how to get around this? I've searched and haven't found a solution. I could use label encoding, for instance, but that would defeat the whole purpose of OHC. Thanks in advance.

$\endgroup$
  • $\begingroup$ Does the caret::dummyVars function not do what you need? $\endgroup$ Commented Nov 24, 2018 at 16:28
  • $\begingroup$ It doesn't. I have factors with only 2 levels and some with only 1 level. The dummyVars function requires more than 2 levels. I did not want to discard these columns. $\endgroup$ Commented Nov 24, 2018 at 17:05
  • $\begingroup$ Just recode as 0/1 numeric yourself. $\endgroup$ Commented Nov 24, 2018 at 18:02

1 Answer

$\begingroup$

One-hot encoding is commonly used to pre-process data as input to machine learning algorithms. For factors with more than 2 levels, this involves creating one or more dummy variables. If a factor has only 2 levels, then no dummy variables are needed; indeed, it may already be one-hot encoded. Just check the levels (for example, in R, use levels(varname)). If they are not 0 and 1, then just change them to 0 and 1 and you should be good to go. An example in R:

> x <- factor(c("alpha","beta","alpha","beta","alpha","beta"))
> length(x)
[1] 6

> x
[1] alpha beta  alpha beta  alpha beta 
Levels: alpha beta

> levels(x) <- c(0,1)
> x
[1] 0 1 0 1 0 1
Levels: 0 1
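Note that after relabelling, x is still a factor whose labels happen to be "0" and "1". If a genuinely numeric column is needed, a minimal base-R sketch (the names x_num and x_beta are only illustrative) could look like this:

```r
x <- factor(c("alpha", "beta", "alpha", "beta", "alpha", "beta"))

# as.integer() on a factor returns the internal level codes (1, 2, ...),
# so subtracting 1 maps the first level to 0 and the second level to 1.
x_num <- as.integer(x) - 1L

# Equivalent here, but makes explicit which level becomes 1.
x_beta <- as.integer(x == "beta")
```

Either form gives an integer vector rather than a factor; which level maps to 1 follows the order of levels(x) in the first form, so it is worth checking that order first.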

You also mentioned factors with only 1 level. Such factors are constants rather than variables, so they cannot affect predictions and could be removed from the dataset. If it were me, though, I would investigate why a single-level factor is there in the first place.
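If many columns are involved, the same idea can be applied across a whole data frame. The following is only a sketch in base R, using a made-up data frame df with hypothetical column names; it drops single-level factors, recodes two-level factors to integer 0/1, and leaves factors with more than 2 levels untouched for separate dummy coding:

```r
# Hypothetical example data: a 2-level factor, a 1-level factor,
# a 3-level factor, and a numeric column.
df <- data.frame(
  sex    = factor(c("f", "m", "f", "m")),
  site   = factor(c("a", "a", "a", "a")),
  colour = factor(c("red", "green", "blue", "red")),
  y      = c(1.2, 3.4, 5.6, 7.8)
)

# Number of levels per column; nlevels() gives 0 for non-factor columns.
n_lev <- vapply(df, nlevels, integer(1))

constant_cols <- names(df)[n_lev == 1]  # single-level factors
binary_cols   <- names(df)[n_lev == 2]  # two-level factors

# Drop the constant factors (after checking why they exist).
df <- df[setdiff(names(df), constant_cols)]

# Recode each two-level factor to integer 0/1 via its level codes.
df[binary_cols] <- lapply(df[binary_cols], function(x) as.integer(x) - 1L)
```

This counts declared levels, not observed values, so a factor with unused levels may need droplevels() applied beforehand.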

$\endgroup$
  • $\begingroup$ That solved it. How would I automate it if many factors have only two levels, though? $\endgroup$ Commented Nov 24, 2018 at 17:29
  • $\begingroup$ Something like this: oldData[the.vars] <- lapply(oldData[the.vars], function(x) {levels(x) <- c(0,1); x}) where the.vars are the column numbers. The trailing x is needed so that the function returns the recoded factor rather than the assigned value c(0,1). $\endgroup$ Commented Nov 24, 2018 at 18:07
