I'm trying to do one-hot encoding (OHC) in R to convert categorical data into numerical data. However, R's caret package requires factors with more than 2 levels. Any idea how to get around this? I've searched but haven't found a solution. I could use label encoding, for instance, but that would defeat the whole purpose of OHC. Thanks in advance.
1 Answer
One-hot encoding is commonly used to pre-process data as input to machine learning algorithms. For factors with more than 2 levels, this involves creating one or more dummy variables. If a factor has only 2 levels, then no dummy variables are needed; indeed, it may already be one-hot encoded. Just check the levels (for example, in R, use levels(varname)). If they are not 0 and 1, change them to 0 and 1 and you should be good to go. An example in R:
> x <- factor(c("alpha","beta","alpha","beta","alpha","beta"))
> length(x)
[1] 6
> x
[1] alpha beta alpha beta alpha beta
Levels: alpha beta
> levels(x) <- c(0,1)
> x
[1] 0 1 0 1 0 1
Levels: 0 1
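Note that after relabelling, x is still a factor whose labels happen to be "0" and "1". If a model needs a genuinely numeric column, one possible follow-up step (a minimal sketch, not part of the original answer; the name x_num is made up here) is:

```r
# Same example factor as above
x <- factor(c("alpha", "beta", "alpha", "beta", "alpha", "beta"))

# Relabel the two levels; x is still a factor, now with labels "0" and "1"
levels(x) <- c(0, 1)

# as.numeric(x) alone would return the internal codes 1 and 2,
# so convert via character to recover the 0/1 labels as numbers
x_num <- as.numeric(as.character(x))

print(is.factor(x))
print(x_num)
```

Going through as.character() first is the usual way to avoid the internal-codes pitfall when a factor's labels are themselves numbers.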
You also mentioned factors with only 1 level. Such factors are not variables and could be removed from the dataset (they are constants and will not affect predictions), but if it were me I would investigate why a single-level factor is there in the first place.
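To find such constant factors before deciding what to do with them, something along these lines might help. This is only a sketch on a made-up data frame; the names df, group, site and colour are hypothetical:

```r
# Hypothetical toy data frame; column names are made up for illustration
df <- data.frame(
  group  = factor(c("a", "b", "a", "b")),
  site   = factor(c("x", "x", "x", "x")),  # only one level: a constant
  colour = factor(c("red", "green", "blue", "red"))
)

# Count the declared levels of every factor column
is_fac   <- vapply(df, is.factor, logical(1))
n_levels <- vapply(df[is_fac], nlevels, integer(1))
print(n_levels)

# Names of the single-level factors, to inspect before dropping anything
constant_cols <- names(n_levels)[n_levels == 1]
print(constant_cols)

# Drop them only once you understand why they are constant
df_clean <- df[setdiff(names(df), constant_cols)]
```

nlevels() counts declared levels, including unused ones, so it may be worth calling droplevels() on the data first if the factors were subset earlier.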
That solved it. How would I automate it if many factors have only two levels, though? – NelsonGon, Nov 24, 2018 at 17:29
Something like this:
newData <- oldData
newData[the.vars] <- lapply(oldData[the.vars], function(x) { levels(x) <- c(0, 1); x })
where the.vars are the column numbers of the two-level factors. – Robert Long, Nov 24, 2018 at 18:07
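If listing the column numbers by hand is impractical, the two-level factors can also be detected automatically. The following is a rough sketch on invented data (the columns sex, smoker, region and age are hypothetical), and it recodes to integer 0/1 rather than relabelling the factor levels:

```r
# Hypothetical data frame; column names are made up for illustration
oldData <- data.frame(
  sex    = factor(c("f", "m", "m", "f")),
  smoker = factor(c("no", "yes", "no", "no")),
  region = factor(c("north", "south", "east", "north")),
  age    = c(31, 45, 27, 52)
)

# Flag the factor columns that have exactly two levels
two_level <- vapply(
  oldData,
  function(col) is.factor(col) && nlevels(col) == 2,
  logical(1)
)

# Recode each flagged column to integer 0/1
# (first level -> 0, second level -> 1); other columns are left untouched
newData <- oldData
newData[two_level] <- lapply(
  oldData[two_level],
  function(col) as.integer(col) - 1L
)

str(newData)
```

Which level maps to 0 depends on the level order (alphabetical by default), so it may be worth checking levels() on each column before relying on the coding.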
Does the caret::dummyVars function not do what you need?
The dummyVars function requires factors with more than 2 levels. I did not want to discard these columns.