Some machine learning software requires categorical data to be converted to numerical data. This section describes how to do this with one-hot encoding.
Load the following small data frame:
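The code that creates the data frame is not shown at this point. A minimal sketch that reproduces the data displayed in the following steps, assuming the shoulder column is stored as a factor and the other two columns as plain numeric vectors:

```r
# Reconstruction of the example data (an assumption based on the printed output)
shoulders <- data.frame(
  shoulder = factor(c("hemiarthroplasty", "hemiarthroplasty", "tsr",
                      "hemiarthroplasty", "tsr")),
  fu    = c(12, 10, 9, 8, 10),    # follow-up values as printed
  score = c(90, 80, 95, 89, 70)   # score values as printed
)
```

Wrapping the column in factor() makes the result independent of the R version, since the default for stringsAsFactors changed in R 4.0.0.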
To look at the data, just enter:
shoulders
          shoulder fu score
1 hemiarthroplasty 12    90
2 hemiarthroplasty 10    80
3              tsr  9    95
4 hemiarthroplasty  8    89
5              tsr 10    70
The structure of the data frame can be seen with:
str(shoulders)
'data.frame': 5 obs. of 3 variables:
$ shoulder: Factor w/ 2 levels "hemiarthroplasty",..: 1 1 2 1 2
$ fu      : num 12 10 9 8 10
$ score   : num 90 80 95 89 70
The first variable is categorical (a factor) and the other two are numerical. The aim is to convert the categorical variable into numerical variables with one-hot encoding. The vtreat [1] package should be installed for the encoding and dplyr [2] to make filtering easier. Load the libraries:
library(vtreat)
library(dplyr)
Create a vector that contains the names of the variables to be included in the final data frame (here all three):
vars <- c("shoulder", "fu", "score")
vars
[1] "shoulder" "fu" "score"
The code can be copied and pasted into the R console. If the quotation marks have been converted to typographic (curly) quotes, re-enter them as straight quotes before executing the code.
Create a treatment plan (called treatplan) with vtreat’s designTreatmentsZ() function that takes as first argument the data frame to be treated (shoulders) and as second argument the variables to be included in the final data frame (vars):
treatplan <- designTreatmentsZ(shoulders, vars)
[1] "designing treatments Sun Jan 7 18:08:45 2018"
[1] "designing treatments Sun Jan 7 18:08:45 2018"
[1] " have level statistics Sun Jan 7 18:08:45 2018"
[1] "design var shoulder Sun Jan 7 18:08:45 2018"
[1] "design var fu Sun Jan 7 18:08:45 2018"
[1] "design var score Sun Jan 7 18:08:45 2018"
[1] " scoring treatments Sun Jan 7 18:08:45 2018"
[1] "have treatment plan Sun Jan 7 18:08:45 2018"
The treatplan object contains a lot of information. The columns “varName”, “origName” and “code” of its scoreFrame need to be extracted and saved (here as an object called scoreFrame):
scoreFrame <- treatplan$scoreFrame[, c("varName", "origName", "code")]
Examine the scoreFrame:
scoreFrame
                          varName origName  code
1 shoulder_lev_x.hemiarthroplasty shoulder   lev
2              shoulder_lev_x.tsr shoulder   lev
3                        fu_clean       fu clean
4                     score_clean    score clean
The first variable (shoulder_lev_x.hemiarthroplasty) is an indicator for the first level (hemiarthroplasty) of the categorical variable (shoulder), and the second variable (shoulder_lev_x.tsr) is an indicator for its second level (tsr). The third (fu_clean) and fourth (score_clean) variables are the cleaned-up fu and score numerical variables.
We only want the rows with codes “clean” or “lev” (here all, but this is not always the case). Save them as a new object called new_variables:
new_variables <- filter(scoreFrame, code %in% c("clean", "lev"))
new_variables
                          varName origName  code
1 shoulder_lev_x.hemiarthroplasty shoulder   lev
2              shoulder_lev_x.tsr shoulder   lev
3                        fu_clean       fu clean
4                     score_clean    score clean
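To see why the filtering step matters when a treatment plan produces other kinds of variables, here is a small sketch in base R. The demo data frame is hypothetical: its first four rows are copied from the output above, and the fifth row (a "catP" variable named shoulder_catP) is invented for illustration only. The base R subsetting is used here as a stand-in for dplyr's filter() so that the example runs without extra packages.

```r
# Hypothetical stand-in for a scoreFrame; the last row is invented for illustration
score_frame_demo <- data.frame(
  varName  = c("shoulder_lev_x.hemiarthroplasty", "shoulder_lev_x.tsr",
               "fu_clean", "score_clean", "shoulder_catP"),
  origName = c("shoulder", "shoulder", "fu", "score", "shoulder"),
  code     = c("lev", "lev", "clean", "clean", "catP"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose code is "clean" or "lev"
kept <- score_frame_demo[score_frame_demo$code %in% c("clean", "lev"), ]
kept$varName
```

Only the four "clean" and "lev" variables remain in kept; the invented fifth row is dropped.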
Create a new data frame that contains the treated data and examine it:
shoulder_treat <- prepare(treatplan, shoulders, varRestriction = new_variables$varName)
shoulder_treat
  shoulder_lev_x.hemiarthroplasty shoulder_lev_x.tsr fu_clean score_clean
1                               1                  0       12          90
2                               1                  0       10          80
3                               0                  1        9          95
4                               1                  0        8          89
5                               0                  1       10          70
The structure can be examined with the str() function:
str(shoulder_treat)
'data.frame': 5 obs. of 4 variables:
$ shoulder_lev_x.hemiarthroplasty: num 1 1 0 1 0
$ shoulder_lev_x.tsr             : num 0 0 1 0 1
$ fu_clean                       : num 12 10 9 8 10
$ score_clean                    : num 90 80 95 89 70
The categorical variable “shoulder” has been recoded into binary (0/1) indicator variables, one for each category: a row has 1 in the column for its own category and 0 in the others.
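As a cross-check of what one-hot encoding produces, the same indicator columns can be sketched in base R with model.matrix(). This is a different technique from the vtreat workflow above, not a replacement for it, and the column names it generates differ from vtreat's. The data frame is rebuilt here from the printed values so that the example is self-contained.

```r
# Example data rebuilt from the printed output (assumed to match the original)
shoulders <- data.frame(
  shoulder = factor(c("hemiarthroplasty", "hemiarthroplasty", "tsr",
                      "hemiarthroplasty", "tsr")),
  fu    = c(12, 10, 9, 8, 10),
  score = c(90, 80, 95, 89, 70)
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
indicators <- model.matrix(~ shoulder - 1, data = shoulders)

# Combine the indicator columns with the untouched numerical columns
encoded <- cbind(as.data.frame(indicators), shoulders[, c("fu", "score")])

# Each row should have exactly one indicator set to 1
stopifnot(all(rowSums(indicators) == 1))
encoded
```

The indicator columns are named shoulderhemiarthroplasty and shouldertsr, and they hold the same 0/1 pattern as the two shoulder_lev_x columns in shoulder_treat.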