Statsbook

One-Hot Encoding of Categorical Variables

Some machine learning software requires categorical data to be converted to numerical data. This section describes how to do this with one-hot encoding.

Load the following small data frame:

shoulders.rda

To look at the data, just enter:

shoulders
          shoulder fu score
1 hemiarthroplasty 12    90
2 hemiarthroplasty 10    80
3              tsr  9    95
4 hemiarthroplasty  8    89
5              tsr 10    70

The structure of the data frame can be seen with:

str(shoulders)
'data.frame':	5 obs. of  3 variables:
 $ shoulder: Factor w/ 2 levels "hemiarthroplasty",..: 1 1 2 1 2
 $ fu      : num  12 10 9 8 10
 $ score   : num  90 80 95 89 70
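If the shoulders.rda file is not at hand, the same data frame can be rebuilt by hand. The sketch below copies the values from the printed output above; the commented-out load() call assumes the file sits in the working directory.

```r
# Option 1 (assumption: shoulders.rda is in the current working directory):
# load("shoulders.rda")

# Option 2: rebuild the data frame from the values printed above.
shoulders <- data.frame(
  shoulder = factor(c("hemiarthroplasty", "hemiarthroplasty", "tsr",
                      "hemiarthroplasty", "tsr")),
  fu    = c(12, 10, 9, 8, 10),   # numerical
  score = c(90, 80, 95, 89, 70)  # numerical
)

# Check that shoulder is a factor with two levels and the rest are numeric.
str(shoulders)
```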

The first variable is categorical (a factor) and the other two are numerical. The aim is to convert the categorical variable to numerical variables with one-hot encoding. The vtreat package should be installed for the encoding and the dplyr package to make filtering easier. Load the packages:

library(vtreat) 
library(dplyr)

Create a vector that contains the names of the variables to be included in the final data frame (here all three):

vars <- c("shoulder", "fu", "score")
vars
[1] "shoulder" "fu" "score" 

Create a treatment plan (called treatplan) with vtreat’s designTreatmentsZ() function, which takes the data frame to be treated (shoulders) as its first argument and the names of the variables to be included in the final data frame (vars) as its second:

treatplan <- designTreatmentsZ(shoulders, vars)
[1] "vtreat 1.6.5 inspecting inputs Mon Jun 30 17:32:20 2025"
[1] "designing treatments Mon Jun 30 17:32:20 2025"
[1] " have initial level statistics Mon Jun 30 17:32:20 2025"
[1] " scoring treatments Mon Jun 30 17:32:20 2025"
[1] "have treatment plan Mon Jun 30 17:32:20 2025"
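The progress messages can be silenced, and the treatment types limited when the plan is designed, as in the following sketch. It assumes that the verbose and codeRestriction arguments of designTreatmentsZ() are available in the installed vtreat version; treatplan_quiet is a name chosen here for illustration only.

```r
library(vtreat)

# Example data rebuilt from the values printed earlier in this section.
shoulders <- data.frame(
  shoulder = factor(c("hemiarthroplasty", "hemiarthroplasty", "tsr",
                      "hemiarthroplasty", "tsr")),
  fu    = c(12, 10, 9, 8, 10),
  score = c(90, 80, 95, 89, 70)
)
vars <- c("shoulder", "fu", "score")

# verbose = FALSE suppresses the progress messages;
# codeRestriction asks vtreat to build only "clean" and "lev" treatments.
treatplan_quiet <- designTreatmentsZ(shoulders, vars,
                                     codeRestriction = c("clean", "lev"),
                                     verbose = FALSE)

# Inspect which treated variables the plan contains.
treatplan_quiet$scoreFrame[, c("varName", "origName", "code")]
```

Restricting the codes at this stage is an alternative to filtering the scoreFrame afterwards; either route should lead to the same set of treated variables for this data.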

The treatplan object contains a lot of information. The columns “varName”, “origName” and “code” of its scoreFrame need to be extracted and saved (here as an object called scoreFrame):

scoreFrame <- treatplan$scoreFrame[ ,c("varName", "origName", "code")]

Examine the scoreFrame:

scoreFrame
                          varName origName  code
1                              fu       fu clean
2                           score    score clean
3 shoulder_lev_x_hemiarthroplasty shoulder   lev
4              shoulder_lev_x_tsr shoulder   lev

The first (fu) and second (score) rows are the cleaned up fu and score numerical variables (code “clean”). The third row (shoulder_lev_x_hemiarthroplasty) is an indicator for the first level (hemiarthroplasty) of the categorical variable (shoulder) and the fourth row (shoulder_lev_x_tsr) is an indicator for its second level (tsr); both have the code “lev”.

We only want the rows with codes “clean” or “lev” (here all, but this is not always the case). Save them as a new object called new_variables:

new_variables <- dplyr::filter(scoreFrame, code %in% c("clean", "lev"))
new_variables
                          varName origName  code
1                              fu       fu clean
2                           score    score clean
3 shoulder_lev_x_hemiarthroplasty shoulder   lev
4              shoulder_lev_x_tsr shoulder   lev
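The same row filter can be written in base R without dplyr. The sketch below uses a hand-built copy of the scoreFrame so that it runs on its own. The extra shoulder_catP row is hypothetical: it is added only to show a row with a different code being dropped. The names scoreFrame_demo and new_variables_base are likewise chosen for illustration.

```r
# Hand-built copy of the scoreFrame printed above, plus one hypothetical
# row (shoulder_catP) that is NOT part of the output shown in this section.
scoreFrame_demo <- data.frame(
  varName  = c("fu", "score", "shoulder_lev_x_hemiarthroplasty",
               "shoulder_lev_x_tsr", "shoulder_catP"),
  origName = c("fu", "score", "shoulder", "shoulder", "shoulder"),
  code     = c("clean", "clean", "lev", "lev", "catP")
)

# Logical vector: TRUE for rows whose code is "clean" or "lev".
keep <- scoreFrame_demo$code %in% c("clean", "lev")

# Base R equivalent of dplyr::filter(scoreFrame_demo, code %in% c("clean", "lev")).
new_variables_base <- scoreFrame_demo[keep, ]
new_variables_base$varName
```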

Create a new data frame that contains the treated data and examine it:

shoulder_treat <- prepare(treatplan, shoulders, varRestriction = new_variables$varName)
shoulder_treat
  fu score shoulder_lev_x_hemiarthroplasty shoulder_lev_x_tsr
1 12    90                               1                  0
2 10    80                               1                  0
3  9    95                               0                  1
4  8    89                               1                  0
5 10    70                               0                  1

The structure can be examined with the str() function:

str(shoulder_treat)
'data.frame':	5 obs. of  4 variables:
 $ fu                             : num  12 10 9 8 10
 $ score                          : num  90 80 95 89 70
 $ shoulder_lev_x_hemiarthroplasty: num  1 1 0 1 0
 $ shoulder_lev_x_tsr             : num  0 0 1 0 1

The “shoulder” categorical variable has been recoded into new binary (0/1) variables, one for each category.
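As a cross-check, a similar indicator matrix can be produced with base R's model.matrix(). This is a different technique from vtreat and is shown only for comparison; the column names it generates differ from vtreat's, and the object name onehot is chosen for illustration.

```r
# Example data rebuilt from the values printed earlier in this section.
shoulders <- data.frame(
  shoulder = factor(c("hemiarthroplasty", "hemiarthroplasty", "tsr",
                      "hemiarthroplasty", "tsr")),
  fu    = c(12, 10, 9, 8, 10),
  score = c(90, 80, 95, 89, 70)
)

# "- 1" removes the intercept so that every factor level gets its own column.
onehot <- model.matrix(~ shoulder - 1, data = shoulders)
onehot

# Each row has exactly one 1 across the indicator columns.
rowSums(onehot)
```

Each row sums to 1 because every observation belongs to exactly one level of shoulder, which is the defining property of a one-hot encoding.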