{"id":2180,"date":"2018-01-04T21:54:09","date_gmt":"2018-01-04T21:54:09","guid":{"rendered":"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=2180"},"modified":"2025-06-30T17:38:00","modified_gmt":"2025-06-30T16:38:00","slug":"one-hot-encoding-of-categorical-variables","status":"publish","type":"page","link":"https:\/\/pcool.dyndns.org\/index.php\/one-hot-encoding-of-categorical-variables\/","title":{"rendered":"One-Hot Encoding of Categorical Variables"},"content":{"rendered":"\n<p>Some machine learning software requires categorical data to be converted to numerical data. This section describes how to do this with one-hot encoding.<\/p>\n\n\n\n<p>Load the following small data frame:<\/p>\n\n\n\n<p><a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/data_files\/shoulders.rda\" target=\"_blank\" rel=\"noreferrer noopener\">shoulders.rda<\/a><\/p>\n\n\n\n<p>To look at the data, just enter:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f20727\" class=\"has-inline-color\">shoulders<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#082bf3\" class=\"has-inline-color\">\n          shoulder fu score\n1 hemiarthroplasty 12    90\n2 hemiarthroplasty 10    80\n3              tsr  9    95\n4 hemiarthroplasty  8    89\n5              tsr 10    70<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The structure of the data frame can be seen with:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>str(shoulders)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>'data.frame': 5 obs. 
of 3 variables:<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ shoulder: Factor w\/ 2 levels \"hemiarthroplasty\",..: 1 1 2 1 2<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ fu : num 12 10 9 8 10<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ score : num 90 80 95 89 70<\/em><\/span><\/code><\/pre>\n\n\n\n<p>The first variable is categorical (factor) and the other two are numerical. The aim is to convert the categorical variable to numerical variables with one-hot encoding. The package vtreat<sup class='sup-ref-note' id='note-zotero-ref-p2180-r1-o1'><a class='sup-ref-note' href='#zotero-ref-p2180-r1'>1<\/a><\/sup> should be <a href=\"https:\/\/pcool.dyndns.org\/index.php\/packages\/\" data-type=\"page\" data-id=\"22\">installed<\/a> for the encoding and dplyr<sup class='sup-ref-note' id='note-zotero-ref-p2180-r2-o1'><a class='sup-ref-note' href='#zotero-ref-p2180-r2'>2<\/a><\/sup> to make filtering easier. Load the packages:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>library(vtreat)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>library(dplyr)<\/em><\/span><\/code><\/pre>\n\n\n\n<p>Create a vector that contains the names of the variables to be included in the final data frame (here all three):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>vars &lt;- c(\"shoulder\", \"fu\", \"score\")<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>vars<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>&#091;1] \"shoulder\" \"fu\" \"score\" <\/em><\/span><\/code><\/pre>\n\n\n\n<p>Create a treatment plan (called treatplan) with vtreat&#8217;s designTreatmentsZ() function that takes as first argument the data frame to be treated (shoulders) and as second argument the variables to be included in the final data frame (vars):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark 
style=\"background-color:rgba(0, 0, 0, 0);color:#f50f2e\" class=\"has-inline-color\">treatplan &lt;- designTreatmentsZ(shoulders, vars)<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3b0ff4\" class=\"has-inline-color\">\n&#091;1] \"vtreat 1.6.5 inspecting inputs Mon Jun 30 17:32:20 2025\"\n&#091;1] \"designing treatments Mon Jun 30 17:32:20 2025\"\n&#091;1] \" have initial level statistics Mon Jun 30 17:32:20 2025\"\n&#091;1] \" scoring treatments Mon Jun 30 17:32:20 2025\"\n&#091;1] \"have treatment plan Mon Jun 30 17:32:20 2025\"<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The treatplan object contains a lot of information. The columns called &#8220;varName&#8221;, &#8220;origName&#8221; and &#8220;code&#8221; in its scoreFrame need to be extracted and saved (here as an object also called scoreFrame):<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>scoreFrame &lt;- treatplan$scoreFrame&#091;, c(\"varName\", \"origName\", \"code\")]<\/em><\/span><\/code><\/pre>\n\n\n\n<p>Examine the scoreFrame:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#ed0020\" class=\"has-inline-color\">scoreFrame<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#0f01ed\" class=\"has-inline-color\">\n                          varName origName  code\n1                              fu       fu clean\n2                           score    score clean\n3 shoulder_lev_x_hemiarthroplasty shoulder   lev\n4              shoulder_lev_x_tsr shoulder   lev<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The first (<em>fu<\/em>) and second (<em>score<\/em>) variables are the cleaned-up fu and score numerical variables (code &#8220;clean&#8221;). 
The third variable (<em>shoulder_lev_x_hemiarthroplasty<\/em>) is the first level (hemiarthroplasty) of the categorical variable (shoulder) and the fourth variable (<em>shoulder_lev_x_tsr<\/em>) is its second level (tsr); both have code &#8220;lev&#8221;.<\/p>\n\n\n\n<p>We only want the rows with codes &#8220;clean&#8221; or &#8220;lev&#8221; (here that is all of them, but this is not always the case). Save them as a new object called new_variables:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>new_variables &lt;- dplyr::filter(scoreFrame, code %in% c(\"clean\", \"lev\"))<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>new_variables<\/em><\/span>\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1d07f6\" class=\"has-inline-color\">                          varName origName  code\n1                              fu       fu clean\n2                           score    score clean\n3 shoulder_lev_x_hemiarthroplasty shoulder   lev\n4              shoulder_lev_x_tsr shoulder   lev<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>Create a new data frame that contains the treated data and examine it:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>shoulder_treat &lt;- prepare(treatplan, shoulders, varRestriction = new_variables$varName)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>shoulder_treat<\/em><\/span>\n<em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1508ef\" class=\"has-inline-color\">  fu score shoulder_lev_x_hemiarthroplasty shoulder_lev_x_tsr\n1 12    90                               1                  0\n2 10    80                               1                  0\n3  9    95                               0                  1\n4  8    89                               1                  0\n5 10    70                               0                  1<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The structure can be examined with the str() function:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 
0);color:#ed022a\" class=\"has-inline-color\">str(shoulder_treat)<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3103ee\" class=\"has-inline-color\">\n'data.frame':\t5 obs. of  4 variables:\n $ fu                             : num  12 10 9 8 10\n $ score                          : num  90 80 95 89 70\n $ shoulder_lev_x_hemiarthroplasty: num  1 1 0 1 0\n $ shoulder_lev_x_tsr             : num  0 0 1 0 1<\/mark><\/em><\/code><\/pre>\n\n\n\n<p>The &#8220;shoulder&#8221; categorical variable has been recoded into binary (0 or 1) variables, one for each category.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some machine learning software requires categorical data to be converted to numerical data. This section describes how to do this with one-hot encoding. Load the following small data frame: shoulders.rda To look at the data, just enter: The structure of the data frame can be seen with: The first variable is categorical (factor) and the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"inline_featured_image":false,"footnotes":""},"class_list":["post-2180","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/comments?post=2180"}],"version-history":[{"count":1,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/2180\/revisions"}],"predecessor-version":[{"id":4624,"href":"https:\/\/pcool.dyndns.org\/index.p
hp\/wp-json\/wp\/v2\/pages\/2180\/revisions\/4624"}],"wp:attachment":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/media?parent=2180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}