{"id":2124,"date":"2018-01-01T23:11:00","date_gmt":"2018-01-01T23:11:00","guid":{"rendered":"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=2124"},"modified":"2025-06-30T17:25:12","modified_gmt":"2025-06-30T16:25:12","slug":"principle-component-analysis","status":"publish","type":"page","link":"https:\/\/pcool.dyndns.org\/index.php\/principle-component-analysis\/","title":{"rendered":"Principle Component Analysis"},"content":{"rendered":"\n<p>Principle Component Analysis\u00a0(PCA) is a\u00a0linear transformation technique. It is an <strong>unsupervised<\/strong> machine learning method and consequently no classifier is necessary. To illustrate the method, R\u2019s build in data set \u201ciris\u201d is used.<\/p>\n\n\n\n<p>The build in iris data set contains&nbsp;the sepal and petal length and width&nbsp;of three different types of iris plants: Setosa, Versicolor and Virginica. Have a look at the iris data set:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>head(iris)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> Sepal.Length Sepal.Width Petal.Length Petal.Width Species<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>1 5.1 3.5 1.4 0.2 setosa<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>2 4.9 3.0 1.4 0.2 setosa<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>3 4.7 3.2 1.3 0.2 setosa<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>4 4.6 3.1 1.5 0.2 setosa<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>5 5.0 3.6 1.4 0.2 setosa<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>6 5.4 3.9 1.7 0.4 setosa<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>str(iris)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>'data.frame': 150 obs. of 5 variables:<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...<\/em><\/span>\n<span style=\"color: #0000ff;\"><em> $ Species : Factor w\/ 3 levels \"setosa\",\"versicolor\",..: 1 1 1 1 1 1 1 1 1 1 ...<\/em><\/span><\/code><\/pre>\n\n\n\n<p>A plot matrix of the first four numerical variables can be illustrative and is easily obtained:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">plot(iris&#91;,1:4])<\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"886\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/iris_matrix.png\" alt=\"\" class=\"wp-image-3257\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/iris_matrix.png 900w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/iris_matrix-300x295.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/iris_matrix-768x756.png 768w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/figure>\n\n\n\n<p>This illustrates that some variables are associated with each other. The aim of principle component analysis is to describe the data with linear combinations of the variables. Would it be possible to describe the data with less than 4 variables?? 
<p>It is important to standardise the data first (mean = 0 and standard deviation = 1), as otherwise the magnitude of a variable could dominate the analysis. Standardise the four numerical variables (the categorical species column is not needed, since this is an unsupervised technique):</p>

<pre class="wp-block-code has-small-font-size"><code>iris_standardised <- scale(iris[, 1:4])
head(iris_standardised)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]   -0.8976739  1.01560199    -1.335752   -1.311052
[2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
[3,]   -1.3807271  0.32731751    -1.392399   -1.311052
[4,]   -1.5014904  0.09788935    -1.279104   -1.311052
[5,]   -1.0184372  1.24503015    -1.335752   -1.311052
[6,]   -0.5353840  1.93331463    -1.165809   -1.048667</code></pre>

<p>Next, perform the principal component analysis on the standardised data and display a summary:</p>

<pre class="wp-block-code has-small-font-size"><code>pca_iris <- prcomp(iris_standardised)
summary(pca_iris)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000</code></pre>

<p>The summary shows that the first principal component explains about 73% of the variance, the first two together 96%, and the first three 99% (cumulative proportions). All four components together explain all of the variance. To check this, sum the variances of the components (the standard deviations are stored in pca_iris$sdev):</p>

<pre class="wp-block-code has-small-font-size"><code>pca_iris$sdev
[1] 1.7083611 0.9560494 0.3830886 0.1439265
# the variances should add up to the number of variables
sum(pca_iris$sdev^2)
[1] 4</code></pre>

<p>So the sum of the variances is four, which is the number of variables in the data set. How many principal components should be used to describe the data set? One component explains only 73% of the variance. Two components seem reasonable, but this choice should be justified.</p>
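<p>As a quick sanity check (not part of the original text), one can confirm that the standardised columns indeed have mean 0 and standard deviation 1, and that the proportions of variance reported by summary() follow directly from the standard deviations. Note that prcomp(iris[, 1:4], scale. = TRUE) would standardise internally and give equivalent results:</p>

<pre class="wp-block-code has-small-font-size"><code># each standardised column should have mean 0 and standard deviation 1
round(colMeans(iris_standardised), 10)
apply(iris_standardised, 2, sd)

# proportion and cumulative proportion of variance, computed from the
# standard deviations stored in pca_iris$sdev
pca_iris$sdev^2 / sum(pca_iris$sdev^2)
cumsum(pca_iris$sdev^2) / sum(pca_iris$sdev^2)</code></pre>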
<p>First, draw a scree plot:</p>

<pre class="wp-block-code has-small-font-size"><code>screeplot(pca_iris, type = 'lines')</code></pre>

<figure class="wp-block-image size-full"><img src="https://pcool.dyndns.org/wp-content/uploads/2025/06/scree.png" alt="Scree plot of the principal components of the iris data" /></figure>

<p>The elbow of the plot lies at the third component, so it is reasonable to retain the two components before the elbow. Kaiser's criterion states that, for standardised data, components with a variance greater than 1 should be retained; here only the first component clearly exceeds 1, while the second (variance of about 0.91) falls just below the threshold, so this criterion is borderline for two components. Finally, to explain at least 80% of the variance, two components are required. Overall, retaining two principal components is a reasonable choice.</p>

<p>The loadings of the first principal component can be obtained by:</p>

<pre class="wp-block-code has-small-font-size"><code>pca_iris$rotation[, 1]
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.5210659   -0.2693474    0.5804131    0.5648565 </code></pre>

<p>and those of the second principal component by:</p>

<pre class="wp-block-code has-small-font-size"><code>pca_iris$rotation[, 2]
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
 -0.37741762  -0.92329566  -0.02449161  -0.06694199 </code></pre>

<p>The sum of the squared loadings of each component should be one:</p>

<pre class="wp-block-code has-small-font-size"><code>sum(pca_iris$rotation[, 1]^2)  # should be one
[1] 1
sum(pca_iris$rotation[, 2]^2)  # should be one
[1] 1</code></pre>

<p>Looking at the loadings, the first principal component can be regarded as a contrast between sepal width and the other three variables (sepal length, petal length and petal width). The second principal component is dominated by sepal width and, to a lesser degree, sepal length, with petal length and petal width contributing almost nothing.</p>
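<p>The loadings are the coefficients of the linear combinations mentioned earlier. As an optional check that is not in the original text, one can verify that the loading vectors are orthonormal and that the component scores in pca_iris$x are exactly these linear combinations of the standardised variables:</p>

<pre class="wp-block-code has-small-font-size"><code># the loading vectors are orthonormal: their cross-product is the identity matrix
round(crossprod(pca_iris$rotation), 10)

# the scores are linear combinations of the standardised variables,
# with the loadings as coefficients
scores_manual <- iris_standardised %*% pca_iris$rotation
all.equal(unname(scores_manual), unname(pca_iris$x))</code></pre>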
<p>The first principal component can be plotted against the second to illustrate how the components separate the data into clusters. Remember, no class labels were used in the analysis; the clusters are only compared with the species afterwards to illustrate how principal component analysis can help to gain insight:</p>

<pre class="wp-block-code has-small-font-size"><code>iris_pca_df <- data.frame(first   = pca_iris$x[, 1],
                          second  = pca_iris$x[, 2],
                          species = iris$Species)

library(ggplot2)
ggplot(iris_pca_df, aes(x = first, y = second, colour = species)) +
  geom_point() +
  ggtitle("Principal Component Analysis") +
  scale_x_continuous("First Principal Component") +
  scale_y_continuous("Second Principal Component") +
  theme_bw()</code></pre>

<figure class="wp-block-image size-full"><img src="https://pcool.dyndns.org/wp-content/uploads/2025/06/pca.png" alt="First versus second principal component of the iris data, coloured by species" /></figure>

<p>As can be seen, the first principal component separates the setosa plants well from the other two species and, to a lesser extent, separates versicolor from virginica. The second principal component helps to separate versicolor and virginica.</p>
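<p>The question from the beginning, whether the data can be described with fewer than four variables, can also be answered numerically. The following optional check (not part of the original text) reconstructs the standardised data from the first two components only and measures how much of the total variation the approximation retains:</p>

<pre class="wp-block-code has-small-font-size"><code># approximate the standardised data using only the first two components
approx_2pc <- pca_iris$x[, 1:2] %*% t(pca_iris$rotation[, 1:2])

# fraction of the total sum of squares of the standardised data retained by the
# two-component approximation; it matches the cumulative proportion of variance
sum(approx_2pc^2) / sum(iris_standardised^2)</code></pre>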