{"id":826,"date":"2015-08-10T15:30:51","date_gmt":"2015-08-10T14:30:51","guid":{"rendered":"http:\/\/pcool.dyndns.org:8080\/statsbook\/?page_id=826"},"modified":"2025-07-04T20:55:55","modified_gmt":"2025-07-04T19:55:55","slug":"regression-coefficient","status":"publish","type":"page","link":"https:\/\/pcool.dyndns.org\/index.php\/regression-coefficient\/","title":{"rendered":"Regression Coefficient"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">One often wants to know whether there is a relation or association between two variables. To find out whether such a relation exists, an experiment can be designed. This experiment provides data that can be plotted (<a href=\"https:\/\/pcool.dyndns.org\/index.php\/scatterplot\/\" data-type=\"page\" data-id=\"541\">scatterplot<\/a>). Next, a line or curve that best fits these data is drawn. The mathematical equation of the curve describes the relation. Most commonly, a straight line is fitted through the data; this process is called linear curve fitting.&nbsp; It is, however, also possible to fit <a href=\"https:\/\/pcool.dyndns.org\/index.php\/other-models\/\" data-type=\"page\" data-id=\"832\">non-linear curves<\/a> to data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Linear curve fitting<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Linear curve fitting will be explained with an example. Suppose there may be a relation between the girth and the mass of a tree. The girth and the corresponding mass of 30 trees were measured. The data can be found in <a href=\"https:\/\/pcool.dyndns.org:\/wp-content\/data_files\/trees30.rda\" target=\"_blank\" rel=\"noreferrer noopener\">trees30.rda<\/a>. 
Open the data set in R:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><em><mark style=\"background-color:rgba(0, 0, 0, 0);color:#f60505\" class=\"has-inline-color\">TreeGirthMass\n<\/mark><mark style=\"background-color:rgba(0, 0, 0, 0);color:#3305f5\" class=\"has-inline-color\">   Girth Mass\n1    205  251\n2    213  272\n3    219  335\n4    226  278\n5    231  375\n6    241  335\n7    250  410\n8    266  414\n9    266  478\n10   275  560\n11   296  489\n12   299  506\n13   314  606\n14   315  616\n15   321  562\n16   327  693\n17   327  737\n18   334  610\n19   343  733\n20   347  673\n21   351  726\n22   358  760\n23   358  788\n24   360  766\n25   362  750\n26   362  737\n27   363  707\n28   368  821\n29   369  827\n30   372  772<\/mark><\/em><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As can be seen, there are trees with a girth between 200 and 375 centimetres. In this case, the girth is the <strong><em>independent  (predictor) variable<\/em><\/strong> and the mass the <strong><em>dependent (response) variable  <\/em><\/strong>(the girth of the tree is measured to estimate the mass of the tree). Next, the data is plotted in a <a href=\"https:\/\/pcool.dyndns.org\/index.php\/scatterplot\/\" data-type=\"page\" data-id=\"541\">scatterplot<\/a>. 
It is customary to plot <strong>the independent variable on the x-axis and the dependent variable on the y-axis:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><em><span style=\"color: #ff0000;\">library(ggplot2)<\/span>\n<span style=\"color: #ff0000;\">ggplot(<span style=\"color: #ff0000;\">data=TreeGirthMass, aes(x = Girth,y = Mass)<\/span>) + <\/span>\n<span style=\"color: #ff0000;\">geom_point() + <\/span>\n<span style=\"color: #ff0000;\">theme_bw()<\/span> <span style=\"color: #ff0000;\">+ <\/span>\n<span style=\"color: #ff0000;\">ggtitle(label = 'Girth and Mass Trees') + <\/span>\n<span style=\"color: #ff0000;\">xlab(label = 'Girth &#91;cm]') + <\/span>\n<span style=\"color: #ff0000;\">ylab(label = 'Mass &#91;kg]')<\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30-1024x768.png\" alt=\"\" class=\"wp-image-3852\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30-1024x768.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30-300x225.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30-768x576.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30.png 1355w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In linear curve fitting, a straight line is drawn that fits the data points best. One way of doing this is to plot the data as shown above and draw a line through them with a ruler. When we draw the line, we will try to have as many data points above as below the line. This is certainly an acceptable method and poses no problem in the example above. However, the data do not always lie close to a straight line. If they were more widely scattered, it would be more difficult to draw a straight line through them. 
Furthermore, this graphical method is not very consistent. A mathematical method is favoured as it is far more consistent and reproducible than a graphical method.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Several mathematical methods have been described to fit a straight line through data points, and a full discussion of these is beyond the scope of this book. One commonly used method is the <em><strong>least squares method<\/strong><\/em>. This is illustrated in the next graph:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"212\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/leastsquare.png\" alt=\"\" class=\"wp-image-3263\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/leastsquare.png 304w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/leastsquare-300x209.png 300w\" sizes=\"auto, (max-width: 304px) 100vw, 304px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine a straight line through the data points as shown. For each data point, the vertical distance between the observed value of the dependent variable and the proposed line is calculated (as indicated in the plot). Next, each distance is squared and all of these squares are added together. The square is taken for two reasons:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Points below the line have a negative distance and points above the line a positive distance. They therefore tend to cancel each other out. Squaring makes every term positive, which eliminates the problem.<\/li>\n\n\n\n<li>By taking the square, data points further away from the proposed line are given more \u2018weight\u2019 than those close to the line.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This process is repeated for all possible straight lines. 
The best fitting line is the one where the sum of the squares is <strong><em>least<\/em><\/strong>. This method is therefore called the <strong><em>least squares method<\/em><\/strong>. Computers are used to perform these calculations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Regression coefficient<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A straight line has the following basic equation:<\/p>\n\n\n\n<div class=\"wp-block-mathml-mathmlblock\">\\(y = a \\cdot x + b \\)<script id=\"wp-hooks-js\" src=\"https:\/\/pcool.dyndns.org\/wp-includes\/js\/dist\/hooks.min.js?ver=7496969728ca0f95732d\"><\/script>\n<script id=\"wp-i18n-js\" src=\"https:\/\/pcool.dyndns.org\/wp-includes\/js\/dist\/i18n.min.js?ver=781d11515ad3d91786ec\"><\/script>\n<script id=\"wp-i18n-js-after\">\nwp.i18n.setLocaleData( { 'text direction\\u0004ltr': [ 'ltr' ] } );\n\/\/# sourceURL=wp-i18n-js-after\n<\/script>\n<script  async id=\"mathjax-js\" src=\"https:\/\/cdnjs.cloudflare.com\/ajax\/libs\/mathjax\/2.7.7\/MathJax.js?config=TeX-MML-AM_CHTML\"><\/script>\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Here, x is the independent variable and y the dependent variable. \u2018a\u2019 is the <strong><em>regression coefficient<\/em><\/strong>. It represents the slope of the line and can be calculated by dividing the difference in y-value by the difference in x-value between two points on the line:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"212\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression1.png\" alt=\"\" class=\"wp-image-3600\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression1.png 304w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression1-300x209.png 300w\" sizes=\"auto, (max-width: 304px) 100vw, 304px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If a = 0, the line is horizontal. 
The larger the absolute value of \u2018a\u2019, the steeper the line is:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"212\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression2.png\" alt=\"\" class=\"wp-image-3601\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression2.png 304w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression2-300x209.png 300w\" sizes=\"auto, (max-width: 304px) 100vw, 304px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">A negative value of \u2018a\u2019 corresponds to a downward slope:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"212\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression3.png\" alt=\"\" class=\"wp-image-3602\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression3.png 304w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/regression3-300x209.png 300w\" sizes=\"auto, (max-width: 304px) 100vw, 304px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In the graph above, the regression coefficient is -20:<\/p>\n\n\n\n<div class=\"wp-block-mathml-mathmlblock\">\\( a = \\frac{80}{-4} = -20 \\)<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The <em><strong>intercept<\/strong> <\/em>\u2018b\u2019 is a constant for the line and represents the y-value at x = 0. If the line goes through the origin of the coordinate system (0,0), then b = 0. If the line crosses the y-axis above the origin, \u2018b\u2019 is positive and if it crosses the y-axis below the origin, \u2018b\u2019 is negative. 
In the graph above, b = 80, so:<\/p>\n\n\n\n<div class=\"wp-block-mathml-mathmlblock\">\\(y = -20 \\cdot x + 80 \\)<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Returning to the example with the girths and masses of 30 trees, it is easy to add a linear regression line with a 95% confidence interval to the plot by adding the <strong><em>geom_smooth function<\/em><\/strong> to the plot:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"font-style: italic; color: rgb(255, 0, 0);\">ggplot(<span style=\"color: #ff0000;\">data=TreeGirthMass, aes(x = Girth,y = Mass)<\/span>) + <\/span><i>\n<\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">geom_point() + <\/span><i>\n<\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">theme_bw()<\/span><i> <\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">+ <\/span><i>\n<\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">ggtitle(label = 'Girth and Mass Trees') + <\/span><i>\n<\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">xlab(label = 'Girth &#91;cm]') + <\/span><i>\n<\/i><span style=\"font-style: italic; color: rgb(255, 0, 0);\">ylab(label = 'Mass &#91;kg]') + <\/span>\n<em><span style=\"color: #ff0000;\"><strong>geom_smooth(aes(x=Girth, y=Mass), data=TreeGirthMass, method='lm')<\/strong><\/span><\/em><\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30regression-1024x768.png\" alt=\"\" class=\"wp-image-3853\" srcset=\"https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30regression-1024x768.png 1024w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30regression-300x225.png 300w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30regression-768x576.png 768w, https:\/\/pcool.dyndns.org\/wp-content\/uploads\/2025\/06\/trees30regression.png 
1355w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The computer has drawn the best fitting line through the data points using the least squares method. In addition, the 95% <a href=\"https:\/\/pcool.dyndns.org\/index.php\/confidence-intervals\/\" data-type=\"page\" data-id=\"892\">confidence interval<\/a> is indicated with grey shading. However, the plot does not provide the formula of the regression line. The formula can be found by fitting a linear model:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code><span style=\"color: #ff0000;\"><em>fit &lt;- lm(Mass~Girth, data=TreeGirthMass)<\/em><\/span>\n<span style=\"color: #ff0000;\"><em>fit<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>Call:<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>lm(formula = Mass ~ Girth, data = TreeGirthMass)<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>Coefficients:<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>(Intercept)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Girth \u00a0<\/em><\/span>\n<span style=\"color: #0000ff;\"><em>\u00a0\u00a0\u00a0 -411.62\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3.24\u00a0<\/em> <\/span><\/code><\/pre>\n\n\n\n<p class=\"is-style-text-annotation is-style-text-annotation--1 wp-block-paragraph\">lm = linear model<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The regression coefficient is 3.24 and the intercept is -411.62; therefore, the formula of the regression line is:<\/p>\n\n\n\n<div class=\"wp-block-mathml-mathmlblock\">\\(Mass = 3.24 \\cdot Girth - 411.62 \\)<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The regression coefficient is a measure of the slope of the line. It ranges from -\u221e to +\u221e. A regression coefficient of zero means the line is horizontal; a positive value corresponds to an upward slope and a negative value to a downward slope. 
The larger the absolute value of the regression coefficient, the steeper the slope.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One often wants to know whether there is a relation or association between two variables. To find out whether such a relation exists, an experiment can be designed. This experiment provides data that can be plotted (scatterplot). Next, a line or curve that best fits these data is drawn. The mathematical equation of the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"inline_featured_image":false,"footnotes":""},"class_list":["post-826","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/comments?post=826"}],"version-history":[{"count":3,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/826\/revisions"}],"predecessor-version":[{"id":4918,"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/pages\/826\/revisions\/4918"}],"wp:attachment":[{"href":"https:\/\/pcool.dyndns.org\/index.php\/wp-json\/wp\/v2\/media?parent=826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}