Regression Coefficient

One often wants to know whether there is a relation or association between two variables. To investigate this, an experiment can be designed. The experiment provides data that can be plotted in a scatterplot. Next, a line or curve that best fits these data is drawn. The mathematical equation of that curve describes the relation. Most commonly, a straight line is fitted through the data; this process is called linear curve fitting. It is, however, also possible to fit non-linear curves to data.

Linear Curve Fitting

Linear curve fitting will be explained with an example. It is suggested that there may be a relation between the girth and the mass of a tree. The girth of 30 trees and their corresponding mass were measured. The data can be found in trees30.rda. Open the data set in JGR to view the data:

TreeGirthMass
   Girth Mass
1    205  251
2    213  272
3    219  335
4    226  278
5    231  375
6    241  335
7    250  410
8    266  414
9    266  478
10   275  560
11   296  489
12   299  506
13   314  606
14   315  616
15   321  562
16   327  693
17   327  737
18   334  610
19   343  733
20   347  673
21   351  726
22   358  760
23   358  788
24   360  766
25   362  750
26   362  737
27   363  707
28   368  821
29   369  827
30   372  772

As can be seen, the trees have girths between 200 and 375 centimetres. In this case, the girth is the independent variable and the mass the dependent variable (if trees had been selected according to their mass rather than their girth, the mass would have been the independent variable and the girth the dependent variable). Next, the data are plotted in a scatterplot. Customarily, the independent variable is plotted on the x-axis and the dependent variable on the y-axis:

ggplot() + geom_point(aes(x = Girth, y = Mass), data = TreeGirthMass) + theme_bw()

Or with a title and axes labels:

ggplot() + geom_point(aes(x = Girth, y = Mass), data = TreeGirthMass) + theme_bw() + ggtitle(label = 'Girth and Mass Trees') + xlab(label = 'Girth [cm]') + ylab(label = 'Mass [kg]')

[Figure: scatterplot of tree girth versus mass]

In linear curve fitting, a straight line is drawn that fits the data points best. One way of doing this is to plot the data as shown above and draw a line through it with a ruler, trying to place as many data points above the line as below it. This is an acceptable method and seems unproblematic in the example above. However, the data do not always lie close to a straight line; if they lay further apart, it would be more difficult to draw a straight line through them. Furthermore, this graphical method is not very consistent. A mathematical method is favoured, as it is far more consistent than a graphical one.

There are several mathematical methods for fitting a straight line through data points, and a full discussion of these is beyond the scope of this book. One commonly used method is the least squares method. It is illustrated in the next graph:

[Figure: scatterplot with a candidate straight line and the vertical distances of the points to the line]

Imagine a straight line through the data points as shown. The vertical distance of each data point to this proposed line is calculated (as indicated in the plot). Next, the square of each distance is taken and all of these squares are added together. The square is taken for two reasons:

  • Points below the line have a negative distance and points above the line a positive distance, so they tend to cancel each other out. Taking the square makes all distances positive, eliminating this problem.
  • By taking the square, data points further away from the proposed line are given more ‘weight’ than those close to the line.

Conceptually, this process is repeated for every possible straight line. The best-fitting line is the one for which the sum of the squares is smallest; this method is therefore called the least squares method. In practice, a computer performs these calculations.
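This can be illustrated in R. The vectors below re-enter the data from the trees30 table above, and rss() is a small helper (the name is purely illustrative) that computes the sum of squared vertical distances for a candidate line y = a×x + b:

```r
# Data re-entered from the trees30 table above
Girth <- c(205, 213, 219, 226, 231, 241, 250, 266, 266, 275,
           296, 299, 314, 315, 321, 327, 327, 334, 343, 347,
           351, 358, 358, 360, 362, 362, 363, 368, 369, 372)
Mass  <- c(251, 272, 335, 278, 375, 335, 410, 414, 478, 560,
           489, 506, 606, 616, 562, 693, 737, 610, 733, 673,
           726, 760, 788, 766, 750, 737, 707, 821, 827, 772)

# Sum of squared vertical distances for a candidate line y = a*x + b
rss <- function(a, b) sum((Mass - (a * Girth + b))^2)

# Compare a line drawn 'by eye' with the least squares line from lm()
rss(3, -350)                                     # hand-drawn guess
fit <- lm(Mass ~ Girth)
rss(coef(fit)[["Girth"]], coef(fit)[["(Intercept)"]])
```

The line returned by lm() yields a smaller sum of squares than the hand-drawn guess; by construction, no other straight line does better.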

Regression coefficient

A straight line has the following basic equation:

y = a×x + b

Where x is the independent variable and y the dependent variable. ‘a’ is the regression coefficient. It represents the slope of the line and can be calculated by dividing the difference in y-value by the difference in x-value between two points on the line:

a = Δy / Δx

If a = 0, the line is horizontal. The larger the value of ‘a’, the more vertical (steeper) the line is:

[Figure: lines with increasingly steep positive slopes]

A negative value of ‘a’ corresponds to a downward slope:

[Figure: a line with a negative slope]

In the graph above, the regression coefficient is 80 / (−4) = −20.

The intercept ‘b’ is a constant for the line and represents the y-value at x = 0. If the line goes through the origin of the coordinate system (0,0), then b = 0. If the line crosses above the origin, ‘b’ is positive, and if it crosses below the origin, ‘b’ is negative. In the graph above, b = 80, so:

y = -20×x + 80
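The slope calculation can be verified numerically. The two points (0, 80) and (4, 0) are assumed here, read off from the slope and intercept given above:

```r
# Slope from two points on the line: a = (y2 - y1) / (x2 - x1)
# The points (0, 80) and (4, 0) are assumed to lie on the line above
x1 <- 0; y1 <- 80   # the intercept: the y-value at x = 0
x2 <- 4; y2 <- 0    # where the line crosses the x-axis
a <- (y2 - y1) / (x2 - x1)   # (0 - 80) / (4 - 0) = -20
b <- y1                      # b = 80
```

Any two distinct points on the line give the same slope.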

Returning to the example with the girths and masses of the 30 trees, it is easy to add a linear regression line with a 95% confidence interval to the plot by adding the geom_smooth function (this is also possible with the plot builder):

ggplot() + geom_point(aes(x = Girth, y = Mass), data = TreeGirthMass) + theme_bw() + ggtitle(label = 'Girth and Mass Trees') + xlab(label = 'Girth [cm]') + ylab(label = 'Mass [kg]') + geom_smooth(aes(x = Girth, y = Mass), data = TreeGirthMass, method = 'lm')

[Figure: scatterplot of tree girth versus mass with the fitted regression line and its 95% confidence band]

The computer has drawn the best-fitting line through the data points using the least squares method. In addition, the 95% confidence interval is indicated with grey shading. However, the plot does not provide the formula of the regression line. The formula can be found with:

fit <- lm(Mass ~ Girth, data = TreeGirthMass)
fit

Call:
lm(formula = Mass ~ Girth, data = TreeGirthMass)

Coefficients:
(Intercept)        Girth  
    -411.62         3.24 

The regression coefficient is 3.24 and the intercept -411.62, therefore the formula of the regression line is:

Mass = 3.24×Girth - 411.62
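As a quick check of the formula, the regression line can be used to predict the mass of a tree from its girth; the girth of 300 cm below is an arbitrary illustrative value:

```r
# Predicting a tree's mass from its girth with the regression line
# Mass = 3.24 * Girth - 411.62
girth <- 300                       # girth in cm (illustrative value)
mass  <- 3.24 * girth - 411.62     # predicted mass in kg
mass                               # 560.38
```

With the fitted model object, the same kind of prediction is obtained with predict(fit, newdata = data.frame(Girth = 300)).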

The regression coefficient is a measure of the slope of the line. It ranges from -∞ to +∞. A regression coefficient of zero means the line is horizontal; a positive value corresponds to an upward slope and a negative value to a downward slope. The larger the absolute value of the regression coefficient, the steeper the slope.