CENTRAL TENDENCY
Averages: mean, median and mode.
As an example, the first variable (X1) in Anscombe’s first data set¹ can be used.
To show the data values:
anscombe.quartet$X1
[1] 10  8 13  9 11 14  6  4 12  7  5
Mean: the sum of all observations divided by the number of observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where n is the number of observations.
Or in R:
mean(anscombe.quartet$X1)
[1] 9
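The same result can be reproduced step by step as a check on the definition. This is a minimal sketch that assumes the eleven X1 values printed above are copied into a plain vector; the name x1 is chosen here for illustration, so the code runs without the anscombe.quartet data frame:

```r
# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

n <- length(x1)            # number of observations: 11
total <- sum(x1)           # sum of all observations: 99
manual_mean <- total / n   # 99 / 11

manual_mean                # 9, the same value as mean(x1)
```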
Median: the middle value of the ordered data. If the number of observations is odd, the median is the middle observation. If the number of observations is even, the median is the mean of the two central observations.
Or in R:
median(anscombe.quartet$X1)
[1] 9
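Both cases of the definition can be sketched by sorting the data and picking the central value or values. The vector x1 is an illustrative copy of the X1 values, and the even case is constructed here by dropping the largest observation; it is not part of the original data set:

```r
# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

sorted <- sort(x1)                  # 4 5 6 7 8 9 10 11 12 13 14
n <- length(sorted)                 # 11, an odd number
odd_median <- sorted[(n + 1) / 2]   # the 6th value: 9

# Even case (constructed for illustration): drop the largest value
even <- sorted[-n]                  # 4 5 6 7 8 9 10 11 12 13
m <- length(even)                   # 10, an even number
even_median <- mean(even[c(m / 2, m / 2 + 1)])  # mean of the 5th and 6th values: 8.5
```

Both results agree with median(x1) and median(even).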
The mode, or most common category, is more difficult to calculate and is often best depicted graphically.
There is no standard function for the mode in R. The following:
mode(anscombe.quartet$X1)
[1] "numeric"
returns the internal storage mode of the R object and not the mode!
The mode can be found with a user defined function:
Mode <- function(x) {
  ux <- unique(x)                          # the distinct values in x
  ux[which.max(tabulate(match(x, ux)))]    # the value with the highest count (the first one if tied)
}
The mode can then be calculated by entering:
Mode(anscombe.quartet$X1)
[1] 10
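Note the capital M as defined in the function! Note also that all eleven values of X1 are distinct, so every value occurs exactly once. The function then simply returns the first value it encounters (10), and X1 has no meaningful mode.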
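Because Mode() relies on which.max(), it reports only one value even when several values share the highest count. A possible alternative, sketched here under the assumption that the input is a numeric vector, returns every tied value. The helper name all_modes and the two small test vectors are hypothetical and chosen for illustration:

```r
# Hypothetical helper: returns every value that shares the highest count
# (assumes a numeric vector, because the table names are converted back to numbers)
all_modes <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])    # keep the values with the top count
}

# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
length(all_modes(x1))          # 11: every value is tied, so there is no single mode

all_modes(c(2, 3, 3, 5, 7))    # 3
all_modes(c(2, 3, 3, 5, 5))    # 3 5 (two modes)
```

Returning all tied values makes it visible when a data set has no unique mode, at the cost of a result whose length can vary.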
DISPERSION
Indicators of the spread or variability of the data.
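Variance: the sum of the squares about the mean divided by n − 1 (the sample variance, which is what var() in R computes):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The term
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
is the sum of the squares about the mean.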
Or in R:
var(anscombe.quartet$X1)
[1] 11
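The value returned by var() can be checked by computing the sum of the squares about the mean directly and dividing by n − 1. This is a minimal sketch in which x1 is an illustrative copy of the X1 values:

```r
# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

deviations <- x1 - mean(x1)                    # distance of each value from the mean (9)
sum_squares <- sum(deviations^2)               # sum of the squares about the mean: 110
manual_var <- sum_squares / (length(x1) - 1)   # 110 / 10 = 11

manual_var                                     # 11, the same value as var(x1)
```

Dividing by n (11) instead of n − 1 (10) would give 10, which does not match the output of var().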
Standard deviation: the square root of the variance $s^2$:
$$s = \sqrt{s^2}$$
or
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Or in R:
sd(anscombe.quartet$X1)
[1] 3.316625
The square root of the variance gives the same result:
sqrt(var(anscombe.quartet$X1))
[1] 3.316625
Range: the maximum observation minus the minimum observation.
Or in R:
range(anscombe.quartet$X1)
[1] 4 14
The range() function returns the minimum and the maximum rather than their difference, so the range is 14 − 4 = 10.
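To obtain the range as a single number, the two values returned by range() can be subtracted. A minimal sketch, with x1 as an illustrative copy of the X1 values:

```r
# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

range_width <- diff(range(x1))   # maximum minus minimum: 14 - 4
range_width                      # 10

max(x1) - min(x1)                # 10, the same calculation written out
```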
Minimum: Lowest value
Or in R:
min(anscombe.quartet$X1)
[1] 4
Maximum: Highest value
Or in R:
max(anscombe.quartet$X1)
[1] 14
Interquartile range: ‘midspread’ or ‘mid fifty’, the difference between the upper and lower quartiles.
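The quartiles divide the ordered data into four equal parts: Q1 is the lower quartile (25th percentile), Q2 is equal to the median value and Q3 is the upper quartile (75th percentile); Q4 corresponds to the maximum. The interquartile range is Q3 minus Q1: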
IQR = Q3 – Q1
Or in R:
IQR(anscombe.quartet$X1)
[1] 5
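The IQR can also be derived from the quartiles themselves. The sketch below uses quantile() with its default method (type = 7); x1 is an illustrative copy of the X1 values:

```r
# X1 values copied from the output above; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

q <- quantile(x1, c(0.25, 0.75))   # lower and upper quartiles, default type = 7
q1 <- q[["25%"]]                   # 6.5
q3 <- q[["75%"]]                   # 11.5

q3 - q1                            # 5, the same value as IQR(x1)
```

R implements several quantile definitions (the type argument of quantile()), so other definitions or other software may report slightly different quartiles for small samples.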
In a box plot, the box spans the IQR, from Q1 to Q3.
Normal distribution: the mean, median and mode coincide, so in Normally distributed data they are approximately the same.
Descriptive statistics can be obtained in R in one go. For example, to obtain the descriptives of X1 in Anscombe’s first data set, load the data and enter the following into the console:
summary(anscombe.quartet$X1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0     6.5     9.0     9.0    11.5    14.0
However, this summary doesn’t provide the standard deviation. To obtain a custom description with the dplyr² package:
library(dplyr)
anscombe.quartet %>%
  summarise(n = n(), mean = mean(X1), median = median(X1), min = min(X1),
            max = max(X1), iqr = IQR(X1), sd = sd(X1), var = var(X1),
            q1 = quantile(X1, 0.25), q2 = quantile(X1, 0.5),
            q3 = quantile(X1, 0.75), q4 = quantile(X1, 1))
   n mean median min max iqr       sd var  q1 q2   q3 q4
1 11    9      9   4  14   5 3.316625  11 6.5  9 11.5 14
Summarising data
If the distribution conforms to a Normal distribution, the data should be summarised by the mean (central tendency) and the standard deviation (spread). However, the mean is very sensitive to outliers and is not a good measure of central tendency in skewed distributions. When data do not conform to a Normal distribution, they should be summarised by the median and the interquartile range (middle 50% of the data).
Normally distributed data: mean and standard deviation
Otherwise: median and interquartile range
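The sensitivity of the mean and standard deviation to outliers can be illustrated with a small experiment. In this sketch x1 is an illustrative copy of the X1 values, and x1_outlier is a made-up variant in which the largest value (14) is replaced by 140; the outlier is an assumption for demonstration and is not part of Anscombe’s data:

```r
# X1 values copied from the output earlier in this section; x1 is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# Made-up variant: replace the largest value (14) with an outlier (140)
x1_outlier <- replace(x1, x1 == 14, 140)

# Summary statistics before and after introducing the outlier
describe <- function(x) c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
describe(x1)
describe(x1_outlier)
```

The mean rises from 9 to about 20.5 and the standard deviation increases sharply, while the median (9) and the interquartile range (5) are unchanged.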