Statsbook

Descriptive Statistics

CENTRAL TENDENCY

Averages: mean, median and mode.

As an example, the first variable (X1) in Anscombe’s first data set [1] can be used.

To show the data values:

anscombe.quartet$X1
[1] 10  8 13  9 11 14  6  4 12  7  5

Mean: the sum of the observations divided by the number of observations.

\(Mean(X1) = \frac{1}{n} \sum_{i=1}^{n} X1(i) \)

Where n is the number of observations.

Or in R:

mean(anscombe.quartet$X1)
[1] 9
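To make the formula concrete, the mean can also be computed by hand. The following is only a small sketch: the vector name x1 is an assumption made for illustration, and it simply copies the eleven values printed above so that the snippet does not depend on how anscombe.quartet was loaded.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

n <- length(x1)             # number of observations: 11
manual_mean <- sum(x1) / n  # add the values (99), then divide by n

manual_mean                 # 9, the same value as mean(x1)
```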

Median: middle value of the ordered data. If the number of observations is odd, the median value is one of the observations. If the number of observations is even, the median value is the mean of the central two observations.

Or in R:

median(anscombe.quartet$X1)
[1] 9
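The odd and even cases described above can be sketched with a small user-defined function. The names x1 and manual_median are hypothetical and used only for illustration; in practice median() does the same job.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# Hypothetical helper that follows the definition of the median step by step
manual_median <- function(x) {
  s <- sort(x)                     # order the data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])   # even n: mean of the two central values
  }
}

manual_median(x1)       # 11 observations: the 6th ordered value, 9
manual_median(x1[-6])   # without the 14, 10 observations remain: (8 + 9) / 2 = 8.5
```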

The mode, or most common category, is more difficult to calculate and is often best depicted graphically.

There is no standard function for the mode in R. The following:

mode(anscombe.quartet$X1)
[1] "numeric"

returns the internal storage mode of the R object and not the mode!

The mode can be found with a user defined function:

Mode <- function(x) {
 ux <- unique(x)
 ux[which.max(tabulate(match(x, ux)))]
}

With this function defined, the mode is obtained by entering:

Mode(anscombe.quartet$X1)
[1] 10

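Note the capital M as defined in the function! Because every value in X1 occurs exactly once, these data have no unique mode; in the case of a tie the function returns the first of the tied values, here 10.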
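When several values share the highest frequency, it may be more informative to see all of them. The following is a minimal sketch of such a variant; all_modes is a hypothetical helper name, and x1 copies the values shown earlier.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# Hypothetical helper: returns every value tied for the highest frequency
all_modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))  # how often each unique value occurs
  ux[counts == max(counts)]
}

table(x1)                     # frequency table: each value occurs once
length(all_modes(x1))         # 11, so all values are tied
all_modes(c(1, 2, 2, 3, 3))   # a made-up vector with two modes: 2 and 3
```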

DISPERSION

Indicators of the spread or variability of the data.

Variance: the sum of the squares about the mean, divided by the number of observations minus one (the sample variance):

\(Variance(X1) = \frac{1}{n-1} \sum_{i=1}^{n}(X1(i) - Mean(X1))^2 \)

The term

\( \sum_{i=1}^{n}(X1(i) - Mean(X1))^2 \)

is the sum of the squares about the mean.

Or in R:

var(anscombe.quartet$X1)
[1] 11
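The variance formula can be followed step by step. This is a sketch only; the intermediate names (x1, deviations, sum_squares, manual_var) are assumptions introduced for illustration.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

n <- length(x1)
deviations <- x1 - mean(x1)          # distance of each value from the mean
sum_squares <- sum(deviations^2)     # sum of the squares about the mean
manual_var <- sum_squares / (n - 1)  # divide by n - 1, not n

sum_squares   # 110
manual_var    # 11, the same value as var(x1)
```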

Standard Deviation: the square root of the variance:

\(SD(X1) = \sqrt{Variance(X1)} \)

or

\(SD(X1) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(X1(i) - Mean(X1))^2} \)

or

\(Variance(X1) = (SD(X1))^2 \)

Or in R:

sd(anscombe.quartet$X1)
[1] 3.316625

The square root of the variance gives the same result:

sqrt(var(anscombe.quartet$X1))
[1] 3.316625

Range: the maximum observation minus the minimum observation.

Or in R:

range(anscombe.quartet$X1)
[1]  4 14

Note that range() returns the minimum and the maximum rather than their difference. So, the range is 14 - 4 = 10.
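To get the range as a single number, the two values returned by range() can be subtracted with diff(). A short sketch, with x1 again copying the values shown earlier:

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

range_width <- diff(range(x1))  # maximum minus minimum
range_width                     # 10
```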

Minimum: Lowest value

Or in R:

min(anscombe.quartet$X1)
[1] 4

Maximum: Highest value

Or in R:

max(anscombe.quartet$X1)
[1] 14

Interquartile range: ‘midspread’ or ‘mid fifty’, the difference between the upper and lower quartiles.

The quartiles divide the ordered data into four equal parts: Q1 is the 25th percentile, Q2 is equal to the median value, Q3 is the 75th percentile and Q4 is the maximum. The interquartile range is Q3 minus Q1:

IQR = Q3 – Q1

Or in R:

IQR(anscombe.quartet$X1)
[1] 5

In a box plot, the IQR is indicated by the box in the plot.
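The IQR can also be built from the two quartiles directly. The sketch below assumes R's default quantile method (type 7), which IQR() also uses by default; other quantile types can give slightly different quartiles for small samples. The names x1, q and manual_iqr are illustrative only.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

q <- quantile(x1, probs = c(0.25, 0.75))  # lower (Q1) and upper (Q3) quartiles
q                                         # 6.5 and 11.5

manual_iqr <- unname(q[2] - q[1])         # Q3 minus Q1
manual_iqr                                # 5, the same value as IQR(x1)
```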

Normal distribution: the mean, median and mode coincide; in a sample drawn from a Normal distribution they are approximately the same.

Descriptive statistics can be obtained in R in one go. For example, to obtain the descriptives of X1 in Anscombe’s first data set, load the data and enter the following into the console:

summary(anscombe.quartet$X1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     6.5     9.0     9.0    11.5    14.0 

However, this summary does not include the standard deviation. A custom description can be obtained with the dplyr package [2]:

library(dplyr)
anscombe.quartet %>%
  summarise(
    n = n(),
    mean = mean(X1), median = median(X1),
    min = min(X1), max = max(X1),
    iqr = IQR(X1), sd = sd(X1), var = var(X1),
    q1 = quantile(X1, 0.25), q2 = quantile(X1, 0.5),
    q3 = quantile(X1, 0.75), q4 = quantile(X1, 1)
  )
   n mean median min max iqr       sd var  q1 q2   q3 q4
1 11    9      9   4  14   5 3.316625  11 6.5  9 11.5 14

Summarising data

If the data follow a Normal distribution, they should be summarised by the mean (central tendency) and the standard deviation (spread). However, the mean is very sensitive to outliers and is not a good measure of central tendency in skewed distributions. When the data do not follow a Normal distribution, they should be summarised by the median and the interquartile range (middle 50% of the data).

Normally distributed data: mean and standard deviation

Otherwise: median and interquartile range
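The sensitivity of the mean to outliers can be illustrated with a small experiment. This is a sketch on made-up data: x1_outlier is a hypothetical variant of X1 in which the largest value (14) is replaced by 140; it is not part of Anscombe's data.

```r
# X1 values copied from the output above (the name x1 is illustrative only)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# Made-up variant: the largest value (14) replaced by an outlier of 140
x1_outlier <- replace(x1, which.max(x1), 140)

# Original data
c(mean = mean(x1), median = median(x1), sd = sd(x1), iqr = IQR(x1))

# With the outlier: mean and sd change, median and IQR do not
c(mean = mean(x1_outlier), median = median(x1_outlier),
  sd = sd(x1_outlier), iqr = IQR(x1_outlier))
```

In this example the mean rises from 9 to about 20.5 and the standard deviation grows considerably, while the median (9) and the interquartile range (5) stay exactly the same.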