Descriptive Statistics

CENTRAL TENDENCY

Averages: mean, median and mode.

As an example, the first variable (X1) in Anscombe’s first data set is used [1].

To show the data values:

anscombe.quartet$X1

[1] 10  8 13  9 11 14  6  4 12  7  5

Mean: the sum of the observations divided by the number of observations.

Mean(X1)=\frac{1}{n}\sum_{i=1}^{n} X1(i)

where n is the number of observations and X1(i) is the i-th observation. Or in R:

mean(anscombe.quartet$X1)

[1] 9
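The formula can be checked by hand. The following is a minimal sketch, assuming the X1 values listed above are typed into a plain vector; the name x1 is introduced here for illustration only and is not part of the data set.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

n <- length(x1)               # number of observations
manual_mean <- sum(x1) / n    # add the values and divide by n

manual_mean                   # agrees with mean(x1)
```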

Median: middle value of the ordered data. If the number of observations is odd, the median value is one of the observations. If the number of observations is even, the median value is the average of the central two observations.

Or in R:

median(anscombe.quartet$X1)

[1] 9
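The odd and even cases described above can be sketched by hand. This assumes the X1 values are typed into a vector named x1 (an illustrative name); the even case is created artificially by dropping the largest value, purely to show the averaging step.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

sorted <- sort(x1)
n <- length(sorted)

# Odd n: the median is the single middle observation
odd_median <- sorted[(n + 1) / 2]

# Even n (illustration only: drop the largest value):
# the median is the average of the two central observations
even <- sorted[-n]
m <- length(even)
even_median <- (even[m / 2] + even[m / 2 + 1]) / 2

odd_median     # agrees with median(x1)
even_median    # agrees with median(even)
```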

The mode is the most frequently occurring value or category. It is less straightforward to obtain in R than the mean or median and is often best depicted graphically.

There is no standard function for the mode in R. The following:

mode(anscombe.quartet$X1)

returns the internal storage mode of the R object, not the statistical mode:

[1] "numeric"

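The mode can be found with a user-defined function:

Mode <- function(x) {
  ux <- unique(x)    # distinct values, in order of first appearance
  # count how often each distinct value occurs and return the most frequent one
  ux[which.max(tabulate(match(x, ux)))]
}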

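Once the function is defined, the mode is obtained by entering the following (note the capital M, as defined in the function):

Mode(anscombe.quartet$X1)

[1] 10

Every value in X1 occurs exactly once, so there is no unique mode here. In the case of a tie, the function returns the first of the tied values in the data, which is 10.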
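A frequency table shows how often each value occurs, which makes it easy to see whether a meaningful mode exists. The following is a sketch using the base R function table(); the vector x1 is an illustrative copy of the X1 values and the vector y is made-up data, included only to show a case with a repeated value.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

counts <- table(x1)    # frequency of each distinct value
max(counts)            # highest frequency found in X1

# Hypothetical data with a repeated value, for illustration only
y <- c(2, 3, 3, 5, 7)
y_counts <- table(y)
y_mode <- names(y_counts)[y_counts == max(y_counts)]
y_mode                 # the label of the most frequent value, as a character string
```

Because table() stores the values as labels, the result is a character string; as.numeric(y_mode) converts it back to a number if needed.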

DISPERSION

Indicators of the spread or variability of the data.

Variance: the sum of the squares about the mean divided by n - 1 (the sample variance):

Variance(X1)=\frac{1}{n-1}\sum_{i=1}^{n} (X1(i)-Mean(X1))^{2}

The term \sum_{i=1}^{n} (X1(i)-Mean(X1))^{2} is the sum of the squares about the mean.

Or in R:

var(anscombe.quartet$X1)

[1] 11
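The variance can also be built up step by step from the formula. This is a minimal sketch, assuming the X1 values are typed into a vector named x1 (an illustrative name, not part of the data set).

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

n <- length(x1)
deviations <- x1 - mean(x1)         # distance of each observation from the mean
sum_sq <- sum(deviations^2)         # sum of the squares about the mean
manual_var <- sum_sq / (n - 1)      # divide by n - 1, not n

sum_sq
manual_var                          # agrees with var(x1)
```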

Standard Deviation: the square root of the variance:

StandardDeviation(X1)=\sqrt{Variance(X1)}

or

StandardDeviation(X1)=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X1(i)-Mean(X1))^{2}}

or

Variance(X1)=(StandardDeviation(X1))^{2}

Or in R:

sd(anscombe.quartet$X1)

[1] 3.316625

The square root of the variance gives the same result:

sqrt(var(anscombe.quartet$X1))

[1] 3.316625

Range: the maximum observation minus the minimum observation.

In R, range() returns the minimum and the maximum rather than their difference:

range(anscombe.quartet$X1)

[1]  4 14

So, the range is 14 - 4 = 10.
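To obtain the range as a single number, the two values returned by range() can be subtracted. A short sketch, assuming the X1 values are typed into a vector named x1 (an illustrative name):

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

range_width <- diff(range(x1))      # maximum minus minimum
range_width
max(x1) - min(x1)                   # equivalent calculation
```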

Minimum: the lowest value.

Or in R:

min(anscombe.quartet$X1)

[1] 4

Maximum: the highest value.

Or in R:

max(anscombe.quartet$X1)

[1] 14

Interquartile range: ‘midspread’ or ‘mid fifty’, the difference between the upper and lower quartiles.

The three quartiles Q1, Q2 and Q3 divide the ordered data into four equal parts. Q2 is equal to the median value and the interquartile range is Q3 minus Q1:

IQR = Q3 – Q1

Or in R:

IQR(anscombe.quartet$X1)

[1] 5
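The quartiles themselves can be inspected with the base R function quantile(). This is a sketch, assuming the X1 values are typed into a vector named x1 (an illustrative name). Note that quantile() and IQR() use R's default quantile method (type 7); other methods or other software may give slightly different quartiles.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

q <- quantile(x1, probs = c(0.25, 0.5, 0.75))   # Q1, Q2 (median), Q3
q

manual_iqr <- unname(q[3] - q[1])               # Q3 minus Q1
manual_iqr                                      # agrees with IQR(x1)
```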

In a box plot, the IQR is indicated by the box in the plot.
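The numbers behind a box plot can be inspected without drawing it. The following is a sketch using the base R function boxplot() with plot = FALSE, assuming the X1 values are typed into a vector named x1 (an illustrative name). The box edges are the 'hinges', which coincide with Q1 and Q3 for this data set but can differ slightly from the quantile() values for other sample sizes.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

# plot = FALSE returns the statistics only; omit it to draw the plot
b <- boxplot(x1, plot = FALSE)

# Rows of b$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

box_height <- b$stats[4, 1] - b$stats[2, 1]   # height of the box
box_height
```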

Normal distribution: the mean, median and mode are equal.

Descriptive statistics can be obtained using JGR, or alternatively by entering commands manually into the console. For example, to obtain the descriptives of X1 in Anscombe’s first data set, load the data and enter the following into the console:

descriptive.table(vars = d(X1), data = anscombe.quartet, func.names = c("Mean", "St. Deviation", "Valid N", "Median", "Minimum", "Maximum"))

The code can be copied and pasted, but check that the quotation marks are straight (") rather than curly; if they are not, re-enter them.

This should give the following output:

$`strata: all cases `
         Mean.X1 St. Deviation.X1       Valid N.X1        Median.X1       Minimum.X1       Maximum.X1
        9.000000         3.316625        11.000000         9.000000         4.000000        14.000000
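If descriptive.table() is not available, the same six statistics can be collected with base R functions. This is only a sketch and not a replacement for the output above: the vector x1 and the object name descriptives are illustrative, and "Valid N" is assumed here to mean the number of non-missing observations.

```r
# X1 values copied from the listing above; 'x1' is an illustrative name
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)

descriptives <- c(
  "Mean"          = mean(x1),
  "St. Deviation" = sd(x1),
  "Valid N"       = sum(!is.na(x1)),   # assumed: count of non-missing values
  "Median"        = median(x1),
  "Minimum"       = min(x1),
  "Maximum"       = max(x1)
)

round(descriptives, 6)
```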

1. Anscombe F. Graphs in statistical analysis. The American Statistician. 1973;27(1):17–21.