Scatter Plot – Statsbook

Scatter plots are commonly used in medicine to illustrate the relation between two continuous variables. However, scatter plots can also be used to show discrete numeral and ordinal data.

Download the anscombe.rda dataset for this example¹.

Anscombe’s fictional data sets can be shown by:

anscombe.quartet
   X1    Y1 X2   Y2 X3    Y3 X4    Y4
1  10  8.04 10 9.14 10  7.46  8  6.58
2   8  6.95  8 8.14  8  6.77  8  5.76
3  13  7.58 13 8.74 13 12.74  8  7.71
4   9  8.81  9 8.77  9  7.11  8  8.84
5  11  8.33 11 9.26 11  7.81  8  8.47
6  14  9.96 14 8.10 14  8.84  8  7.04
7   6  7.24  6 6.13  6  6.08  8  5.25
8   4  4.26  4 3.10  4  5.39 19 12.50
9  12 10.84 12 9.13 12  8.15  8  5.56
10  7  4.82  7 7.26  7  6.42  8  7.91
11  5  5.68  5 4.74  5  5.73  8  6.89

The four data sets are x1 vs y1, x2 vs y2, x3 vs y3 and x4 vs y4. The x and y variables have identical means:

summary(anscombe.quartet$X1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     6.5     9.0     9.0    11.5    14.0 
summary(anscombe.quartet$X2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     6.5     9.0     9.0    11.5    14.0 
summary(anscombe.quartet$X3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     6.5     9.0     9.0    11.5    14.0 
summary(anscombe.quartet$X4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      8       8       8       9       8      19 
summary(anscombe.quartet$Y1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.260   6.315   7.580   7.501   8.570  10.840 
summary(anscombe.quartet$Y2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.100   6.695   8.140   7.501   8.950   9.260 
summary(anscombe.quartet$Y3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.39    6.25    7.11    7.50    7.98   12.74 
summary(anscombe.quartet$Y4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.250   6.170   7.040   7.501   8.190  12.500

and standard deviations:

sd(anscombe.quartet$X1)
[1] 3.316625
sd(anscombe.quartet$X2)
[1] 3.316625
sd(anscombe.quartet$X3)
[1] 3.316625
sd(anscombe.quartet$X4)
[1] 3.316625
sd(anscombe.quartet$Y1)
[1] 2.031568
sd(anscombe.quartet$Y2)
[1] 2.031657
sd(anscombe.quartet$Y3)
[1] 2.030424
sd(anscombe.quartet$Y4)
[1] 2.030579

It is important to plot data, rather than solely relying on descriptive parameters, so that their relation can be appreciated. To plot the first data set:

ggplot(data=anscombe.quartet, aes(x = X1, y = Y1)) +
geom_point() +
ggtitle(label = 'Anscombe\'s First Data Set') +
theme_bw()

The backslash \ before the ‘s is required so the quotation mark does not indicate the end of the title’s text string, but that the quotation mark is part of the title.

It is customary to put the independent (explanatory or predictor) variable on the x-axis (abscissa) and the dependent (response or outcome) variable on the y-axis (ordinate). However, it is not always clear which variable is dependent and which independent.

The second data set:

ggplot(data=anscombe.quartet, aes(x = X2, y = Y2)) +
geom_point() +
ggtitle(label = 'Anscombe\'s Second Data Set') +
theme_bw()

The third data set:

ggplot(data=anscombe.quartet, aes(x = X3, y = Y3)) +
geom_point() +
ggtitle(label = 'Anscombe\'s Third Data Set') +
theme_bw()

And the fourth data set:

ggplot(data=anscombe.quartet, aes(x = X4, y = Y4)) +
geom_point() +
ggtitle(label = 'Anscombe\'s Fourth Data Set') +
theme_bw()

This illustrates the importance of plotting your data.

References

↑ FJ Anscombe, « Graphs in statistical analysis » on , 1973-03-26