Scatter Plot

Scatter plots are commonly used in medicine to illustrate the relation between two continuous variables. However, they can also be used to show discrete numerical and ordinal data.

Download the anscombe.rda dataset for this example [1].
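If the downloaded file is not at hand, the same values ship with base R as the built-in `anscombe` data frame. The following is a minimal sketch, assuming the .rda file defines an object called anscombe.quartet with upper-case column names (as the printout in this example suggests); the renaming and reordering steps are only there to match that layout.

```r
# Option 1 (assumption: anscombe.rda is in the working directory and
# defines an object named anscombe.quartet):
# load("anscombe.rda")

# Option 2: rebuild the same data frame from base R's built-in copy,
# which uses lower-case column names in the order x1..x4, y1..y4.
anscombe.quartet <- datasets::anscombe
names(anscombe.quartet) <- toupper(names(anscombe.quartet))

# Reorder the columns into X/Y pairs to match the printout in this example.
anscombe.quartet <- anscombe.quartet[, c("X1", "Y1", "X2", "Y2",
                                         "X3", "Y3", "X4", "Y4")]
```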

Anscombe’s fictional data sets can be shown by:

anscombe.quartet
   X1    Y1 X2   Y2 X3    Y3 X4    Y4
1  10  8.04 10 9.14 10  7.46  8  6.58
2   8  6.95  8 8.14  8  6.77  8  5.76
3  13  7.58 13 8.74 13 12.74  8  7.71
4   9  8.81  9 8.77  9  7.11  8  8.84
5  11  8.33 11 9.26 11  7.81  8  8.47
6  14  9.96 14 8.10 14  8.84  8  7.04
7   6  7.24  6 6.13  6  6.08  8  5.25
8   4  4.26  4 3.10  4  5.39 19 12.50
9  12 10.84 12 9.13 12  8.15  8  5.56
10  7  4.82  7 7.26  7  6.42  8  7.91
11  5  5.68  5 4.74  5  5.73  8  6.89

The four data sets are X1 vs. Y1, X2 vs. Y2, X3 vs. Y3 and X4 vs. Y4. The X variables have identical means and standard deviations, and the Y variables have nearly identical means and standard deviations:

descriptive.table(vars = d(X1,X2,X3,X4,Y1,Y2,Y3,Y4), data = anscombe.quartet,
                  func.names = c("Valid N","Mean","Median","St. Deviation"))
$`strata: all cases `
   Valid N     Mean Median St. Deviation
X1      11 9.000000   9.00      3.316625
X2      11 9.000000   9.00      3.316625
X3      11 9.000000   9.00      3.316625
X4      11 9.000000   8.00      3.316625
Y1      11 7.500909   7.58      2.031568
Y2      11 7.500909   8.14      2.031657
Y3      11 7.500000   7.11      2.030424
Y4      11 7.500909   7.04      2.030579
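The same summary can be approximated without descriptive.table. The sketch below uses only base R functions on the built-in `anscombe` data frame (lower-case column names); the object name summary.stats is chosen here purely for illustration.

```r
# Built-in copy of the quartet (columns x1..x4, y1..y4).
quartet <- datasets::anscombe

# For every column, compute the four statistics shown in the table above.
# sapply() returns one column per variable, so t() flips it to one row each.
summary.stats <- t(sapply(quartet, function(v) {
  c("Valid N"       = sum(!is.na(v)),
    "Mean"          = mean(v),
    "Median"        = median(v),
    "St. Deviation" = sd(v))
}))

print(summary.stats)
```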

It is important to plot data, rather than relying solely on descriptive statistics, so that the relation between the variables can be appreciated. To plot the first data set:

library(ggplot2)
ggplot() +
  geom_point(aes(x = X1, y = Y1), data = anscombe.quartet) +
  ggtitle(label = 'Anscombe\'s First Data Set') +
  theme_bw()

If you are using ggplot2 < 0.9.2, the title can be set using: opts(title = 'Anscombe\'s First Data Set')

The backslash (\) before the apostrophe in Anscombe\'s is required so that R treats the apostrophe as part of the title rather than as the end of the text string.
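The escaping rule can be checked directly in the console. As a small sketch (the variable names are illustrative only), wrapping the title in double quotes avoids the backslash altogether and yields the same string:

```r
# Single-quoted string: the apostrophe must be escaped with a backslash.
title.escaped <- 'Anscombe\'s First Data Set'

# Double-quoted string: the apostrophe needs no escaping.
title.double <- "Anscombe's First Data Set"

# Both forms produce exactly the same text.
identical(title.escaped, title.double)  # TRUE
```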

[Figure: scatter plot of Anscombe's first data set]

It is customary to put the independent (explanatory or predictor) variable on the x-axis (abscissa) and the dependent (response or outcome) variable on the y-axis (ordinate). However, it is not always clear which variable is dependent and which independent.
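To see why plotting matters, it can help to draw all four data sets side by side with the same axis convention (X on the abscissa, Y on the ordinate). The following is one possible sketch, not part of the original example: it assumes ggplot2 is installed, uses the built-in `anscombe` data frame, and introduces the helper names quartet.long and p for illustration.

```r
library(ggplot2)

# Built-in copy of the quartet (columns x1..x4, y1..y4).
quartet <- datasets::anscombe

# Stack the four x/y pairs into one long data frame with a data set label.
quartet.long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Data set", i),
             x   = quartet[[paste0("x", i)]],
             y   = quartet[[paste0("y", i)]])
}))

# One panel per data set; every panel shares the same axes.
p <- ggplot(quartet.long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set) +
  ggtitle("Anscombe's Four Data Sets") +
  theme_bw()

print(p)
```

Faceting keeps the axes identical across panels, so differences in shape between the data sets are not hidden by differences in scale.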

1. Anscombe F. Graphs in statistical analysis. The American Statistician. 1973;27(1):17–21.