If a measurement is repeated, the measured value is likely to differ each time. If the variation is small, the repeatability is high. It matters whether the measurement is repeated by the same person or by another person. Intra-observer variation is the variation that occurs when the same person repeats the measurements, whereas inter-observer variation is the variation that occurs when a different person repeats them. We would normally expect the inter-observer variation to be larger than the intra-observer variation.
The American Society for Testing and Materials (ASTM) has defined repeatability and reproducibility [1].
Repeatability
Repeatability is precision determined from multiple tests done under repeatability conditions: the test is conducted by the same operator, using the same equipment and laboratory within a short period of time so that neither the equipment nor the environment is likely to change significantly.
Reproducibility
Reproducibility is precision determined from multiple tests done under reproducibility conditions: the test is conducted in different laboratories with different operators and different environmental conditions.
When an accepted reference value is known, the bias, that is the systematic difference between the test results and the reference value, can also be estimated.
The precision limits of the repeatability and reproducibility can be calculated from the sample standard deviation of the test results. The number of tests should be at least 30, so that the sample standard deviation is a reasonable estimate of the population standard deviation. The repeatability precision limit (r) and the reproducibility precision limit (R) are useful for comparing test results within and between laboratories. They are calculated by multiplying the repeatability standard deviation (sr) and the reproducibility standard deviation (sR), respectively, by 2.8. The factor 2.8 is 1.96 times the square root of 2, rounded: for a Normal distribution, 95% of the population lies within 1.96 standard deviations of the mean, and the standard deviation of the difference between two independent results is the square root of 2 times the standard deviation of a single result.
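The factor can be checked directly in R. This is a minimal sketch; the variable names z and limit_factor are chosen here for illustration and do not come from the ASTM standard.

```r
# 97.5th percentile of the standard Normal distribution (about 1.96)
z <- qnorm(0.975)

# Multiply by sqrt(2), because the limit refers to the difference
# between two test results rather than to a single result
limit_factor <- z * sqrt(2)

print(round(limit_factor, 2))  # 2.77, conventionally quoted as 2.8
```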
For example, consider the measurements of femoral heads in the heads.rda file. The data frame is called heads and has five variables: number, reference (the accepted reference value, a measurement of the femoral head using callipers at the time of excision at total hip replacement) and three radiological measurements performed before the surgery (m1, m2 and m3). Measurements m1 and m2 were made under repeatability conditions and measurement m3 was made under reproducibility conditions.
Load the data file in JGR / R and check the data frame:
heads
   number reference m1 m2 m3
1       1        52 54 55 53
2       2        50 50 49 56
3       3        50 51 47 52
4       4        52 53 53 53
5       5        52 50 51 52
6       6        48 49 48 51
7       7        55 53 56 56
8       8        55 52 55 59
9       9        53 54 54 53
10     10        48 47 49 52
11     11        50 51 48 54
12     12        48 47 47 53
13     13        50 49 51 50
14     14        49 50 49 55
15     15        52 51 51 52
16     16        51 53 52 50
17     17        51 50 49 53
18     18        51 52 51 56
19     19        50 51 52 53
20     20        54 56 55 56
21     21        49 48 48 53
22     22        50 51 51 55
23     23        53 52 52 55
24     24        48 48 50 52
25     25        52 52 53 55
26     26        51 51 51 56
27     27        55 53 55 59
28     28        50 53 49 52
29     29        54 52 55 57
30     30        50 52 49 59
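If the heads.rda file is not to hand, the data frame can be rebuilt from the listing above. The sketch below types the values in by hand; it assumes the printed listing is the complete data set. With the file available in the working directory, load("heads.rda") would normally be used instead.

```r
# Rebuild the heads data frame from the printed listing
heads <- data.frame(
  number    = 1:30,
  reference = c(52, 50, 50, 52, 52, 48, 55, 55, 53, 48, 50, 48, 50, 49, 52,
                51, 51, 51, 50, 54, 49, 50, 53, 48, 52, 51, 55, 50, 54, 50),
  m1        = c(54, 50, 51, 53, 50, 49, 53, 52, 54, 47, 51, 47, 49, 50, 51,
                53, 50, 52, 51, 56, 48, 51, 52, 48, 52, 51, 53, 53, 52, 52),
  m2        = c(55, 49, 47, 53, 51, 48, 56, 55, 54, 49, 48, 47, 51, 49, 51,
                52, 49, 51, 52, 55, 48, 51, 52, 50, 53, 51, 55, 49, 55, 49),
  m3        = c(53, 56, 52, 53, 52, 51, 56, 59, 53, 52, 54, 53, 50, 55, 52,
                50, 53, 56, 53, 56, 53, 55, 55, 52, 55, 56, 59, 52, 57, 59)
)

# Check the structure: 30 observations of 5 variables
str(heads)
```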
As there is an accepted reference value (the calliper measurements), first calculate the biases of the measurements:
# Bias = radiological measurement minus calliper (reference) measurement
bias1 <- heads$m1 - heads$reference
bias2 <- heads$m2 - heads$reference
bias3 <- heads$m3 - heads$reference
Are the biases Normally distributed? This can be checked with the Shapiro-Wilk test for Normality:
shapiro.test(bias1)
Shapiro-Wilk normality test
data: bias1
W = 0.9422, p-value = 0.1041
shapiro.test(bias2)
Shapiro-Wilk normality test
data: bias2
W = 0.9421, p-value = 0.1039
shapiro.test(bias3)
Shapiro-Wilk normality test
data: bias3
W = 0.9612, p-value = 0.3315
None of the three tests is significant (all p > 0.05), so there is no evidence against Normality. It is therefore reasonable to use a Normal distribution as a model for the data.
Next, check whether there is bias by performing a one-sample, two-sided t-test of each bias against zero:
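The three Normality checks can also be run in a single call. In this sketch the bias vectors are typed in from the differences between each measurement and the reference value, so that the block runs on its own; p_values is an illustrative name.

```r
# Biases (measurement minus reference), typed in from the data listing
bias1 <- c(2, 0, 1, 1, -2, 1, -2, -3, 1, -1, 1, -1, -1, 1, -1,
           2, -1, 1, 1, 2, -1, 1, -1, 0, 0, 0, -2, 3, -2, 2)
bias2 <- c(3, -1, -3, 1, -1, 0, 1, 0, 1, 1, -2, -1, 1, 0, -1,
           1, -2, 0, 2, 1, -1, 1, -1, 2, 1, 0, 0, -1, 1, -1)
bias3 <- c(1, 6, 2, 1, 0, 3, 1, 4, 0, 4, 4, 5, 0, 6, 0,
           -1, 2, 5, 3, 2, 4, 5, 2, 4, 3, 5, 4, 2, 3, 9)

# Extract the Shapiro-Wilk p-value for each bias vector
p_values <- sapply(list(bias1 = bias1, bias2 = bias2, bias3 = bias3),
                   function(b) shapiro.test(b)$p.value)
print(round(p_values, 4))
```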
t.test(bias1)
One Sample t-test
data: bias1
t = 0.2423, df = 29, p-value = 0.8103
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.4960831 0.6294164
sample estimates:
mean of x
0.06666667
t.test(bias2)
One Sample t-test
data: bias2
t = 0.273, df = 29, p-value = 0.7868
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.4327081 0.5660415
sample estimates:
mean of x
0.06666667
t.test(bias3)
One Sample t-test
data: bias3
t = 7.2677, df = 29, p-value = 5.285e-08
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.131801 3.801532
sample estimates:
mean of x
2.966667
The biases of m1 (bias1) and m2 (bias2), measured under repeatability conditions, are not significantly different from zero (p = 0.81 and p = 0.79, respectively), so there is no evidence that the repeatability measurements are biased. However, the bias of the measurements performed under reproducibility conditions (bias3) is significantly different from zero (p = 0.00000005). The reproducibility measurements are therefore biased by:
mean(bias3)
[1] 2.966667
or about 3 mm: on average, m3 overestimates the calliper measurement.
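To see where the t-test figures come from, the statistic for bias3 can be reproduced by hand. This is a sketch of the standard one-sample calculation; the bias3 values are typed in so that the block runs on its own, and the names se, t_stat, p_value and ci are illustrative.

```r
# bias3 = m3 minus reference, typed in from the data listing
bias3 <- c(1, 6, 2, 1, 0, 3, 1, 4, 0, 4, 4, 5, 0, 6, 0,
           -1, 2, 5, 3, 2, 4, 5, 2, 4, 3, 5, 4, 2, 3, 9)

n  <- length(bias3)
se <- sd(bias3) / sqrt(n)           # standard error of the mean bias

t_stat  <- mean(bias3) / se         # test against a true mean of zero
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

# 95% confidence interval for the mean bias
ci <- mean(bias3) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4))
```

The results should agree with the t.test(bias3) output shown earlier (t = 7.2677 with 29 degrees of freedom).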
To express the repeatability according to the ASTM standard [1]:
Subtract measurement m2 from m1 and call this 'rep':
rep <- heads$m1 - heads$m2
rep
[1] -1 1 4 0 -1 1 -3 -3 0 -2 3 0 -2 1 0 1 1 1 -1 1 0 0 0 -2 -1 0 -2 4 -3 3
So, the repeatability standard deviation (sr) is:
sr <- sd(rep)
sr
[1] 1.893728
and the repeatability (r) is:
repeatability <- qnorm(0.975) * sqrt(2) * sr
repeatability
[1] 5.249051
or 5.2 mm.
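One point is worth checking. Here the standard deviation is taken from the differences between two measurements, and if the two measurement errors are independent with equal variance, the standard deviation of a difference is already the square root of 2 times the standard deviation of a single measurement. Under that assumption, which is not made in the calculation above, a single-measurement standard deviation and the corresponding limit could be sketched as follows. The names rep_diff, sr_single and r_single are illustrative.

```r
# Differences m1 - m2, typed in from the output above
rep_diff <- c(-1, 1, 4, 0, -1, 1, -3, -3, 0, -2, 3, 0, -2, 1, 0,
              1, 1, 1, -1, 1, 0, 0, 0, -2, -1, 0, -2, 4, -3, 3)

sd_diff <- sd(rep_diff)             # SD of the paired differences

# Assumption: independent errors with equal variance, so that
# sd(difference) = sqrt(2) * sd(single measurement)
sr_single <- sd_diff / sqrt(2)

# Limit based on the single-measurement SD;
# algebraically the same as qnorm(0.975) * sd_diff
r_single <- qnorm(0.975) * sqrt(2) * sr_single

print(round(c(sr_single = sr_single, r_single = r_single), 2))  # 1.34 and 3.71
```

On this reading the limit would be about 3.7 mm rather than 5.2 mm. Which figure applies depends on whether sr is taken to be the standard deviation of single results or of paired differences.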
To express the reproducibility according to the ASTM standard [1]:
Subtract measurement m3 from m1 and call this 'repro':
repro <- heads$m1 - heads$m3
repro
[1] 1 -6 -1 0 -2 -2 -3 -7 1 -5 -3 -6 -1 -5 -1 3 -3 -4 -2 0 -5 -4 -3 -4 -3 -5 -6 1 -5 -7
So, the reproducibility standard deviation (sR) is:
sR <- sd(repro)
sR
[1] 2.61758
and the reproducibility (R) is:
reproducibility <- qnorm(0.975) * sqrt(2) * sR
reproducibility
[1] 7.255428
or 7.3 mm.
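The two calculations follow the same steps, so they can be wrapped in a small function. The helper below, precision_limit, is hypothetical (it is not part of R or of the ASTM standard) and simply mirrors the calculation used in this section; the measurement vectors are typed in so that the block runs on its own.

```r
# Measurements typed in from the data listing
m1 <- c(54, 50, 51, 53, 50, 49, 53, 52, 54, 47, 51, 47, 49, 50, 51,
        53, 50, 52, 51, 56, 48, 51, 52, 48, 52, 51, 53, 53, 52, 52)
m2 <- c(55, 49, 47, 53, 51, 48, 56, 55, 54, 49, 48, 47, 51, 49, 51,
        52, 49, 51, 52, 55, 48, 51, 52, 50, 53, 51, 55, 49, 55, 49)
m3 <- c(53, 56, 52, 53, 52, 51, 56, 59, 53, 52, 54, 53, 50, 55, 52,
        50, 53, 56, 53, 56, 53, 55, 55, 52, 55, 56, 59, 52, 57, 59)

# Hypothetical helper: SD of the paired differences and the
# precision limit as calculated in this section
precision_limit <- function(a, b, level = 0.95) {
  d <- a - b                                  # paired differences
  s <- sd(d)                                  # SD of the differences
  z <- qnorm(1 - (1 - level) / 2)             # 1.96 for level = 0.95
  c(sd = s, limit = z * sqrt(2) * s)
}

repeatability_result   <- precision_limit(m1, m2)  # repeatability conditions
reproducibility_result <- precision_limit(m1, m3)  # reproducibility conditions

print(round(repeatability_result, 2))
print(round(reproducibility_result, 2))
```

The limits returned should match the 5.2 mm and 7.3 mm obtained step by step above.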