In medicine, diagnostic tests are used to make a diagnosis. For example, an MRI-scan is performed to diagnose a meniscal tear, or a CT-scan to see if someone has a tarsal coalition. Some diagnostic tests are better than others. An MRI-scan of the knee, for example, is better at diagnosing a meniscal tear than a CT-scan, which in turn is better than a plain radiograph. (This of course does not mean one should not request a radiograph of the knee when a meniscal tear is suspected! The radiograph is very helpful in eliminating other causes of knee pain or locking, such as osteochondritis dissecans.) If one test is better than another at diagnosing a condition, we would like to know how much better it is. To do this, we can validate the test against a gold standard.
Five features of a diagnostic test have been described. These are sensitivity, specificity, positive predictive value, negative predictive value and accuracy. These five features allow us to compare different tests. They will be explained with a fictional example.
As an example, let's look at the value of MRI in diagnosing a meniscal tear in the knee. The MRI needs to be validated against a gold standard (or the ‘truth’ as it is perceived). In this case, diagnostic arthroscopy is the gold standard. There are 100 patients with a suspected meniscal tear. All patients had an MRI-scan that was reviewed by a radiologist. The radiologist reported the scan as either positive or negative for a meniscal tear. The radiologist was not allowed to be indecisive. After the patient had the MRI-scan, a diagnostic arthroscopy was performed. The orthopaedic surgeon who performed the procedure recorded whether or not there was a meniscal tear. Other pathology found by the orthopaedic surgeon, such as arthritis, is irrelevant in this context.
Now the radiological diagnosis is validated against the arthroscopic diagnosis. So, we assume that the arthroscopic diagnosis is always correct (surgeon is always right!). There are 4 possible combinations:
- The radiologist and orthopaedic surgeon both agree that there is a meniscal tear.
- The radiologist diagnosed a meniscal tear, but this was not confirmed at arthroscopy (overdiagnosis by the radiologist).
- The radiologist reported the scan as normal, but there was a meniscal tear at arthroscopy (missed diagnosis by radiologist).
- They both agree there is no meniscal tear.
These four possibilities are summarised in a 2 × 2 table:
| | Arthroscopy Positive | Arthroscopy Negative |
|---|---|---|
| MRI Positive | a | b |
| MRI Negative | c | d |
In this table, the gold standard (‘truth’) is in the columns and the test we validate against this standard is in the rows. It is important to realise that the formulas that follow will be different if this is changed. The result of the MRI-scan is validated against the arthroscopic diagnosis. So, value a is the number of patients who have been correctly diagnosed as having a meniscal tear by MRI-scan. They are called the True Positive scans. Similarly, value b is the number of False Positive scans, value c the number of False Negative scans and value d the number of True Negative scans. Or:
a: True Positive
b: False Positive
c: False Negative
d: True Negative
The five features of a diagnostic test are discussed in turn:
- Positive Predictive Value
- Negative Predictive Value
- Sensitivity
- Specificity
- Accuracy
The values of a, b, c and d are substituted in the table:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | | | 100 |
| MRI Positive | 49 | 5 | |
| MRI Negative | 1 | 45 | |
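In practice, the counts in such a table usually come from one record per patient. The following is a minimal base R sketch of that step. The per-patient vectors `mri` and `arthroscopy` are hypothetical and are constructed here only to reproduce the counts in the fictional example.

```r
# Hypothetical per-patient results, built to match the fictional example:
# 49 true positives, 5 false positives, 1 false negative, 45 true negatives.
lev <- c("Positive", "Negative")
mri <- factor(c(rep("Positive", 54), rep("Negative", 46)), levels = lev)
arthroscopy <- factor(c(rep("Positive", 49), rep("Negative", 5),
                        rep("Positive", 1), rep("Negative", 45)),
                      levels = lev)

# Test in the rows, gold standard in the columns, as in the text
tab <- table(MRI = mri, Arthroscopy = arthroscopy)

# Show the table with row and column totals
addmargins(tab)

# Pick out the four cells by name
tp <- tab["Positive", "Positive"]  # a: true positives
fp <- tab["Positive", "Negative"]  # b: false positives
fn <- tab["Negative", "Positive"]  # c: false negatives
tn <- tab["Negative", "Negative"]  # d: true negatives
```

Fixing the factor levels keeps "Positive" in the first row and column, so the layout matches the table above rather than R's default alphabetical order.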
Positive Predictive Value (ppv):
When the test is positive, what is the probability the person has the condition:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | | | 100 |
| MRI Positive | 49 | 5 | 54 |
| MRI Negative | 1 | 45 | |

ppv = a / (a + b) = 49 / 54 ≈ 91%
So, 91% of the patients who had a positive MRI-scan were indeed found to have a meniscal tear at arthroscopy. 9% of the patients with a positive scan did not have a meniscal tear. Patients with a positive MRI-scan are therefore likely to have a meniscal tear (91%). The positive predictive value is the probability that a person who is test positive indeed has the condition. The value ranges from 0 to 100 %. If the positive predictive value is 100%, all test positives are also true positives. In other words, there will be no patients with a false positive test (b=0). If the positive predictive value is 50%, there are as many true positives as there are false positives (a=b). Consequently, a positive test has no value in diagnosing disease. If the positive predictive value is 0%, there are no true positives (a=0), and all people with a positive test are false positives. This does not necessarily mean that the test is useless. It might well be that a negative test is helpful in excluding disease.
Negative Predictive Value (npv):
When the test is negative, what is the probability the person does not have the condition:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | | | 100 |
| MRI Positive | 49 | 5 | 54 |
| MRI Negative | 1 | 45 | 46 |

npv = d / (c + d) = 45 / 46 ≈ 98%
So, 98% of the patients who had a negative MRI-scan indeed did not have a meniscal tear at arthroscopy. Only 2% of the patients with a negative scan were found to have a meniscal tear at arthroscopy. Patients with a negative MRI-scan are therefore unlikely to have a meniscal tear. The negative predictive value is the probability that a person who is test negative does not have the condition. The value ranges from 0 to 100 %. If the negative predictive value is 100%, all test negatives are also true negatives. In other words, there will be no patients with a false negative test (c=0). If the negative predictive value is 50%, there are as many true negatives as there are false negatives (c=d). Consequently, a negative test has no value in excluding disease. If the negative predictive value is 0%, there are no true negatives (d=0), and all people with a negative test are false negatives. This does not necessarily mean that the test is useless. It might well be that a positive test is helpful in diagnosing disease.
Sensitivity:
Given the person has the condition, how likely the test will be positive:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | 50 | | 100 |
| MRI Positive | 49 | 5 | 54 |
| MRI Negative | 1 | 45 | 46 |

Sensitivity = a / (a + c) = 49 / 50 = 98%
So, 98% of the patients who were found to have a meniscal tear at arthroscopy had a positive MRI-scan. Only 2% of the patients with a meniscal tear had a negative MRI-scan. Therefore, an MRI-scan is very good at picking up patients who have a meniscal tear. The sensitivity, or true positive rate, describes how good a test is at picking up people with the condition. The value ranges from 0 to 100%. If the sensitivity is 100%, all people with the condition test positive. In other words, there are no false negatives (c=0). If the sensitivity is 50%, there are as many true positives as there are false negatives (a=c), indicating that the test has no use in picking up disease. If the sensitivity is 0%, there are no true positives (a=0), and all people with the condition are false negatives. This does not necessarily mean the test is useless. It might well be good in excluding disease.
Specificity:
Given the person does not have the condition, how likely the test will be negative:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | 50 | 50 | 100 |
| MRI Positive | 49 | 5 | 54 |
| MRI Negative | 1 | 45 | 46 |

Specificity = d / (b + d) = 45 / 50 = 90%
So, 90% of the patients who (at arthroscopy) did not have a meniscal tear had a negative MRI-scan. 10% of the patients without a meniscal tear had a positive MRI-scan. Therefore, an MRI-scan is good at correctly identifying patients who do not have a meniscal tear. The specificity, or true negative rate, describes how good a test is at correctly excluding people without the condition. The value ranges from 0 to 100%. If the specificity is 100%, all people without the condition test negative. In other words, there are no false positives (b=0). If the specificity is 50%, there are as many true negatives as there are false positives (b=d), indicating that the test has no use in excluding disease. If the specificity is 0%, there are no true negatives (d=0), and all people without the condition are false positives. This does not necessarily mean the test is useless. It might well be good in picking up disease.
Accuracy:
The probability that the test result is correct:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | 50 | 50 | 100 |
| MRI Positive | 49 | 5 | 54 |
| MRI Negative | 1 | 45 | 46 |

Accuracy = (a + d) / (a + b + c + d) = (49 + 45) / 100 = 94%
So, in 94% of all MRI-scans performed, the result of the scan was correct. Accuracy ‘combines’ the specificity and the sensitivity of a test. The value is between 0 and 100%. If the accuracy of a test is 100%, there were no false positives and no false negatives (b=0 and c=0), indicating that the test is very useful. If the accuracy is 50%, there are just as many incorrect as correct results. In other words, the true positives plus true negatives equal the false positives plus false negatives (a+d = b+c). Consequently, the test is useless in diagnosing the disease. If the accuracy is 0%, there are no true positives and no true negatives (a=0 and d=0), indicating that the test is always incorrect! This does not necessarily mean the test is useless. It could be just as useful to know if a test is incorrect as if it is correct.
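The five calculations above can be collected in a few lines of base R. This is only a sketch: `diag_features` is a hypothetical helper name, and its arguments `tp`, `fp`, `fn` and `tn` stand for the cells a, b, c and d of the table (test in rows, gold standard in columns).

```r
# Hypothetical helper: five features of a diagnostic test from a 2 x 2 table
# with the test in the rows and the gold standard in the columns.
# tp = a, fp = b, fn = c, tn = d
diag_features <- function(tp, fp, fn, tn) {
  c(ppv         = tp / (tp + fp),
    npv         = tn / (fn + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (fp + tn),
    accuracy    = (tp + tn) / (tp + fp + fn + tn))
}

# Fictional MRI example from the text
mri_features <- diag_features(tp = 49, fp = 5, fn = 1, tn = 45)

# Express as percentages, rounded to whole numbers
round(100 * mri_features)
```

For the fictional MRI data this reproduces the rounded percentages worked out above: 91, 98, 98, 90 and 94.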
Calculating the sensitivity, specificity, positive predictive value and negative predictive value in R / JGR is straightforward with the epiR package [1]:
library(epiR)
# Values are entered column by column: a, c, b, d
mat <- matrix(c(49, 1, 5, 45), ncol = 2)
mat
[,1] [,2]
[1,] 49 5
[2,] 1 45
epi.tests(mat)
Disease + Disease – Total
Test + 49 5 54
Test – 1 45 46
Total 50 50 100
Point estimates and 95 % CIs:
———————————————————
Apparent prevalence 0.54 (0.44, 0.64)
True prevalence 0.50 (0.40, 0.60)
Sensitivity 0.98 (0.89, 1.00)
Specificity 0.90 (0.78, 0.97)
Positive predictive value 0.91 (0.80, 0.97)
Negative predictive value 0.98 (0.88, 1.00)
Positive likelihood ratio 9.80 (4.26, 22.53)
Negative likelihood ratio 0.02 (0.00, 0.16)
———————————————————
Please note the values need to be entered in columns rather than rows!
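The reason is that `matrix()` fills column by column unless told otherwise. A small base R sketch (the object names are illustrative only) shows the two equivalent ways of entering the table, and what happens if the values are typed row by row without `byrow = TRUE`:

```r
# Column by column (the default): a, c, b, d
by_column <- matrix(c(49, 1, 5, 45), ncol = 2)

# Row by row: a, b, c, d, with byrow = TRUE
by_row <- matrix(c(49, 5, 1, 45), ncol = 2, byrow = TRUE)

# Both give the same table
identical(by_column, by_row)

# Row-wise values WITHOUT byrow = TRUE: the false positives and
# false negatives swap places, i.e. the table is transposed
mistaken <- matrix(c(49, 5, 1, 45), ncol = 2)
mistaken
```

With the transposed table, the 5 false positives would be treated as false negatives and vice versa, so it is worth printing the matrix and checking it against the 2 × 2 table before going further.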
Precision:
Accuracy should not be confused with precision. Precision is defined as the closeness of repeated measurements of the same quantity to one another, whilst accuracy is the closeness of a measured variate to its true value. Precision indicates the variability of the estimate over all samples. A precise indicator will have a small variability (small standard deviation).
For example, a person’s mass is 65.2 kg. If the person is repeatedly measured on the electronic scales and the mean mass is 60.00001 kg (standard deviation 0.0000001 kg), the measurement is very precise, but not very accurate.
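The scales example can be mimicked with simulated readings. This is a sketch under assumptions: the 20 readings are drawn from a normal distribution with the mean and standard deviation quoted above, which the text does not specify, and the variable names are illustrative.

```r
set.seed(1)  # make the simulated readings reproducible

true_mass <- 65.2  # the person's actual mass in kg

# Assumed: 20 repeated readings, normally distributed around 60.00001 kg
readings <- rnorm(20, mean = 60.00001, sd = 0.0000001)

# Accuracy: how far the average reading is from the true value
bias <- mean(readings) - true_mass

# Precision: how much the readings vary among themselves
spread <- sd(readings)

bias
spread
```

The spread is tiny (high precision) while the bias is about 5 kg (poor accuracy), which is the distinction drawn above.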
Validation
Validation is confirmation (by evidence) that the measure can be used consistently for its intended use.
In general:
| | Arthroscopy Positive | Arthroscopy Negative | Total |
|---|---|---|---|
| Total | a + c | b + d | a + b + c + d |
| MRI Positive | a | b | a + b |
| MRI Negative | c | d | c + d |
True Positive: a
False Positive: b
False Negative: c
True Negative: d
Positive Predictive Value: a / (a + b)
Negative Predictive Value: d / (c + d)
Sensitivity: a / (a + c)
Specificity: d / (b + d)
Accuracy: (a + d) / (a + b + c + d)
Or in R / JGR:
library(epiR)
mat <- matrix(c(a, c, b, d), ncol = 2)  # enter the values column by column: a, c, b, d
epi.tests(mat)
Disease + Disease – Total
Test + a b
Test – c d
Total
Point estimates and 95 % CIs:
———————————————————
Apparent prevalence
True prevalence
Sensitivity
Specificity
Positive predictive value
Negative predictive value
Positive likelihood ratio
Negative likelihood ratio
———————————————————
In the table, the gold standard (‘truth’) is in the columns and the test we validate against this standard is in the rows. It is important to realise that the formulas will be different if we swap the columns and rows. It is therefore not advisable to learn the formulas off by heart. It is better to approach it systematically. It should also be clear from the previous sections that any of the five parameters discussed is of limited value on its own. If one wants to look at just one parameter, the accuracy is the most informative. However, it is better to look at the 2 × 2 table and calculate all parameters. This is further illustrated in two examples.
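Before the examples, the point about rows and columns can be checked with a short base R sketch using the fictional MRI data (the object names are illustrative). Applying the sensitivity formula by position to a table whose rows and columns have been swapped quietly returns a different quantity:

```r
# Test in the rows, gold standard in the columns (layout used in the text)
tab <- matrix(c(49, 1, 5, 45), ncol = 2,
              dimnames = list(MRI = c("Positive", "Negative"),
                              Arthroscopy = c("Positive", "Negative")))

# Sensitivity = a / (a + c): first cell over the first column total
sens <- tab[1, 1] / sum(tab[, 1])

# Same table with rows and columns swapped (gold standard now in the rows)
flipped <- t(tab)

# The same positional formula now divides by a + b, which is the ppv
not_sens <- flipped[1, 1] / sum(flipped[, 1])

c(sensitivity = sens, same_formula_on_flipped_table = not_sens)
```

The first value is 49 / 50 and the second is 49 / 54, so the layout of the table has to be settled before any formula is applied.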
Example 1:
Very sensitive test (fire alarm):
| | Truth Positive | Truth Negative | Total |
|---|---|---|---|
| Total | 1 | 99 | 100 |
| Test Positive | 1 | 39 | 40 |
| Test Negative | 0 | 60 | 60 |
True Positive: 1, False Positive: 39, False Negative: 0, True Negative: 60
ppv = 2.5%, npv = 100%, Sensitivity = 100%, Specificity ≈ 61%, Accuracy = 61%
Or in R / JGR:
library(epiR)
mat<-matrix(c(1,0,39,60),ncol=2)
epi.tests(mat)
Disease + Disease – Total
Test + 1 39 40
Test – 0 60 60
Total 1 99 100
Point estimates and 95 % CIs:
———————————————————
Apparent prevalence 0.40 (0.30, 0.50)
True prevalence 0.01 (0.00, 0.05)
Sensitivity 1.00 (0.01, 1.00)
Specificity 0.61 (0.50, 0.70)
Positive predictive value 0.03 (0.00, 0.13)
Negative predictive value 1.00 (0.91, 1.00)
Positive likelihood ratio 2.54 (1.99, 3.24)
Negative likelihood ratio 0.00 (0.00, NaN)
———————————————————
Example 2:
Very specific test (being caught for speeding):
| | Truth Positive | Truth Negative | Total |
|---|---|---|---|
| Total | 40 | 60 | 100 |
| Test Positive | 1 | 0 | 1 |
| Test Negative | 39 | 60 | 99 |
True Positive: 1, False Positive: 0, False Negative: 39, True Negative: 60
ppv = 100%, npv ≈ 61%, Sensitivity = 2.5%, Specificity = 100%, Accuracy = 61%
Or in R / JGR:
library(epiR)
mat<-matrix(c(1,39,0,60),ncol=2)
epi.tests(mat)
Disease + Disease – Total
Test + 1 0 1
Test – 39 60 99
Total 40 60 100
Point estimates and 95 % CIs:
———————————————————
Apparent prevalence 0.01 (0.00, 0.05)
True prevalence 0.40 (0.30, 0.50)
Sensitivity 0.03 (0.00, 0.13)
Specificity 1.00 (0.91, 1.00)
Positive predictive value 1.00 (0.01, 1.00)
Negative predictive value 0.61 (0.50, 0.70)
Positive likelihood ratio Inf (NaN, Inf)
Negative likelihood ratio 0.97 (0.93, 1.02)
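The two examples can be put side by side in base R to show why a single parameter is of limited value. This is a sketch only; `features` is a hypothetical helper, with `tp`, `fp`, `fn` and `tn` standing for the cells a, b, c and d.

```r
# Hypothetical helper returning the five features as proportions
features <- function(tp, fp, fn, tn) {
  c(ppv         = tp / (tp + fp),
    npv         = tn / (fn + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (fp + tn),
    accuracy    = (tp + tn) / (tp + fp + fn + tn))
}

fire_alarm <- features(tp = 1, fp = 39, fn = 0,  tn = 60)  # Example 1
speed_trap <- features(tp = 1, fp = 0,  fn = 39, tn = 60)  # Example 2

# One row per example, values as percentages
comparison <- round(100 * rbind(fire_alarm, speed_trap), 1)
comparison
```

Both tests have an accuracy of 61%, yet one has a sensitivity of 100% and the other 2.5%, which only becomes visible when all the parameters are calculated from the 2 × 2 table.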
It is important to bear in mind that a test can be sensitive for one purpose, but not necessarily for another. For example, a bone scan is very sensitive in picking up abnormalities such as fractures and infections. However, it is not very helpful in picking up multiple myeloma. For that purpose, it would be better to use a bone marrow biopsy or MRI scan of the marrow areas. If a test is used for screening, it is very important to make sure it has a high sensitivity. It is obviously unsatisfactory to miss disease with a screening investigation. If this investigation is not very specific, subsequent investigations can be performed to increase diagnostic accuracy (eliminate the false positives).
All tests have their limitations, and the most appropriate investigation should be selected for what is being investigated. Sometimes, a combination of investigations is used. Usually, the simplest and most sensitive investigations are performed first, followed by the more specific investigations to increase diagnostic accuracy.
To create radar or spider-web plots of different tests, please refer to the radar plots page.
1. Stevenson M, Nunes T, Heuer C, Marshall J, Sanchez J, Thornton R, et al. epiR: Tools for the Analysis of Epidemiological Data [Internet]. 2015. (R package). Available from: http://cran.r-project.org/package=epiR