Statsbook

Sensitivity / Specificity

In medicine, diagnostic tests are used to make a diagnosis. For example, an MRI-scan is performed to diagnose a meniscal tear, or a CT-scan to see if someone has a tarsal coalition. Some diagnostic tests are better than others. An MRI-scan of the knee, for example, is better at diagnosing a meniscal tear than a CT-scan, which in turn is better than a plain radiograph (this of course does not mean one should not request a radiograph of the knee when a meniscal tear is suspected! The radiograph is very helpful in eliminating other causes of knee pain or locking, such as osteochondritis dissecans). If one test is better than another at diagnosing a condition, we would like to know how much better this test is. To do this, we can validate the test against a gold standard.

Five measures of a diagnostic test have been described: sensitivity, specificity, positive predictive value, negative predictive value and accuracy. These five measures allow us to compare different tests. They will be explained with a fictional example.

As an example, let’s look at the value of MRI in diagnosing a meniscal tear in the knee. The MRI needs to be validated against a gold standard (or the ‘truth’ as it is perceived). In this case, diagnostic arthroscopy is the gold standard. There are 100 patients with a suspected meniscal tear. All patients had an MRI-scan that was reviewed by a radiologist. The radiologist reported the scan as either positive or negative for a meniscal tear. The radiologist was not allowed to be indecisive. After the patient had the MRI-scan, a diagnostic arthroscopy was performed. The orthopaedic surgeon who performed the procedure diagnosed whether or not there was a meniscal tear. Other pathology found by the orthopaedic surgeon, such as arthritis, is irrelevant in this context.

Now the radiological diagnosis is validated against the arthroscopic diagnosis. So, we assume that the arthroscopic diagnosis is always correct (surgeon is always right!). There are 4 possible combinations:

  1. The radiologist and orthopaedic surgeon both agree that there is a meniscal tear.
  2. The radiologist diagnosed a meniscal tear but this was not confirmed at arthroscopy (over diagnosis by radiologist).
  3. The radiologist reported the scan as normal, but there was a meniscal tear at arthroscopy (missed diagnosis by radiologist).
  4. They both agree there is no meniscal tear.
                 Arthroscopy   Arthroscopy
                 Positive      Negative
MRI Positive     a             b
MRI Negative     c             d

Please note that the formulas that follow will be different if the rows/columns in the table are changed.

In this table, the gold standard (‘truth’) is in the columns and the test we validate against this standard is in the rows. The result of the MRI-scan is validated against the arthroscopic diagnosis. So, value a is the number of patients who have been correctly diagnosed as having a meniscal tear by MRI-scan. They are called the True Positive scans. Similarly, we can see that value b is the False Positive, value c the False Negative and value d the True Negative scans. Or:

a: True Positive

b: False Positive

c: False Negative

d: True Negative

The five measures of a diagnostic test are now discussed in turn:

  1. Positive Predictive Value
  2. Negative Predictive Value
  3. Sensitivity
  4. Specificity
  5. Accuracy
                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5
MRI Negative     1             45
Total                                        100

Positive Predictive Value (ppv, also called precision):

When the test is positive, what is the probability the person has the condition:

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5             54
MRI Negative     1             45
Total                                        100
\(PPV = \frac{49}{54} \approx 0.907 \)

So, 91% of the patients who had a positive MRI-scan were indeed found to have a meniscal tear at arthroscopy. 9% of the patients with a positive scan did not have a meniscal tear. Patients with a positive MRI-scan are therefore likely to have a meniscal tear (91%). The positive predictive value is the probability that a person who is test positive indeed has the condition. The value ranges from 0 to 100 %. If the positive predictive value is 100%, all test positives are also true positives. In other words, there will be no patients with a false positive test (b=0). If the positive predictive value is 50%, there are as many true positives as there are false positives (a=b). Consequently, a positive test has no value in diagnosing disease. If the positive predictive value is 0%, there are no true positives (a=0), and all people with a positive test are false positives. This does not necessarily mean that the test is useless. It might well be that a negative test is helpful in excluding disease.
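The calculation above can be sketched in a couple of lines (Python is used here purely for illustration; a and b are the table cells as defined earlier):

```python
# PPV for the worked example: a = 49 true positives, b = 5 false positives
a, b = 49, 5
ppv = a / (a + b)
print(round(ppv, 3))  # 0.907
```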

Negative Predictive Value (npv):

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5             54
MRI Negative     1             45            46
Total                                        100
\(NPV = \frac{45}{46} \approx 0.978 \)

So, 98% of the patients who had a negative MRI-scan indeed did not have a meniscal tear at arthroscopy. Only 2% of the patients with a negative scan were found to have a meniscal tear at arthroscopy. Patients with a negative MRI-scan are therefore unlikely to have a meniscal tear. The negative predictive value is the probability that a person who is test negative does not have the condition. The value ranges from 0 to 100 %. If the negative predictive value is 100%, all test negatives are also true negatives. In other words, there will be no patients with a false negative test (c=0). If the negative predictive value is 50%, there are as many true negatives as there are false negatives (c=d). Consequently, a negative test has no value in excluding disease. If the negative predictive value is 0%, there are no true negatives (d=0), and all people with a negative test are false negatives. This does not necessarily mean that the test is useless. It might well be that a positive test is helpful in diagnosing disease.

Sensitivity (also called recall):

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5             54
MRI Negative     1             45            46
Total            50                          100
\(Sensitivity = \frac{49}{50} = 0.98 \)

So, 98% of the patients who were found to have a meniscal tear at arthroscopy had a positive MRI-scan. Only 2% of the patients with a meniscal tear had a negative MRI-scan. Therefore, an MRI-scan is very good at picking up patients who have a meniscal tear. The sensitivity, or true positive rate, describes how good a test is at picking up people with the condition. The value ranges from 0 to 100 %. If the sensitivity is 100%, all patients with the condition test positive. In other words, there are no false negatives (c=0). If the sensitivity is 50%, there are as many true positives as there are false negatives (a=c), indicating that the test has no use in picking up disease. If the sensitivity is 0%, there are no true positives (a=0), and all people with the condition are false negatives. This does not necessarily mean the test is useless. It might well be good at excluding disease.

Specificity:

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5             54
MRI Negative     1             45            46
Total            50            50            100
\(Specificity = \frac{45}{50} = 0.9 \)

So, 90% of the patients who (at arthroscopy) did not have a meniscal tear had a negative MRI-scan. 10% of the patients without a meniscal tear had a positive MRI-scan. Therefore, an MRI-scan is good at correctly excluding patients who do not have a meniscal tear. The specificity, or true negative rate, describes how good a test is at correctly excluding people without the condition. The value ranges from 0 to 100 %. If the specificity is 100%, all patients without the condition test negative. In other words, there are no false positives (b=0). If the specificity is 50%, there are as many true negatives as there are false positives (b=d), indicating that the test has no use in excluding disease. If the specificity is 0%, there are no true negatives (d=0), and all people without the condition are false positives. This does not necessarily mean the test is useless. It might well be good at picking up disease.

Accuracy:

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     49            5             54
MRI Negative     1             45            46
Total            50            50            100
\(Accuracy = \frac{49+45}{100}=0.94 \)

So, in 94% of all MRI-scans performed, the result of the scan was correct. Accuracy ‘combines’ the specificity and the sensitivity of a test. The value is between 0 and 100 %. If the accuracy of a test is 100%, there were no false positives and no false negatives (b=0 and c=0), indicating that the test is very useful. If the accuracy is 50%, there are just as many incorrect as correct results. In other words, the true positives plus true negatives equal the false positives plus false negatives (a+d = b+c). Consequently, the test is useless in diagnosing the disease. If the accuracy is 0%, there are no true positives and no true negatives (a=0 and d=0), indicating that the test is always incorrect! This does not necessarily mean the test is useless. It could be just as useful to know if a test is incorrect as if it is correct.
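All five measures for the worked example can be computed directly from the table cells (a sketch in Python for illustration; a, b, c and d are as defined above):

```python
# 2x2 table for the worked example: a=TP, b=FP, c=FN, d=TN
a, b, c, d = 49, 5, 1, 45

ppv = a / (a + b)                      # 49/54 ~ 0.907
npv = d / (c + d)                      # 45/46 ~ 0.978
sensitivity = a / (a + c)              # 49/50 = 0.98
specificity = d / (b + d)              # 45/50 = 0.90
accuracy = (a + d) / (a + b + c + d)   # 94/100 = 0.94

print(ppv, npv, sensitivity, specificity, accuracy)
```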

Calculating the sensitivity, specificity, positive predictive value and negative predictive value in R is straightforward with the epiR package:

library(epiR)
Package epiR 2.0.84 is loaded
Type help(epi.about) for summary information
Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses


mat <- matrix(c(49,5,1,45),byrow=TRUE, ncol=2)
mat
     [,1] [,2]
[1,]   49    5
[2,]    1   45
epi.tests(mat)
          Outcome +    Outcome -      Total
Test +           49            5         54
Test -            1           45         46
Total            50           50        100

Point estimates and 95% CIs:
--------------------------------------------------------------
Apparent prevalence *                  0.54 (0.44, 0.64)
True prevalence *                      0.50 (0.40, 0.60)
Sensitivity *                          0.98 (0.89, 1.00)
Specificity *                          0.90 (0.78, 0.97)
Positive predictive value *            0.91 (0.80, 0.97)
Negative predictive value *            0.98 (0.88, 1.00)
Positive likelihood ratio              9.80 (4.26, 22.53)
Negative likelihood ratio              0.02 (0.00, 0.16)
False T+ proportion for true D- *      0.10 (0.03, 0.22)
False T- proportion for true D+ *      0.02 (0.00, 0.11)
False T+ proportion for T+ *           0.09 (0.03, 0.20)
False T- proportion for T- *           0.02 (0.00, 0.12)
Correctly classified proportion *      0.94 (0.87, 0.98)
--------------------------------------------------------------
* Exact CIs

Please note that byrow=TRUE is required when the data are entered by row; by default, byrow=FALSE.
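The likelihood ratios reported in the epiR output follow directly from the sensitivity and specificity (a quick check, sketched in Python for illustration):

```python
# Likelihood ratios from sensitivity and specificity
sens = 49 / 50   # 0.98
spec = 45 / 50   # 0.90

lr_pos = sens / (1 - spec)   # 9.8, as in the epiR output
lr_neg = (1 - sens) / spec   # ~0.022, reported as 0.02

print(round(lr_pos, 2), round(lr_neg, 2))
```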

Precision:

Accuracy should not be confused with precision. Precision can have two meanings:

1: Precision

Precision is defined as the closeness of repeated measurements of the same quantity, whilst accuracy is the closeness of a measured variate to its true value. Precision indicates the variability of the estimate over all samples. A precise indicator will have a small variability (small standard deviation). Consequently, the precision is:

\(Precision = \frac{1}{Variance} = \frac{1}{\sigma^2} \)

For example, a person’s mass is 65.2 kg. If the person is repeatedly measured on the electronic scales and the mean mass is 60.00001 kg (standard deviation 0.0000001 kg), the measurement is very precise, but not very accurate.
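The scales example can be sketched as follows (hypothetical readings, chosen to resemble the numbers above; Python's standard statistics module is used):

```python
import statistics

# Hypothetical repeated weighings of a 65.2 kg person on a miscalibrated scale
readings = [60.0000099, 60.0000101, 60.0000100, 60.0000102, 60.0000098]

mean = statistics.mean(readings)               # ~60.00001 kg: not accurate
precision = 1 / statistics.variance(readings)  # very large: very precise
print(round(mean, 5), precision > 1e9)
```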

2: Precision (machine learning)

In machine learning and artificial intelligence, papers often refer to precision and recall. In this context, precision refers to the positive predictive value, whilst recall is the same as the sensitivity (see above).

Validation

Validation is confirmation (by evidence) that the measure can be used consistently for its intended use.
In general:

                 Arthroscopy   Arthroscopy
                 Positive      Negative      Total
MRI Positive     a             b             a + b
MRI Negative     c             d             c + d
Total            a + c         b + d         a + b + c + d

True Positive: a

False Positive: b

False Negative: c

True Negative: d

\( Positive Predictive Value = PPV = \frac{a}{a+b} \)
\( Negative Predictive Value = NPV = \frac{d}{c+d} \)
\( Sensitivity = Sens = \frac{a}{a+c} \)
\( Specificity = Spec = \frac{d}{b+d} \)
\( Accuracy= Acc = \frac{a + d}{a+b+c+d} \)
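The general formulas can also be wrapped in a small helper (a sketch in Python for illustration; the function name is made up here):

```python
def diagnostic_measures(a, b, c, d):
    """Five measures from a 2x2 table laid out as in the text:
    a = TP, b = FP, c = FN, d = TN (gold standard in columns, test in rows)."""
    n = a + b + c + d
    return {
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "accuracy": (a + d) / n,
    }

# Worked MRI example from earlier in the chapter
print(diagnostic_measures(49, 5, 1, 45))
```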

Or in R:

library(epiR)
mat <- matrix(c(a,b,c,d), byrow=TRUE, ncol=2)  # enter values by row
epi.tests(mat)
          Disease +    Disease -      Total
Test +           a            b         
Test -           c            d         
Total

Point estimates and 95 % CIs:
---------------------------------------------------------
Apparent prevalence        
True prevalence                   
Sensitivity                         
Specificity                 
Positive predictive value   
Negative predictive value        
Positive likelihood ratio     
Negative likelihood ratio    
---------------------------------------------------------

In the table, the gold standard (‘truth’) is in the columns and the test we validate against this standard is in the rows. It is important to realise that the formulas will be different if we change the columns and rows. It is therefore not advisable to learn the formulas off by heart; it is better to approach it systematically. It should also be clear from the above that any of the five performance measures on their own are of limited value. The accuracy is often selected as a collective measure. However, when combining ratios, it is better to calculate the harmonic mean (the F1 score). In general, it is better to look at the 2 × 2 table and review the numbers in the context of what is required (sometimes a high sensitivity is required, but at other times a high specificity). This is further illustrated in two examples.
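The harmonic mean of the PPV (precision) and the sensitivity (recall) is the F1 score. A minimal sketch in Python, using the worked MRI example:

```python
def f1_score(ppv, sensitivity):
    # Harmonic mean of precision (PPV) and recall (sensitivity)
    return 2 * ppv * sensitivity / (ppv + sensitivity)

# Worked MRI example: PPV = 49/54, sensitivity = 49/50
print(round(f1_score(49 / 54, 49 / 50), 3))  # 0.942
```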

Example 1:

Very sensitive test; fire alarm:

                 Truth         Truth
                 Positive      Negative      Total
Test Positive    1             39            40
Test Negative    0             60            60
Total            1             99            100

True Positive: 1, False Positive: 39, False Negative: 0, True Negative: 60

ppv = 2.5%, npv = 100%, Sensitivity = 100%, Specificity ≈ 61%, Accuracy = 61%

Or in R:

library(epiR)
mat <- matrix(c(1,39,0,60), byrow=TRUE, ncol=2)
epi.tests(mat)
          Outcome +    Outcome -      Total
Test +            1           39         40
Test -            0           60         60
Total             1           99        100

Point estimates and 95% CIs:
--------------------------------------------------------------
Apparent prevalence *                  0.40 (0.30, 0.50)
True prevalence *                      0.01 (0.00, 0.05)
Sensitivity *                          1.00 (0.03, 1.00)
Specificity *                          0.61 (0.50, 0.70)
Positive predictive value *            0.03 (0.00, 0.13)
Negative predictive value *            1.00 (0.94, 1.00)
Positive likelihood ratio              2.54 (1.99, 3.24)
Negative likelihood ratio              0.00 (0.00, NaN)
False T+ proportion for true D- *      0.39 (0.30, 0.50)
False T- proportion for true D+ *      0.00 (0.00, 0.97)
False T+ proportion for T+ *           0.97 (0.87, 1.00)
False T- proportion for T- *           0.00 (0.00, 0.06)
Correctly classified proportion *      0.61 (0.51, 0.71)
--------------------------------------------------------------
* Exact CIs

Example 2:

Very specific test; being caught for speeding:

                 Truth         Truth
                 Positive      Negative      Total
Test Positive    1             0             1
Test Negative    39            60            99
Total            40            60            100

True Positive: 1, False Positive: 0, False Negative: 39, True Negative: 60

ppv = 100%, npv ≈ 61%, Sensitivity = 2.5%, Specificity = 100%, Accuracy = 61%

Or in R:

library(epiR)
mat <- matrix(c(1,0,39,60), byrow=TRUE, ncol=2)
epi.tests(mat)
          Outcome +    Outcome -      Total
Test +            1            0          1
Test -           39           60         99
Total            40           60        100

Point estimates and 95% CIs:
--------------------------------------------------------------
Apparent prevalence *                  0.01 (0.00, 0.05)
True prevalence *                      0.40 (0.30, 0.50)
Sensitivity *                          0.03 (0.00, 0.13)
Specificity *                          1.00 (0.94, 1.00)
Positive predictive value *            1.00 (0.03, 1.00)
Negative predictive value *            0.61 (0.50, 0.70)
Positive likelihood ratio              Inf (NaN, Inf)
Negative likelihood ratio              0.97 (0.93, 1.02)
False T+ proportion for true D- *      0.00 (0.00, 0.06)
False T- proportion for true D+ *      0.97 (0.87, 1.00)
False T+ proportion for T+ *           0.00 (0.00, 0.97)
False T- proportion for T- *           0.39 (0.30, 0.50)
Correctly classified proportion *      0.61 (0.51, 0.71)
--------------------------------------------------------------
* Exact CIs
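The two examples can be compared side by side: both tests have the same accuracy (61%) yet opposite sensitivity/specificity profiles, which is exactly why the 2 × 2 table should be reviewed in context (a sketch in Python for illustration):

```python
def summary(a, b, c, d):
    # a=TP, b=FP, c=FN, d=TN, laid out as in the tables above
    n = a + b + c + d
    return {"sens": a / (a + c), "spec": d / (b + d), "acc": (a + d) / n}

fire_alarm = summary(1, 39, 0, 60)   # very sensitive, not specific
speeding   = summary(1, 0, 39, 60)   # very specific, not sensitive
print(fire_alarm, speeding)
```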

It is important to bear in mind that a test can be sensitive for one purpose, but not necessarily for another. For example, a bone scan is very sensitive in picking up abnormalities such as fractures and infections. However, it is not very helpful in picking up multiple myeloma. For that purpose, it would be better to use an MRI scan of the marrow areas. If a test is used for screening, it is very important to make sure it has a high sensitivity. It is obviously unsatisfactory to miss disease with a screening investigation. If this investigation is not very specific, subsequent investigations can be performed to increase diagnostic accuracy (eliminate the false positives).

All tests have their limitations, and the most appropriate investigation should be selected for what is being investigated. Sometimes, a combination of investigations is used. Usually, the simplest and most sensitive investigations are performed first, followed by the more specific investigations to increase diagnostic accuracy.

To create radar or spider-web plots of different tests, please refer to the radar plots page.