Statistics deals with uncertainty. When we make a statement, we also need to say how certain we are that it is correct. We would like this certainty to be 100%, but realise that this is not possible.
The decision to reject a null hypothesis is made on the basis of a statistical test. Which statistical test is used depends on the distribution of the outcome variable, and the test can be parametric or non-parametric. A parametric test (e.g. the t-test) can only be used if the data are Normally distributed. If the data are not Normally distributed, parametric statistics cannot be used and we have to use a non-parametric test (e.g. the Wilcoxon test).
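As a sketch of this workflow in R, using simulated data purely for illustration (the group names and values below are made up): the Shapiro-Wilk test is one common way to check Normality, after which either the parametric or the non-parametric test is chosen.

```r
# Hypothetical example: compare a measurement between two groups.
# The data are simulated for illustration only.
set.seed(1)
group_a <- rnorm(30, mean = 120, sd = 15)
group_b <- rnorm(30, mean = 128, sd = 15)

# Check Normality in each group (Shapiro-Wilk test)
shapiro.test(group_a)
shapiro.test(group_b)

# If the data look Normally distributed, use the parametric t-test:
t.test(group_a, group_b)

# If not, use the non-parametric Wilcoxon (Mann-Whitney) test instead:
wilcox.test(group_a, group_b)
```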
A statistical test calculates a test statistic, which differs between tests. For example, a parametric t-test calculates a t value and a non-parametric Wilcoxon test calculates a W value. The p value is the probability, assuming the null hypothesis is true, that the test statistic takes a value at least as extreme as the one observed; it is compared against a threshold to decide whether or not to reject the null hypothesis.
P value = probability, under the null hypothesis, that the test statistic takes a value at least as extreme as the one observed
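In R, both the test statistic and the p value can be read directly from the test result. The data below are simulated solely to illustrate this:

```r
# Illustrative only: two simulated groups
set.seed(2)
x <- rnorm(25, mean = 0)
y <- rnorm(25, mean = 1)

result <- t.test(x, y)
result$statistic  # the t value (the test statistic)
result$p.value    # probability of a test statistic at least this
                  # extreme if the null hypothesis were true
```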
It is generally accepted (in medical statistics) that a result is 'proven', or statistically significant, if this probability is less than 5%. On this basis, the null hypothesis is rejected or retained.
In medical statistics we are usually satisfied that something is statistically significant if p < 5%. In that case we consider the result unlikely to be due to chance, and the null hypothesis is rejected in favour of the alternative hypothesis.
Statistically significant: p value < 5%.
Please bear in mind that the p value indicates how incompatible the data are with a statistical model. P values do NOT measure the probability that the studied hypothesis is true, and do NOT measure the size of an effect (the alternative hypothesis is not 'more true' if the p value is lower)1. Please see also p values: their use and abuse.
It is customary to round the p value to three decimal places. However, when p is less than one in a thousand, it is reported as p < 0.001.
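R has built-in helpers for both conventions; the p values below are arbitrary examples:

```r
# Round to three decimal places
round(0.04678, 3)                 # 0.047

# format.pval() reports values below eps as "<eps"
format.pval(0.0004, eps = 0.001)  # "<0.001"
```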
That something is statistically significant doesn’t necessarily mean it is also clinically significant. It might well be that, although statistically there is a difference, it is of no clinical importance.
Also, if we were unable to demonstrate a statistically significant difference, this does not mean there is no difference. It might well be that with more patients in our study we could demonstrate a significant difference (an underpowered study; type 2 error).
If the significance threshold is set at 5%, there is a probability of 1 in 20 that we draw the wrong conclusion (incorrectly reject a true null hypothesis; type 1 error). However, this is generally a reasonable trade-off in clinical studies.
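This 1-in-20 error rate can be illustrated with a small simulation: when the null hypothesis is true by construction (both groups drawn from the same distribution), roughly 5% of tests are still 'significant' by chance.

```r
# Simulation sketch: both groups come from the same distribution,
# so every 'significant' result here is a type 1 error
set.seed(3)
p_values <- replicate(5000, t.test(rnorm(20), rnorm(20))$p.value)

mean(p_values < 0.05)  # close to 0.05, i.e. roughly 1 in 20
```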
P value correction
In a large data set there may be many variables (columns when the data are in tidy format), giving the opportunity to run multiple tests until something turns up that is 'significant'. However, when performing multiple tests it is important to bear in mind that, at a 5% threshold, on average one in twenty tests will be 'significant' by chance alone. To address this, p values should be corrected for the number of tests that have been performed. A number of methods have been described, including Bonferroni's and Holm's, and are included in R.
For example, when comparing two groups of patients, multiple comparison tests were performed on many variables, including the patients' height. The height variable had a p value of 0.001 and was considered 'significant'. However, this should be evaluated in view of the number of tests performed. To correct the p value using Bonferroni's method in R:
p <- 0.001  # the unadjusted p value
p.adjust(p, method = "bonferroni", n = 20)
[1] 0.02
p.adjust(p, method = "bonferroni", n = 50)
[1] 0.05
p.adjust(p, method = "bonferroni", n = 100)
[1] 0.1
n is the number of tests that have been performed
It can be seen that, if 50 or more tests had been performed, this p value could no longer be regarded as 'significant'.
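When the p values from all the tests are available, `p.adjust` can correct them in one call; the values below are hypothetical p values from five tests on the same data set.

```r
# Hypothetical p values from five separate tests
p_values <- c(0.001, 0.012, 0.030, 0.045, 0.600)

p.adjust(p_values, method = "bonferroni")  # multiplies each by the number of tests
p.adjust(p_values, method = "holm")        # Holm's step-down method, less conservative
```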
Fishing for p values
It is important that data are evaluated from a clinical perspective and make clinical sense. There is no place for 'fishing for p values' until one is 'significant'. On its own, a p value has little importance2. A p value is only part of the data evaluation, and reporting should be done with full transparency and further evidence to justify the conclusions.