Background: In the pyramid of evidence based medicine randomized controlled trials (RCTs) are considered to be one of the most reliable study designs when evaluating the cause and effect of treatment modalities. When evaluating randomized controlled trials, we often look for statistical significance of a study to determine if a treatment has an effect. Statistical significance means that the result of a study is unlikely to occur by chance alone. The value assigned to a statistically significant result is typically a p-value less than 0.05.
What are the Major Issues with Using a p-value to Determine Statistical Significance?
- A p-value does not take the size of the trial or the number of events in the trial into account (i.e. p value of <0.05 does not indicate the same likelihood of being real)
- A p-value of 0.051 is considered not statistically significant while a p value of 0.049 or less is considered statistically significant, even though there is a small absolute difference between these values
- Statistically significant isn’t the same as clinically significant. If a large (n = 50,000) patient study is performed on BP lowering, you may find that drug X lowers BP on average by 5 mm Hg in comparison to drug Y with a p < 0.05, but that difference is unlikely to be clinically significant
In the paper by Walsh et al [1] they give a great example of a RCT that has 200 patients. 100 patients receive treatment A and 100 patients receive placebo. Let’s say that fewer people with treatment A have a bad outcome vs treatment B, 1 vs 9 respectively. The p-value for this would be 0.02 and would be considered statistically significant. If only one more patient in treatment group A, had a bad outcome this would change the p-value to 0.06, which would no longer be statistically significant. Obviously in larger RCTs, a change in outcome of one event would have a smaller influence and not have a significant impact on the p-value.
So knowing the minimum number of additional events that would need to occur to change a statistical significant result to non-statistically significant could be used to determine the fragility of a study, which from this point forward will be called the Fragility Index.
How do you Calculate a Fragility Index?
- Place the results of the trial into a two-by-two contingency table
- Now add an event from the group with the smaller number of events (and subtracting a nonevent from the same group to keep the total number of patients constant)
- Recalculate the two-side P-value for Fisher’s exact test
- Keep adding events until the first the p-value becomes equal to or greater than 0.05
- The number of additional events required to obtain a p-value of ≥ 0.05 is the Fragility Index
- A higher value indicates a less fragile result, while a lower value indicates less statistically robust result
What is the Evidence to Use a Fragility Index for RCTs?
Fragility Index of RCTs Published in Major Journals [1]
What They Did:
- Identified RCTs with statistically significant results for at least one dichotomous outcome in high impact journals
- Included Journals: NEJM, The Lancet, JAMA, BMJ, and Annals of Internal Medicine
- Only selected trials that met 3 criteria:
- Two parallel arm or two-by-two factorial design RCTs involving humans (i.e. cluster RCTs, crossover RCTs, and >2 parallel arm designs were excluded)
- Allocated participants in a 1 to 1 ratio of treatment and control
- The abstract, reported at least on dichotomous or time-to-event outcome as significant (p <0.05 or a 95% CI that excluded the null value) under a null hypothesis that no difference existed
- Calculated the Fragility Index for each trial
Results:
- 399 RCTs Eligible
- Median Sample Size: 682 patients (Range: 15 – 112,604)
- Median Events: 112.9 (Range: 8 – 5,142)
Limitations:
- This was a convenience sample of trials published between 2004 – 2010 which will miss a lot of RCTs
- Fragility Index only applies to 1 to 1 randomization and only to binary data
- Fragility Index in time-to-event analyses may not always be appropriate (sensitivity is better for number of events in each group rather than timing of events); If the number of events in each group are similar, but the timing of the event has a clear difference, using a Fragility Index may falsely result in finding a trial inappropriately fragile
- Studies included were not all multicenter studies
Discussion:
- Studies with larger numbers of events and larger sample sizes were associated with less fragile results, however studies with inadequate or unclear concealment and large numbers of patients lost to follow up were associated with more fragile results
- The Fragility Index was ≤3 in ¼ of dichotomous outcomes from RCTs reported in high-impact journals
Author Conclusion: “The statistically significant results of many RCTs hinge on small numbers of events. The Fragility Index complements the P-value and helps identify less robust results.”
Fragility Index of RCTs Published in Critical Care Trials [2]
What They Did:
- PubMed/MEDLINE Search to identify all multicenter RCTs in critical care medicine
- Calculate the Fragility Index of trials reporting a statistically significant effect on mortality
Results:
- 56 trials included
- Median trial size was 126.5 patients
- Median Fragility Index: 2 (Range 0 – 48)
Strengths:
- Systematic search approach, ensuring that all multicenter RCTs of nonsurgical interventions reporting landmark mortality in critically ill patients were identified for inclusion
Limitations:
- Only included multicenter RCTs reporting a statistically significant effect of trial intervention on mortality, but it is important to note that there are other important measures of an interventions success
- Single center RCTs were excluded from this trial
Discussion:
- There was a moderate positive correlation between Fragility index and trial size
- Trials with ≤126.5 had a lower median fragility index than trials with >126.5 patients
- Trials reporting mortality as a primary outcome were significantly larger than those reporting it as a secondary outcome
- Only 1/3 of critical care RCTs comparing mortality between treatment arms had a sample size, that are sufficient to detect a 10% absolute reduction in mortality [3]
- There was a weak positive correlation between the Fragility Index and the number of study centers
Author Conclusion: “In critical care trials reporting statistically significant effects on mortality, the findings often depend on a small number of events. Critical care clinicians should be wary of basing decisions on trials with a low Fragility Index. We advocate the reporting of Fragility Index for future trials in critical care to aid interpretation and decision making by clinicians.”
Clinical Take Home Point: The Fragility Index is a useful metric in interpreting the results of randomized controlled trials, randomized in a 1:1 ratio, with dichotomous outcomes. Use of the Fragility Index can aid physicians in identifying trials that may be at risk of being overturned by future studies and not overestimating the significance of RCTs results.
References:
- Walsh M et al. The Statistical Significance of Randomized Controlled Trial Results is Frequently Fragile: A Case for a Fragility Index. J Clin Epidemiol 2014; 67(6): 622 – 8. PMID: 24508144
- Ridgeon EE et al. The Fragility Index in Multicenter Randomized Controlled Critical Care Trials* Crit Care Med 2016; 44(7): 1278 – 84. PMID: 26963326
- Harhay MO et al. Outcmes and Statistical Power in Adult Critical Care Randomized Trials. AM J Respir Crit Care Med 2014; 189: 1469 – 78. PMID: 24786714
For More Thoughts on This Topic Checkout:
- Birinder Giddey at INTENSIVE: Fragility Index (Walsh et al, 2014)
- Chris Nickson at Life in the Fast Lane: Fragility Index
- Josh Farkas at PulmCrit/EMCrit: What is the Fragility Index of the NINDS Trial?
Post Peer Reviewed By: Anand Swaminathan (Twitter: @EMSwami)