Credit: Janet Wittes

After pouring millions of dollars into developing a drug, researchers want to learn not only whether a drug works but what the magnitude of its effect is, what clinical outcomes it affects and who should receive it. Thus they ask many questions of data collected in clinical studies. However, these interwoven questions can present a challenge when it is time for the statistical analysis of the trial's results. With the increasing complexities of clinical trials, scientists need to be more thoughtful when it comes to choosing how they crunch the numbers. Biomedical researchers do not always choose the statistical constructs that suit their studies best, and as a result the studies they publish might miss real effects from the interventions under investigation.

In designing a trial under a classical statistical framework, researchers typically preselect a type I error rate, an effect size, and statistical power. Conventionally set at 0.05 in medical statistics, the type I error rate is the probability of declaring two therapies different if they are exactly the same. When a study ends, scientists calculate the probability that two identical medical treatments would have yielded a difference at least as large as the one observed and hope this P value comes out to less than 0.05, indicating that the results are statistically significant.
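
As a rough illustration of what that 0.05 threshold means, the sketch below simulates many trials in which the two arms are truly identical and counts how often a two-sample t-test nonetheless returns P < 0.05. The setup (normally distributed outcomes, 100 patients per arm) is an assumption chosen for illustration, not drawn from any particular study.

```python
# Sketch: under the null hypothesis (two identical "treatments"), roughly 5% of
# simulated trials produce P < 0.05 -- the conventional type I error rate.
# Illustrative assumptions: normal outcomes, 100 patients per arm, two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm = 10_000, 100
false_positives = 0
for _ in range(n_trials):
    control = rng.normal(0.0, 1.0, n_per_arm)   # both arms drawn from the
    treated = rng.normal(0.0, 1.0, n_per_arm)   # same distribution: no real effect
    _, p = stats.ttest_ind(treated, control)
    false_positives += p < 0.05

print(f"Proportion of 'significant' trials under the null: {false_positives / n_trials:.3f}")
# Expect a value close to 0.05
```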

Physicians, regulators and patients should worry about both type I errors—finding a benefit when there is none—and type II errors, where calculations fail to show a statistically significant benefit of an effective product. There's good reason for these concerns. Some scientists declare any comparison with a P value of less than 0.05 statistically significant. Those who object to post hoc data dredging might prespecify many hypotheses but still declare success for any of those hypotheses whose P value is less than 0.05. But the more shots at a target, the higher the chance of hitting it (or, in the context of experimentation, the more a researcher looks at the data, the more apparently statistically significant results he or she will find even if the experimental therapy yields results no different from the control). The difficulty of making reliable inferences in the face of many hypotheses is the problem of multiplicity.
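
The arithmetic behind "more shots at a target" is simple if the comparisons are assumed to be independent, which real trial outcomes rarely are: with each test run at 0.05, the chance of at least one false-positive finding is 1 − (1 − 0.05)^k across k comparisons. The short calculation below makes the trend explicit; the point is the trend, not the exact figures.

```python
# Sketch of "more shots at the target": the chance of at least one false positive
# grows with the number of comparisons. Assumes independent tests, each run at 0.05.
for k in (1, 2, 5, 8, 20):
    family_wise_error = 1 - (1 - 0.05) ** k
    print(f"{k:2d} comparisons -> P(at least one false positive) = {family_wise_error:.2f}")
# 1 -> 0.05, 2 -> 0.10, 5 -> 0.23, 8 -> 0.34, 20 -> 0.64
```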

To deal with multiplicity, statisticians have developed methods to limit the probability of making a type I error; these methods lower the probability of finding benefit of a product that in reality acts no better than the control therapy. To use the commonly applied Bonferroni's correction, investigators who ask a number of prespecified questions declare any comparison significant only if its P value is less than 0.05 divided by the number of questions. This strict method does indeed limit type I errors, but sometimes Bonferroni's corrections mask true effects because, as is common in medical trials, not all comparisons are independent of each other, not all have equal statistical power and not all are equally credible.
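
As a concrete sketch of the correction, the snippet below applies the Bonferroni threshold of 0.05 divided by the number of comparisons to a set of hypothetical P values; the values and the helper function are ours, chosen only to show how a result that clears 0.05 on its own can fail the corrected bar.

```python
# Sketch of Bonferroni's correction: with m prespecified comparisons, declare a
# result significant only if its P value falls below 0.05 / m.
# The P values below are hypothetical, purely for illustration.
def bonferroni_significant(p_values, alpha=0.05):
    """Return the corrected threshold and, for each P value, whether it clears it."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

p_values = [0.010, 0.045, 0.20, 0.30]            # four prespecified comparisons (hypothetical)
threshold, flags = bonferroni_significant(p_values)
print(f"Bonferroni threshold: {threshold:.4f}")  # 0.05 / 4 = 0.0125
for p, sig in zip(p_values, flags):
    print(f"P = {p:.3f} -> {'significant' if sig else 'not significant'}")
# Note how P = 0.045, 'significant' on its own, no longer clears the corrected bar.
```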

Yet Bonferroni's test treats all comparisons equally. For example, Rizos et al., in a meta-analysis of omega-3 fatty acids, examined eight comparisons: four outcomes (all-cause mortality, cardiac death, sudden death and myocardial infarction), each tested once for relative risk and once for absolute risk (JAMA 308, 1024–1033, 2012). The paper engendered considerable interest in the popular press, but the findings have also come under scrutiny because of the study's approach to multiplicity (M. Beck, Wall Street Journal D1, 2 October 2012). The authors declared comparisons statistically significant only for P values less than 0.05 / 8, or 0.0063. None of the eight comparisons passed that hurdle. The relative risk for cardiac death, however, was 0.91, with a P value of 0.01. Had the authors looked only at cardiac death, or had they tested only the relative risks, they would have found statistically significant evidence of benefit on cardiac death.

How should scientists deal with interwoven hypotheses within a clinical trial? For better or worse, a multiplicity of approaches to multiplicity is available. At one extreme are those who urge reporting observed, uncorrected P values, thus raising the specter of finding many results that will later be shown to be incorrect. Bonferroni-ites sit at the other extreme; ignoring the structure of the hypotheses, they simply divide by the number of questions they ask. This method is subject to abuse too: to show that a product does not work, list a host of hypotheses to make the criterion for benefit so stringent that the chance of finding anything statistically significant will be very low.

Indeed, sometimes Bonferroni's corrections are not applied when they should be. Consider, for example, a 12 October briefing document from the US Food and Drug Administration's Gastrointestinal Drugs Advisory Committee that detailed a three-arm trial comparing 0.10 milligrams per kilogram body weight per day and 0.05 milligrams per kilogram body weight per day of the peptide drug teduglutide (Gattex, made by New Jersey's NPS Pharmaceuticals) to placebo in adults with short bowel syndrome. The analysis specified comparing the low dose to placebo only if the P value comparing the high dose to placebo was less than 0.05. The P value for the high dose was 0.17, allowing no further formal statistical testing. The observed P value for the low dose was 0.007. Under Bonferroni's correction, the low dose would have been called statistically superior to placebo. Why decide to preclude testing of the low dose if the high dose is not statistically significant?
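
To make the contrast concrete, the sketch below applies both the prespecified hierarchical (gatekeeping) rule and a Bonferroni split to the two P values cited above; the function names and structure are illustrative, not taken from the FDA analysis.

```python
# Sketch contrasting two prespecified multiplicity strategies for a two-dose trial.
# The P values (0.17 for high dose, 0.007 for low dose) are the ones cited above;
# the function names are ours, purely for illustration.

def fixed_sequence(p_high, p_low, alpha=0.05):
    """Hierarchical (gatekeeping) testing: the low dose is tested only if the
    high dose first clears alpha."""
    if p_high >= alpha:
        return {"high dose": False, "low dose": None}   # low dose never formally tested
    return {"high dose": True, "low dose": p_low < alpha}

def bonferroni(p_high, p_low, alpha=0.05):
    """Bonferroni: each dose is tested against alpha / 2, regardless of the other."""
    threshold = alpha / 2
    return {"high dose": p_high < threshold, "low dose": p_low < threshold}

print(fixed_sequence(0.17, 0.007))  # {'high dose': False, 'low dose': None}
print(bonferroni(0.17, 0.007))      # {'high dose': False, 'low dose': True}
# Under the prespecified hierarchy the low dose cannot be declared effective;
# under Bonferroni (threshold 0.025) its P value of 0.007 would have cleared the bar.
```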

Saying that something doesn't work when it actually does is bad for both drug development and public health. Methods for handling multiplicity must account for biological relationships among outcomes (for example, cardiac death is a subset of all-cause death) and for the relative importance of the questions asked. Biomedical researchers and doctors dealing with statisticians who propose a Draconian-seeming method for avoiding type I errors should not hesitate to push for a technique that better reflects their understanding of the underlying biology. And statisticians should work together with clinicians and bench scientists to develop approaches to multiplicity that comport with the important questions being asked.