The long-awaited results of another trial in critical care were published in a recent NEJM: (http://content.nejm.org/cgi/content/abstract/358/2/111). Similar to the VASST trial, the CORTICUS trial was "negative" and low dose hydrocortisone was not demonstrated to be of benefit in septic shock. However, unlike VASST, in this case the results are in conflict with an earlier trial (Annane et al, JAMA, 2002) that generated much fanfare and which, like the Van den Berghe trial of the Leuven Insulin Protocol, led to widespread [and premature?] adoption of a new therapy. The CORTICUS trial, like VASST, raises some interesting questions about the design and interpretation of trials in which short-term mortality is the primary endpoint.
Jean Louis Vincent presented data at this year's SCCM conference with which he estimated that only about 10% of trials in critical care are "positive" in the traditional sense. (I was not present, so this is basically hearsay to me - if anyone has a reference, please e-mail me or post it as a comment.) Nonetheless, this estimate rings true. Few are the trials that show a statistically significant benefit in the primary outcome, fewer still are trials that confirm the results of those trials. This begs the question: are critical care trials chronically, consistently, and woefully underpowered? And if so, why? I will offer some speculative answers to these and other questions below.
The CORTICUS trial, like VASST, was powered to detect a 10% absolute reduction in mortality. Is this reasonable? At all? What is the precedent for a 10% ARR in mortality in a critical care trial? There are few, if any. No large, well-conducted trials in critical care that I am aware of have ever demonstrated (least of all consistently) a 10% or greater reduction in mortality of any therapy, at least not as a PRIMARY PROSPECTIVE OUTCOME. Low tidal volume ventilation? 9% ARR. Drotrecogin-alfa? 7% ARR in all-comers. So I therefore argue that all trials powered to detect an ARR in mortality of greater than 7-9% are ridiculously optimistic, and that the trials that spring from this unfortunate optimism are woefully underpowered. It is no wonder that, as JLV purportedly demonstrated, so few trials in critical care are "positive". The prior probability is is exceedingly low that ANY therapy will deliver a 10% mortality reduction. The designers of these trials are, by force of pragmatic constraints, rolling the proverbial trial dice and hoping for a lucky throw.
Then there is the issue of regression to the mean. Suppose that the alternative hypothesis (Ha) is indeed correct in the generic sense that hydrocortisone does beneficially influence mortality in septic shock. Suppose further that we interpret Annane's 2002 data as consistent with Ha. In that study, a subgroup of patients (non-responders) demonstrated a 10% ARR in mortality. We should be excused for getting excited about this result, because after all, we all want the best for our patients and eagerly await the next breaktrough, and the higher the ARR, the greater the clinical relevance, whatever the level of statistical significance. But shouldn't we regard that estimate with skepticism since no therapy in critical care has ever shown such a large reduction in mortality as a primary outcome? Since no such result has ever been consistently repeated? Even if we believe in Ha, shouldn't we also believe that the 10% Annane estimate will regress to the mean on repeated trials?
It may be true that therapies with robust data behind them become standard practice, equipoise dissapates, and the trials of the best therapies are not repeated - so they don't have a chance to be confirmed. But the knife cuts both ways - if you're repeating a trial, it stands to reason that the data in support of the therapy are not that robust and you should become more circumspect in your estimates of effect size - taking prior probability and regression to the mean into account.
Perhaps we need to rethink how we're powering these trials. And funding agencies need to rethink the budgets they will allow for them. It makes little sense to spend so much time, money, and effort on underpowered trials, and to establish the track record that we have established where the majority of our trials are "failures" in the traditional sence and which all include a sentence in the discussion section about how the current results should influence the design of subsequent trials. Wouldn't it make more sense to conduct one trial that is so robust that nobody would dare repeat it in the future? One that would provide a definitive answer to the quesiton that is posed? Is there something to be learned from the long arc of the steroid pendulum that has been swinging with frustrating periodicity for many a decade now?
This is not to denigrate in any way the quality of the trials that I have referred to. The Canadian group in particular as well as other groups (ARDSnet) are to be commended for producing work of the highest quality which is of great value to patients, medicine, and science. But in keeping with the advancement of knowledge, I propose that we take home another message from these trials - we may be chronically underpowering them.
This is discussion forum for physicians, researchers, and other healthcare professionals interested in the epistemology of medical knowledge, the limitations of the evidence, how clinical trials evidence is generated, disseminated, and incorporated into clinical practice, how the evidence should optimally be incorporated into practice, and what the value of the evidence is to science, individual patients, and society.
Showing posts sorted by relevance for query regression to the mean. Sort by date Show all posts
Showing posts sorted by relevance for query regression to the mean. Sort by date Show all posts
Monday, March 10, 2008
Friday, May 31, 2013
Over Easy? Trials of Prone Positioning in ARDS

First, a general principle:
regression to the mean. Few, if
any, therapies in critical care (or in medicine in general) confer a mortality
benefit this large. I refer the reader (again) to our study of delta inflation which tabulated over 30 critical care trials in the top 5
medical journals over 10 years and showed that few critical care trials show mortality
deltas (absolute mortality differences) greater than 10%. Almost
all those that do are later refuted.
Indeed it was our conclusion that searching for deltas greater than or
equal to 10% is akin to a fool's errand, so unlikely is the probability of
finding such a difference. Jimmy T.
Sylvester, my attending at JHH in late 2001 had already recognized this. When the now infamous sentinel trail of intensive insulin therapy (IIT) was published, we discussed it at our ICU
pre-rounds lecture and he said something like "Either these data are
faked, or this is revolutionary." We
now know that there was no revolution (although many ICUs continue to practice as if there had been one). He
could have just as easily said that this is an anomaly that will regress to the
mean, that there is inherent bias in this study, or that "trials stopped early for benefit...."
Friday, March 5, 2010
Levo your Dopa at the Door - how study design influences our interpretation of reality
Another excellent critical care article was published this week in NEJM, the SOAP II study: http://content.nejm.org/cgi/content/short/362/9/779 . In this RCT of norepinephrine (norepi, levophed, or "levo" for short) versus dopamine ("dopa" for short) for the treatment of shock, the authors tried to resolve the longstanding uncertainty and debate surrounding the treatment of patients in various shock states. Proponents of any agent in this debate have often hung their hats on extrapolations of physiological and pharmacological principles to intact humans, leading to colloquialisms such as "leave-em-dead" for levophed and "renal-dose dopamine". This blog has previously emphasized the frailty of pathophysiological reasoning, the same reasoning which has irresistibly drawn cardiologists and nephrologists to dopamine because of its presumed beneficial effects on cardiac and urine output, and, by association, outcomes.
Hopefully all docs with a horse in this race will take note of the outcome of this study. In its simplest and most straightforward and technically correct interpretation, levo was not superior to dopa in terms of an effect on mortality, but was indeed superior in terms of side effects, particularly cardiac arrhythmias (a secondary endpoint). The direction of the mortality trend was in favor of levo, consistent with observational data (the SOAP I study by many of the same authors) showing reduced mortality with levo compared with dopa in the treatment of shock. As followers of this blog also know, the interpretation of "negative" studies (that is, MOST studies in critical care medicine - more on that in a future post) can be more challenging than the interpretation of positive studies, because "absence of evidence is not evidence of absence".
We could go to the statistical analysis section, and I could harp on the choice of delta, the decision to base it on a relative risk reduction, the failure to predict a baseline mortality, etc. (I will note that at least the authors defended their delta based on prior data, something that is a rarity - again, a future post will focus on this.) But, let's just be practical and examine the 95% CI of the mortality difference (the primary endpoint) and try to determine whether it contains or excludes any clinically meaningful values that may allow us to compare these two treatments. First, we have to go to the raw data and find the 95% CI of the ARR, because the Odds Ratio can inflate small differences as you know. That is, if the baseline is 1%, then a statistically significant increase in odds of 1.4 is not meaningful because it represents only a 0.4% increase in the outcome - miniscule. With Stata, we find that the ARR is 4.0%, with a 95% CI of -0.76% (favors dopamine) to +8.8% (favors levo). Wowza! Suppose we say that a 3% difference in mortality in either direction is our threshold for CLINICAL significance. This 95% CI includes a whole swath of values between 3% and 8.8% that are of interest to us and they are all in favor of levo. (Recall that perhaps the most lauded trial in critical care medicine, the ARDSnet ARMA study, reduced mortality by just 9%.) On the other side of the spectrum, the range of values in favor of dopa is quite narrow indeed - from 0% to -0.76%, all well below our threshold for clinical significance (that is, the minimal clinically important difference or MCID) of 3%. So indeed, this study surely seems to suggest that if we ever choose between these two widely available and commonly used agents, the cake goes to levo, hands down. I hardly need a statistically significant result with a 95% CI like this one!
So, then, why was the study deemed "negative"? There are a few reasons. Firstly, the trial is probably guilty of "delta inflation" whereby investigators seek a pre-specified delta that is larger than is realistic. While they used, ostensibly, 7%, the value found in the observational SOAP I study, they did not account for regression to the mean, or allow any buffer for the finding of a smaller difference. However, one can hardly blame them. Had they looked instead for 6%, and had the 4% trend continued for additional enrollees, 300 additional patients in each group (or about 1150 in each arm) would have been required and the final P-value would have still fallen short at 0.06. Only if they had sought a 5% delta, which would have DOUBLED the sample size to 1600 per arm, would they have achieved a statistically significant result with 4% ARR, with P=0.024. Such is the magnitude of the necessary increase in sample size as you seek smaller and smaller deltas.
Which brings me to the second issue. If delta inflation leads to negative studies, and logistical and financial constraints prohibit the enrollment of massive numbers of patients, what is an investigator to do? Sadly, the poor investigator wishing to publish in the NEJM or indeed any peer reviewed journal is hamstrung by conventions that few these days even really understand anymore: namely, the mandatory use of 0.05 for alpha and "doubly significant" power calculations for hypothesis testing. I will not comment more on the latter other than to say that interested readers can google this and find some interesting, if arcane, material. As regards the former, a few comments.
The choice of 0.05 for the type 1 error rate, that is, the probability that we will reject the null hypothesis based on the data and falsely conclude that one therapy is superior to the other; and the choice of 10-20% for the type 2 error rate (power 80-90%), that is the probability that the alternative hypothesis is really true and we will reject it based on the data; derive from the traditional assumption, which is itself an omission bias, that it is better in the name of safety to keep new agents out of practice by having a more stringent requirement for accepting efficacy than the requirement for rejecting it. This asymmetry is the design of trials is of dubious rationality from the outset (because it is an omission bias), but it is especially nettlesome when the trial is comparing two agents already in widespread use. As opposed to the trial of a new drug compared to placebo, where we want to set the hurdle high for declaring efficacy, especially when the drug might have side effects - with levo versus dopa, the real risk is that we'll continue to consider them to be equivalent choices when there is strong reason to favor one over the other based either on previous or current data. This is NOT a trial of treatment versus no treatment of shock, this trial assumes that you're going to treat the shock with SOMETHING. In a trial such as this one, one could make a strong argument that a P-value of 0.10 should be the threshold for statistical significance. In my mind it should have been.
But as long as the perspicacious consumer of the literature and reader of this blog takes P-values with a grain of salt and pays careful attention to the confidence intervals and the MCID (whatever that may be for the individual), s/he will not be misled by the deeply entrenched convention of alpha at 0.05, power at 90%, and delta wildly inflated to keep the editors and funding agencies mollified.
Hopefully all docs with a horse in this race will take note of the outcome of this study. In its simplest and most straightforward and technically correct interpretation, levo was not superior to dopa in terms of an effect on mortality, but was indeed superior in terms of side effects, particularly cardiac arrhythmias (a secondary endpoint). The direction of the mortality trend was in favor of levo, consistent with observational data (the SOAP I study by many of the same authors) showing reduced mortality with levo compared with dopa in the treatment of shock. As followers of this blog also know, the interpretation of "negative" studies (that is, MOST studies in critical care medicine - more on that in a future post) can be more challenging than the interpretation of positive studies, because "absence of evidence is not evidence of absence".
We could go to the statistical analysis section, and I could harp on the choice of delta, the decision to base it on a relative risk reduction, the failure to predict a baseline mortality, etc. (I will note that at least the authors defended their delta based on prior data, something that is a rarity - again, a future post will focus on this.) But, let's just be practical and examine the 95% CI of the mortality difference (the primary endpoint) and try to determine whether it contains or excludes any clinically meaningful values that may allow us to compare these two treatments. First, we have to go to the raw data and find the 95% CI of the ARR, because the Odds Ratio can inflate small differences as you know. That is, if the baseline is 1%, then a statistically significant increase in odds of 1.4 is not meaningful because it represents only a 0.4% increase in the outcome - miniscule. With Stata, we find that the ARR is 4.0%, with a 95% CI of -0.76% (favors dopamine) to +8.8% (favors levo). Wowza! Suppose we say that a 3% difference in mortality in either direction is our threshold for CLINICAL significance. This 95% CI includes a whole swath of values between 3% and 8.8% that are of interest to us and they are all in favor of levo. (Recall that perhaps the most lauded trial in critical care medicine, the ARDSnet ARMA study, reduced mortality by just 9%.) On the other side of the spectrum, the range of values in favor of dopa is quite narrow indeed - from 0% to -0.76%, all well below our threshold for clinical significance (that is, the minimal clinically important difference or MCID) of 3%. So indeed, this study surely seems to suggest that if we ever choose between these two widely available and commonly used agents, the cake goes to levo, hands down. I hardly need a statistically significant result with a 95% CI like this one!
So, then, why was the study deemed "negative"? There are a few reasons. Firstly, the trial is probably guilty of "delta inflation" whereby investigators seek a pre-specified delta that is larger than is realistic. While they used, ostensibly, 7%, the value found in the observational SOAP I study, they did not account for regression to the mean, or allow any buffer for the finding of a smaller difference. However, one can hardly blame them. Had they looked instead for 6%, and had the 4% trend continued for additional enrollees, 300 additional patients in each group (or about 1150 in each arm) would have been required and the final P-value would have still fallen short at 0.06. Only if they had sought a 5% delta, which would have DOUBLED the sample size to 1600 per arm, would they have achieved a statistically significant result with 4% ARR, with P=0.024. Such is the magnitude of the necessary increase in sample size as you seek smaller and smaller deltas.
Which brings me to the second issue. If delta inflation leads to negative studies, and logistical and financial constraints prohibit the enrollment of massive numbers of patients, what is an investigator to do? Sadly, the poor investigator wishing to publish in the NEJM or indeed any peer reviewed journal is hamstrung by conventions that few these days even really understand anymore: namely, the mandatory use of 0.05 for alpha and "doubly significant" power calculations for hypothesis testing. I will not comment more on the latter other than to say that interested readers can google this and find some interesting, if arcane, material. As regards the former, a few comments.
The choice of 0.05 for the type 1 error rate, that is, the probability that we will reject the null hypothesis based on the data and falsely conclude that one therapy is superior to the other; and the choice of 10-20% for the type 2 error rate (power 80-90%), that is the probability that the alternative hypothesis is really true and we will reject it based on the data; derive from the traditional assumption, which is itself an omission bias, that it is better in the name of safety to keep new agents out of practice by having a more stringent requirement for accepting efficacy than the requirement for rejecting it. This asymmetry is the design of trials is of dubious rationality from the outset (because it is an omission bias), but it is especially nettlesome when the trial is comparing two agents already in widespread use. As opposed to the trial of a new drug compared to placebo, where we want to set the hurdle high for declaring efficacy, especially when the drug might have side effects - with levo versus dopa, the real risk is that we'll continue to consider them to be equivalent choices when there is strong reason to favor one over the other based either on previous or current data. This is NOT a trial of treatment versus no treatment of shock, this trial assumes that you're going to treat the shock with SOMETHING. In a trial such as this one, one could make a strong argument that a P-value of 0.10 should be the threshold for statistical significance. In my mind it should have been.
But as long as the perspicacious consumer of the literature and reader of this blog takes P-values with a grain of salt and pays careful attention to the confidence intervals and the MCID (whatever that may be for the individual), s/he will not be misled by the deeply entrenched convention of alpha at 0.05, power at 90%, and delta wildly inflated to keep the editors and funding agencies mollified.
Thursday, March 20, 2014
Sepsis Bungles: The Lessons of Early Goal Directed Therapy
On March 18th, the NEJM published early online three original trials of therapies for the critically ill that will serve as fodder for several posts. Here, I focus on the ProCESS trial of protocol guided therapy for early septic shock. This trial is in essence a multicenter version of the landmark 2001 trial of Early Goal Directed Therapy (EGDT) for severe sepsis by Rivers et al. That trial showed a stunning 16% absolute reduction in mortality in sepsis attributed to the use of a protocol based on physiological goals for hemodynamic management. That absolute reduction in mortality is perhaps the largest for any therapy in critical care medicine. If such a reduction were confirmed, it would make EGDT the single most important therapy in the field. If such reduction cannot be confirmed, there are several reasons why the Rivers results may have been misleading:
- As I have blogged in the case of intensive insulin therapy, Single center studies inflate treatment effects when compared to multicenter studies for reasons that are unclear, but which may be related to bias especially in unblinded studies. (The revelation that Rivers was an investor in one of the devices used in the trial raised additional concerns about bias.)
- Regression to the mean may lead to reduced effect sizes when trials are repeated, especially when the index trial has a very large effect size. In a similar vein, since large absolute mortality reductions are statistically unlikely in critical care medicine, Bayesian inference means that trials reporting large reductions are likely to represent type I statistical errors.
There were other concerns about the Rivers study and how it was later incorporated into practice, but I won't belabor them here. The ProCESS trial randomized about 1350 patients among three groups, one simulating the original Rivers protocol, one to a modified Rivers protocol, and one representing "standard care" that is, care directed by the treating physician without a protocol. The study had 80% power to demonstrate a mortality reduction of 6-7%. Before you read further, please wager, will the trial show any statistically significant differences in outcome that favor EGDT or protocolized care?
Saturday, April 27, 2013
Tell Them to Go Pound Salt: Ideology and the Campaign to Legislate Dietary Sodium Intake
In the March 28th, 2013 issue of the NEJM, a review of sorts
entitled "Salt in Health and Disease - A Delicate Balance" by Kotchen et al can be
found. My interest in this topic stems
from my interest in the question of association versus causation, my personal predilection for salt, my observation that I lose a good deal of sodium in outdoor activities
in the American Southwest, and my concern for bias in the generation of and especially
the implementation of evidence in medicine as public policy.
This is an important topic, especially because sweeping
policy changes regarding the sodium content of food are proposed, but it is a
nettlesome topic to study, rife with hobgoblins. First we need
a well-defined research question: does reduction
in dietary sodium intake: a.) reduce
blood pressure in hypertensive people? in
all people? b.) does this reduction in
hypertension lead to improved outcomes (hypertension is in some ways a
surrogate marker)? In a utopian world,
we would randomize thousands of participants to diets low in sodium and "normal"
in sodium, we would measure sodium intake carefully, and we would follow the
participants for changes in blood pressure and clinical outcomes for a
protracted period. But alas, this has
not been done, and it will not likely be done because of cost and logistics,
among other obstacles (including ideology).
Subscribe to:
Posts (Atom)