Failure to Replicate

March 20, 2013

I should have put that in quotes because it actually appears in the title of this new paper published in Neuropsychopharmacology:

Hart AB, de Wit H, Palmer AA. Candidate gene studies of a promising intermediate phenotype: failure to replicate. Neuropsychopharmacology. 2013 Apr;38(5):802-16. doi: 10.1038/npp.2012.245. Epub 2012 Dec 3. [PubMed]

From the Abstract alone you can get a sense:

We previously conducted a series of 12 candidate gene analyses of acute subjective and physiological responses to amphetamine in 99-162 healthy human volunteers (ADORA2A, SLC6A3, BDNF, SLC6A4, CSNK1E, SLC6A2, DRD2, FAAH, COMT, OPRM1). Here, we report our attempt to replicate these findings in over 200 additional participants ascertained using identical methodology. We were unable to replicate any of our previous findings.

The team, with de Wit’s lab expert on the human phenotyping and drug-response side and Palmer’s lab expert on the genetics, has been after genetic differences that mediate differential response to amphetamine for some time. There’s a human end and a mouse end to the overall program, which has been fairly prolific.

In terms of human results, they have previously reported effects as varied as:
-association of an adenosine receptor gene polymorphism with degree of anxiety in response to amphetamine
-association of a dopamine transporter gene promoter polymorphism with feeling the drug effect and with diastolic blood pressure
-association of casein kinase I epsilon gene polymorphisms with feeling the drug effect
-association of fatty acid amide hydrolase (FAAH) with Arousal and Fatigue responses to amphetamine
-association of mu 1 opioid receptor gene polymorphisms with Amphetamine scale subjective report in response to amphetamine

There were a dozen findings in total and, for the most part, the replication attempt with a new group of subjects failed to confirm the prior observations. The Discussion is almost plaintive at the start:

This study is striking because we were attempting to replicate apparently robust findings related to well-studied candidate genes. We used a relatively large number of new participants for the replication, and their data were collected and analyzed using identical procedures. Thus, our study did not suffer from the heterogeneity in phenotyping procedures implicated in previous failures to replicate other candidate gene studies (Ho et al, 2010; Mathieson et al, 2012). The failure of our associations to replicate suggests that most or all of our original results were false positives.

The authors then go on to discuss a number of obvious issues that may have led to the prior “false positives”.

-variation in the ethnic makeup of the various samples: one reanalysis using ancestry as a covariate didn’t change their prior results.

-power in genome-wide association studies is limited because the effect sizes / contributions to variance of individual alleles are small. They point out that candidate gene studies continue to report large effect sizes that are probably very unlikely in the broad scheme of things…and therefore comparatively likely to be false positives.

-multiple comparisons. They point out that not all of their prior papers applied multiple-comparisons corrections against the inflation of alpha (the false positive rate, in essence), and they certainly applied no such correction across the 12 findings, which were reported in a number of independent publications. As they note, the adjusted p value for the “322 primary tests performed in this study” (i.e., the tests spread across the several papers they were trying to replicate) would be 0.00015. A quick sketch of that arithmetic appears below.

-publication bias. This discussion covers the usual territory (ignoring all the negative outcomes), but the interesting thing is the confession about something many of us (yes, me) do that isn’t really addressed in the formal correction procedures for multiple comparisons.

Similarly, we sometimes considered several alternative methods for calculating phenotypes (eg, peak change score summarization vs area under the curve, which tend to be highly but incompletely correlated). It seems very likely that the candidate gene literature frequently reflects this sort of publication bias, which represents a special case of uncorrected multiple testing.
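
To put numbers on the multiple-comparisons point above, here is a minimal sketch of the arithmetic (mine, not the authors’; only the 322 tests and the conventional alpha of 0.05 come from the paper):

# Bonferroni arithmetic for the 322 primary tests discussed above.
# Purely illustrative; assumes all nulls are true and the tests are independent.
n_tests, alpha = 322, 0.05

per_test_threshold = alpha / n_tests    # ~0.000155, the "0.00015" the authors quote
expected_chance_hits = alpha * n_tests  # ~16 nominally "significant" results expected by chance

print(f"Bonferroni-adjusted threshold: {per_test_threshold:.6f}")
print(f"Expected false positives at p < 0.05 across {n_tests} tests: {expected_chance_hits:.1f}")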

This is a fascinating read. The authors make no bones about the fact that they’ve found that no fewer than a dozen findings they have published were the result of false positives. Not wrong…not fraudulent. Let us be clear. We must assume they were published with peer review, analysis techniques and sample sizes that were (and are?) standard for the field.

But they are not true.

The authors offer up larger sample sizes, better correction for multiple comparisons and a commitment to replication as solutions. Of these, the last seems the best and most likely solution. Like it or not, research funding is limited and there will always be a sliding scale. At first we have pilot experiments, or even anecdotal observations, to put us on the track. We do one study, limited by the available resources. Positive outcomes justify throwing more resources at the question. Interesting findings can stimulate other labs to join the party. Over time, the essential features of the original observation or finding are either confirmed or consigned to the bin of “likely false alarm”.

This is how science progresses. So while we can use experiences like this to help define a target sample size and scope for a real experiment, I’m not sure that we can ever overcome the problems of publication bias and cherry picking results from amongst multiple analyses of a dataset. At first, anyway. The way to overcome them is for the lab, or the field, to hold a result in mind as tentatively true and then proceed to replicate it in different ways.

__
UPDATE: I originally forgot to put in my standard disclaimer that I’m professionally acquainted with one or more of the authors of this work.

Hart, A.B., de Wit, H., & Palmer, A.A. (2013). Candidate gene studies of a promising intermediate phenotype: failure to replicate. Neuropsychopharmacology, 38(5), 802-816. DOI: 10.1038/npp.2012.245

19 Responses to “Failure to Replicate”

  1. Kate Says:

    Except in climate science, where cherry picking and publication bias NEVER HAPPEN, EVER EVER.

  2. Justthinkin Says:

    “We must assume they were published with peer review, analysis techniques and samples sizes that were (and are?) standard for the field. ”

    NEVER, EVER, EVER assume ANYTHING when it comes to “scientists” going for grant (taxpayers) monies! Without a doubt, and NO assumption, it can only be confirmation bias after that many false positives get published as fact. Conceive the theory, then jiggle data, or ignore (also known as cherry picking, as Kate points out), to get the desired result.

  3. Dodgy Geezer Says:

    James Blish (Doctor Mirabilis), describing Roger Bacon, the Franciscan friar who invented the Scientific Method in the 1200s, and whose 800th anniversary this year is scandalously ignored:

    “…his experiments were tests of principles, and as such were almost maddeningly repetitious, as significant experiments remain to this day – a fact always glossed over by popularizations of scientific method, in which experiments, miraculously, always work the first time, and the importance of negative results is never even mentioned…”

  4. Mr. Gunn Says:

    Well, since the NIH “doesn’t fund replications” and the journals everyone wants to publish in don’t publish replications, it looks like either we have to expect that scientists will spend their otherwise obtained grant money to do this stuff that they can’t publish, or we’ll have to come up with another idea. One thing that can help control for the multiple testing problems is to have work done by independent labs on a fee-for-service basis. This means people are getting paid cash money to do the work, not to get any specific result, and that’s why I joined with Elizabeth Iorns of Science Exchange to launch the Reproducibility Initiative: http://reproducibilityinitiative.org

    Not all studies will be appropriate, but getting an experiment replicated via the Initiative or the validation service (http://scienceexchange.com/validation) is like having a multi-center study on demand.

    Thanks for bringing up this issue, DM.

  5. drugmonkey Says:

    It is false to assert the NIH does not fund replication. That is a misunderstanding of the barriers to consistent replication. GlamourMagification of science is the real problem.

  6. Dave Says:

    So will they retract their previous papers?

  7. Dave Says:

    In diabetes research, it is very difficult to publish GWAS data in high IF journals without independent replication cohorts being included in the same manuscript. Typically at least one or two very large cohorts will be used for this purpose. This has led to fairly good reproducibility in recent years.

    And, Dave (nice name), why should they retract the papers? If we retracted papers that later turned out to be incorrect, we wouldn’t have much left.

  8. DrugMonkey Says:

    They should by no means retract the prior pubs. Being *wrong* is never grounds for retraction.

  9. Abraham Says:

    Thanks to DrugMonkey for the excellent summary.

    Several of the comments seem to misunderstand the situation and likely some fundamental aspects of the scientific process. First of all, all 12 of the original articles said in one way or another that replication of the results was needed before the findings could be considered definitive. We wrote the papers that way both because we believed that was true at the time and because we were actively working to generate a replication sample, which took years. We did not publish any of the initial papers in high IF journals; as Dave correctly states, replication is a common and legitimate standard for those journals (many of our original papers came out before replication had become necessary for high IF journals). Both the initial study and the replication were NIH funded, which is common.

    A key part of the scientific process that makes it different from, say, sports or religion is that when a scientist has new data showing that a prior hypothesis was incorrect, they change their opinion and publish the new conclusions. That is what was happening here. As DrugMonkey correctly states, retraction is reserved for cases where data were falsified, which is not our situation at all.

    Mr. Gunn has an interesting idea (fee-for-service) that is more similar to what is done to get FDA approval. In those cases (to the best of my limited understanding) the applicant has to state explicitly what analysis will be done before the data are collected. The idea is to limit multiple testing, which is one of the issues that we believe is at play in candidate gene studies. However, basic research has different goals and it would waste a huge amount of money if we only did one analysis with each dataset and then tossed it. Indeed, it also misunderstands the problem of controlling type 1 error: a lab with more funding would be able to do more fee-for-service studies and thus would, on average, produce more false positive results. Are we trying to control false positives on a per-lab-per-year basis? A per-publication basis? Does each of us get a lifetime quota of false positives, after which we should retire? No simple answers here; maybe it is time to think about Bayesian statistics.

    What makes this paper somewhat more interesting (we hoped) is that it suggests there are common elements of our experience that others performing similar candidate gene studies should consider — namely, that apparently large effect sizes should be viewed with skepticism, because we know from GWAS, which are better powered and more objective, that common alleles tend to produce very small effects. Personally I think imaging genetics studies suffer from the same problems, but we didn’t have any imaging data so we were not in a position to make specific comments about that field.

  10. drugmonkey Says:

    Dr. Palmer,

    I’m uncertain that a requirement for replication in a high profile journal solves the problem. It may improve the false positive rate somewhat, sure. But two false positives are just less likely, not impossible. Especially when the false positive is not really a false positive but rather the effect of another, mediating variable that you simply haven’t found yet. Your re-analysis for non-Caucasian subjects sort of gets at the matter. If authors do replications that use the same subject pool restrictions, the thing you feared (but didn’t confirm) can be at play.

    I stand by my assertion that the *real* solution is the incremental, ever replicating conduct of science. Our slide into a journalistic, High Impact, alleged “full story” in one manuscript GlamourMagification of real science is the problem here. *Every* finding is tentative, pending additional replication and extension. That should be our mantra.

    Using your findings to tune up the process is a good thing, yes. But it is not a solution that makes each paper’s findings a done deal.

  11. Nick Says:

    Dear Drugmonkey,

    You make a reasonable point that it can be extremely difficult to entirely rule out the possibility of a false positive result, even when the result replicates. But cases where an association arises due to confounding/ a mediating factor rather than via a causal mechanism can still be of interest. For example, the search for biomarkers of disease does not necessarily seek to understand the underlying causes of the disease; it may suffice merely to identify factors commonly associated with the condition which might be useful in prediction/ diagnosis. It’s therefore important to distinguish between this kind of false positive, and a false positive that arises simply due to chance or multiple testing (which replication does largely help to guard against).

    Nick

  12. FabulousMan Says:

    “I’m not sure that we can ever overcome the problems of publication bias and cherry picking results from amongst multiple analyses of a dataset. ”

    um… well… For starters you could do what you were supposed to learn in intro statistics: State your hypothesis up front and then test exactly that hypothesis and nothing else. No post hoc ‘oooh look this is significant’, no re-doing experiments until you get significance.

    Gene associations and other massively-parallel searches are notorious for Type II errors. This is because the ‘experiment’ is a lot of tests without controls for multiple measures. Since random sets of numbers from the same population will be statistically significant — by definition — 5 percent of the time, you will get on average 5 ‘statistically significant’ effects every time you run 100 tests. Run 1000 tests (easily done with genomic stuff), and voila! You have 50 ‘hits’. Let’s say you want to be really impressive and validate these. Still, purely by chance, 2-3 of these ‘hits’ will again be statistically significant. So now you have hits that are validated.
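
    Here is a quick simulation of that arithmetic (a sketch only; the 1000 tests and the 5% threshold come from the numbers above, while the 100 subjects per group is an arbitrary assumption):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_tests, n_per_group, alpha = 1000, 100, 0.05

    # Two groups drawn from the SAME population, so every "hit" is pure chance.
    a = rng.normal(size=(n_tests, n_per_group))
    b = rng.normal(size=(n_tests, n_per_group))
    hits = ttest_ind(a, b, axis=1).pvalue < alpha   # roughly 50 of the 1000 tests come up "significant"

    # "Validate" only the hits in fresh, equally null data.
    a2 = rng.normal(size=(hits.sum(), n_per_group))
    b2 = rng.normal(size=(hits.sum(), n_per_group))
    replicated = ttest_ind(a2, b2, axis=1).pvalue < alpha
    print(hits.sum(), replicated.sum())             # ~50 initial hits, of which 2-3 "replicate" by chance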

    And it’s all just chance bullshit. Tragically, most biomedical scientists have an extremely poor quantitative background, and are particularly deluded when it comes to fundamental statistical concepts, like the meaning of a P-value.

    What was that study where the statisticians went through Nature and found serious statistical errors in 80% of the papers? Honestly, I’m sort of surprised it was that low. It makes me think that maybe stats wasn’t used for 20% of the papers in the issues they read. When I read Neuron, I swear the rate is 95% wrong. I rarely see a paper that doesn’t have at least one obvious statistical boner.

    Trouble is, most reviewers don’t know what to do because statistical mistakes are the ‘norm’ in biomedical science. Worse, unless you willfully ignore the proper way to do experiments (e.g. no post hoc analysis, correction for multiple measures, etc) you can’t compete in today’s research environment. There is virtually no incentive in U.S. Biomedical science for actually getting things right.

  13. FabulousMan Says:

    Abraham: I think your comments are totally on the right track. The answer to these problems is correction for multiple measures. Unfortunately, few studies use it. Of course there is still the possibility of type II errors, but that possibility is reduced.

    Nick: Post hoc analysis, which you seem to be willing to accept, is not statistically acceptable. If you know that something is already statistically significant, is the P value less than 0.05? No, it’s 1.0. Why? Because you already know that it happened. The odds are therefore 100%. To truly judge whether or not something is statistically significant at the 5% level, you need to do the experiment again. And then accept the result. You can’t say ‘well it worked the first time, but not the second. Maybe I should combine the data or do it again…’

    DM: I am horrified at your lack of basic statistical understanding. I recommend starting with ‘Intuitive Biostatistics’. Seriously. You’ll like it.

  14. FabulousMan Says:

    Yea… we statistical nazis are out here. Not many of us, but we’re out here. Coincidentally, the last paper I reviewed, I recommended rejection precisely because the paper’s conclusions rested on multiple measures that were corrected neither experimentally nor statistically.

    The editor sent me a nice email a couple of weeks later thanking me especially for my super helpful input.

  15. drugmonkey Says:

    Stats dweebs crack me up.

  16. FabulousMan Says:

    That comment just makes you sound willfully ignorant, DM.

  17. Abraham Says:

    I’m sorry to see the tone of the comments run so negative. DM wrote a very nice and I thought clear summary of this paper. I commended him for that and was pleased that he took an interest in the paper to begin with.

    The paper touched on some hot button issues like replication and statistical significance. We were aware of these issues even when writing the initial papers (which all stated the need for future replication). We used replication as a reasonably stringent way of determining whether any of our initial observations held water. Regardless of exactly how you think about the statistical issues, we expect most false positives to fall apart under the scrutiny of replication, and they did. We then discussed how this was not just a sad story about our own publications (naturally we went through a period of being disappointed before writing the paper) but rather that there are a lot of themes common to other, similar candidate gene studies. So it was a ‘mea culpa’ but also a ‘caveat emptor’ or ‘emperor has no clothes’ paper. I will not work on any more candidate gene studies. The hope in presenting the results this way was to convince others not to go down this road, or at least not with their eyes closed.

    We had no intention of addressing anything about high profile science, which appears to be an ongoing theme of DM and this blog. I agree with DM and others posting here that there are many influences / temptations that scientists should resist. There are doubtless many examples of people failing to resist such temptations, but I like to think that most of us are doing pretty well most of the time. I think our admitting that we were wrong is a small example of that. As I often say to students, science is corrupted because people are involved, and people corrupt everything they encounter, at least to some extent; it is sort of another take on the notion of original sin.

  18. drugmonkey Says:

    I’m sorry to see the tone of the comments run so negative.

    One thing not visible to you is that this post caught an unusual amount of attention from somewhere, or two somewheres (one twitter link, one blog, but they might have been related), that lean(s) political rather than scientific. Those first two comments seem to derive from that angle. It was the “nyah, nyah, science is all bullshit” type of meme they were riffing on…they seem to be climate science denialists to my eye. They were drop-bys, but I wouldn’t take too much concern with those people….they won’t be back.

    Beyond that….well, you are new around here, aren’t you 🙂

    We had no intention of addressing anything about high profile science, which appears to be an ongoing theme of DM and this blog.

    I confess to a certain….worldview.

    So it was a ‘mea culpa’ but also a ‘caveat emptor’ or ‘emperor has no clothes’ paper.

    I see it more as a “this is the way science actually works” paper, and a rare one at that. My congratulations on your fortitude and honesty in publishing this…we all know how easy it would have been for the authors to simply move on and tell very few people about the lack of replication. We all can grasp the potential cost here for your careers.

    I hope the outcome of this is that your paper is brandished like a cudgel by all and sundry to beat back the forces of evil that wish to question why replications are necessary, to explain why we should give decent grant scores to applications that want to drill down on a poorly-replicated finding and to justify publishing such replications in journals of decent reputation.

  19. Nick Says:

    FabulousMan,

    You seem to have gotten completely the wrong end of the stick here. The reported study wasn’t just a set of post-hoc tests; it was a validation/ replication study designed to assess previous discoveries using a completely new set of participants.

    While I haven’t specifically delved into the full details of the paper(s), I would be very surprised if no corrections for multiple testing were applied at all (everyone knows about the dangers of multiple testing in genomics). For example, in genome-wide association studies, the accepted “level of significance” is generally 5 * 10^-8, which allows for testing 1 million genetic variants (presumably far fewer variants were tested in these candidate gene studies, so a less stringent correction was probably justified).

    For someone so critical of the level of “basic statistical understanding”, you really ought to get your facts right too: you talk of type II errors being rife in biomedical research, but what you actually refer to are type “I” errors, i.e. false-positive findings.


    Back to the original issue: the lack of replication is, unfortunately, not surprising given the history of candidate gene and, more specifically, pharmacogenetic studies over the last decade or so. So few findings have actually replicated from these studies that many presumably spurious results now fill the literature in these fields.

    While several potential explanations have already been raised for these problems, other factors might also include:

    – “The Winner’s Curse”, where estimates of effects in follow-up studies tend to be markedly smaller than in the original discovery phase. This can have disastrous consequences for the statistical power in the replication phase (the short simulation after this list sketches the effect).

    – Batch/chip/plate effects can sometimes bias/confound a study completely beyond repair. These are becoming increasingly common in the new era of ‘omics research, but are often overlooked at both design and analysis stages of a study.

    – Poor data quality. Genomics requires rigorous and stringent data QC processes, but practices remain unregulated and will inevitably differ between institutions. Although analysts with experience of these issues should be able to (largely) guard against many potential problems, there is no guarantee that an analyst with the correct experience will always carry out the work!
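
    To illustrate the Winner’s Curse point above, here is a small simulation (a sketch only; the true effect of 0.2, the 50 subjects per group and the 0.05 threshold are arbitrary assumptions, not numbers from any of the studies discussed):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    true_effect, n_per_group, alpha, n_studies = 0.2, 50, 0.05, 5000

    # Many small two-group "discovery" studies of the same modest true effect.
    cases = rng.normal(loc=true_effect, size=(n_studies, n_per_group))
    controls = rng.normal(loc=0.0, size=(n_studies, n_per_group))
    p = ttest_ind(cases, controls, axis=1).pvalue
    observed = cases.mean(axis=1) - controls.mean(axis=1)

    # Conditioning on significance inflates the apparent effect, so a replication
    # powered on the published estimate will usually be underpowered.
    print("true effect:", true_effect)
    print("mean effect among 'significant' discoveries:", observed[p < alpha].mean().round(2))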

    Cheers,

    Nick
