Time to N-up!

May 2, 2018

Chatter on the Twitts today brought my attention to a paper by Weber and colleagues that had a rather startlingly honest admission.

Weber F, Hoang Do JP, Chung S, Beier KT, Bikov M, Saffari Doost M, Dan Y.Regulation of REM and Non-REM Sleep by Periaqueductal GABAergic Neurons. Nat Commun. 2018 Jan 24;9(1):354. doi: 10.1038/s41467-017-02765-w.

If you page all the way down to the end of the Methods of this paper, you will find a statement on sample size determination. I took a brief stab at trying to find the author guidelines for Nature Communications because a standalone statement of how sample size was arrived upon is somewhat unusual to me. Not that I object, I just don’t find this to be common in the journal articles that I read. I was unable to locate it quickly so..moving along to the main point of the day. The statement reads partially:

Sample sizes

For optogenetic activation experiments, cell-type-specific ablation experiments, and in vivo recordings (optrode recordings and calcium imaging), we continuously increased the number of animals until statistical significance was reached to support our conclusions.

Wow. WOW!

This flies in the face of everything I have ever understood about proper research design. In the ResearchDesign 101 approach, you determine* your ideal sample size in advance. You collect your data in essentially one go and then you conduct your analysis. You then draw your conclusions about whether the collected data support, or fail to support, rejection of a null hypothesis. This can then allow you to infer things about the hypothesis that is under investigation.

In the real world, we modify this a bit. And what I am musing today is why some of the ways that we stray from ResearchDesign orthodoxy are okay and some are not.

We talk colloquially about finding support for (or against) the hypothesis under investigation. We then proceed to discuss the results in terms of whether they tend to support a given interpretation of the state of the world or a different interpretation. We draw our conclusions from the available evidence- from our study and from related prior work. We are not, I would argue, supposed to be setting out to find the data that “support our conclusions” as mentioned above. It’s a small thing and may simply reflect poor expression of the idea. Or it could be an accurate reflection that these authors really set out to do experiments until the right support for a priori conclusions has been obtained. This, you will recognize, is my central problem with people who say that they “storyboard” their papers. It sounds like a recipe for seeking support, rather than drawing conclusions. This way lies data fakery and fraud.

We also, importantly, make the best of partially successful experiments. We may conclude that there was such a technical flaw in the conduct of the experiment that it is not a good test of the null hypothesis. And essentially treat it in the Discussion section as inconclusive rather than a good test of the null hypothesis.

One of those technical flaws may be the failure to collect the ideal sample size, again as determined in advance*. So what do we do?

So one approach is simply to repeat the experiment correctly. To scrap all the prior data, put fixes in place to address the reasons for the technical failure, and run the experiment again. Even if the technical failure hit only a part of the experiment. If it affected only some of the “in vivo recordings”, for example. Orthodox design mavens may say it is only kosher to re run the whole shebang.

In the real world, we often have scenarios where we attempt to replace the flawed data and combine it with the good data to achieve our target sample size. This appears to be more or less the space in which this paper is operating.

“N-up”. Adding more replicates (cells, subjects, what have you) until you reach the desired target. Now, I would argue that re-running the experiment with the goal of reaching the target N that you determined in advance* is not that bad. It’s the target. It’s the goal of the experiment. Who cares if you messed up half of them every time you tried to run the experiment? Where “messed up” is some sort of defined technical failure rather than an outcome you don’t like, I rush to emphasize!

On the other hand, if you are spamming out low-replicate “experiments” until one of the scenarios “looks promising”, i.e. looks to support your desired conclusions, and selectively “n-up” that particular experiment, well this seems over the line to me. It is much more likely to result in false positives. Well, I suppose running all of these trial experiments at the full power is just as likely it is just that you are not able to do as many trial experiments at full power. So I would argue the sheer number of potential experiments is greater for the low-replicate, n-up-if-promising approach.

These authors appear to have done this strategy even one worse. Because their target is not just an a priori determined sample size to be achieved only when the pilot “looks promising”. In this case they take the additional step of only running replicates up to the point where they reach statistical significance. And this seems like an additional way to get an extra helping of false-positive results to me.

Anyway, you can google up information on false positive rates and p-hacking and all that to convince yourself of the math. I was more interested in trying to probe why I got such a visceral feeling that this was not okay. Even if I personally think it is okay to re-run an experiment and combine replicates (subjects in my case) to reach the a priori sample size if it blows up and you have technical failure on half of the data.

__
*I believe the proper manner for determining sample size is entirely apart from the error the authors have admitted to here. This isn’t about failing to complete a power analysis or the like.