April 22, 2016


Pretty data


Prove the hypothesis

Representative image

Trend for significance

Different subcultures of science may use certain phrases that send people in other traditions into paroxysms of critique.

Mostly it is because such phrasing can sound like bad science. As if the person using it doesn’t understand how dangerous and horrible their thinking is. 

We’ve gone a few rounds over storyboarding and representative images in the past. 

Today’s topic is “n-up”, which is deployed, I surmise, after examining a few results, replicates or subjects that look promising for what the lab would prefer to be so. It raises my hackles. It smells to me like a recipe for confirmation bias and false alarming. To me.

Apparently this is normal phrasing for other people and merely indicates the pilot study is complete? 

How do you use the phrase?

33 Responses to “Shorthand”

  1. tom Says:

    for me, it is phrasing for “we have some prelim data, let’s see if it stands up to the full sample size.” I never realized it had any negative connotation.


  2. banditokat Says:

    I use it to mean ‘negroni me up’. Like, put a lot of liquor in me. I think it’s the commonly accepted usage.

    Alternatively, when I see students who ALWAYS have n=3 and SUCCESS, we spent a shit ton of time going over their data the day of the experiment and I look in very frequently because I’m sort of awesome but have never ever only had n=3 other than a recipe to do more n’s.

    I’m also not sure that confirmation bias plays as critical of a role in things like structural biology and chemistry where you don’t have as many moving parts that can be impacted by confirmation bias.

    Which is to say that behavioral folks are woo artists.


  3. odyssey Says:

    How do you use the phrase?

    I don’t.


  4. poke Says:

    “I’m also not sure that confirmation bias plays as critical of a role in things like structural biology and chemistry where you don’t have as many moving parts that can be impacted by confirmation bias.”



  5. namnezia Says:

    It’s like 7-up, but more flexible. It’s quite refreshing!


  6. Spike Lee Says:

    Never used it. And I’m unclear on the usage. Does it mean, “increasing the N to make an effect more convincing”? Or is it more like, “collecting more data to test whether an effect is reliable or not”?

    I’ve got no problem with the latter usage, but the former is plainly confirmation bias at work.


  7. tom Says:

    “Or is it more like, “collecting more data to test whether an effect is reliable or not”?”

    this is how I think of it.


  8. dr24 Says:

    I have never heard it until now. But I’m disturbed by how many scientists don’t understand sample size, under power, over power, and other really basic statistical concepts.

    The reason there’s a “replication crisis” is in no small part because when you do bad stats, you report false findings. And that’s *rampant* in biomed.


  9. dr24 Says:

    Hell, I’d bet 95% of biomed scientists can’t accurately explain what a 95% CI is.


  10. Philapodia Says:

    Me: “Did you do stats on your data?”

    Student: “N-up. Should I have?”


  11. drugmonkey Says:

    “Or is it more like, ‘collecting more data to test whether an effect is reliable or not’?

    I’ve got no problem with the latter usage, but the former is plainly confirmation bias at work.”

    Isn’t deciding to collect more data an act of confirmation bias?


  12. dr24 Says:

    “Isn’t deciding to collect more data an act of confirmation bias?” No. Don’t be ridiculous.

    Seeking additional data to support or negate a pilot finding is not the same as confirmation bias.


  13. tom Says:

    We breed up animals and do manipulations in the lab. for example, if we have 4 groups, we would need roughly 40 mice to power it appropriately (as we do know how to run power analyses). to breed up 40 mice simultaneously would require more cage space than I can devote, as there are multiple projects. Because of this, we often have to do in 2 cohorts of 20 instead. so we will get half the data set..finding X, and then have to n it up by completing the experimental design.


  14. Spike Lee Says:

    Collecting more data doesn’t necessarily imply a confirmation bias. To me, it’s all about your attitude towards that new data.

    If you’re willing to treat the new data with the appropriate detachment and skepticism, then you’re fine. If, on the other hand, you have the urge to massage the new data to line up with your original expectations, you are succumbing to bias.

    The caveat here is that it’s hard to be detached, and to avoid being self serving. But I’d say it’s not impossible.


  15. Grumble Says:

    I use “N-up” when ordering a multi-dimensional lemon-lime soda.

    “Isn’t deciding to collect more data an act of confirmation bias?”

    Huh? Let’s say you do an experiment with N=1 and get an interesting result. Pretty much all scientists would agree that a result from an N of 1 is meaningless. There is no bias involved at all – none whatsoever – in saying, “that looks interesting, but I’m not going to believe it until I try the same experiment 5 more times and see the same result.” And then not believing the result if your expectations weren’t met.


  16. drugmonkey Says:

    Some would suggest you never “do an experiment with N=1”. What you do is that you choose to commit to an experiment with the N you need for the purpose. If you need a pilot, fine. But then you aren’t going to “n up” that pilot into a real study. You are going to step back and commit to the full study with N=8, 12, 16 whatever you need.

    To me, “n-up” sounds like someone who just willy-nilly runs the experiment until it tells them what they want to hear, stopping at that point or at the point of overwhelming exhaustion if there is ultimately a negative result.

    “Because of this, we often have to do in 2 cohorts of 20 instead. so we will get half the data set..finding X, and then have to n it up by completing the experimental design.”

    yeah. but you can see the danger here, correct? bias for positive. bias for large enough effects that they appear with half the power. the lure of going with 1/2 of the intended sample because the stats look great with the first half…..

    I actually brush up against a related version of this now and again and I’m going to have to recommit myself to making sure we do not fall into the traps I envision.


  17. jmz4 Says:

    N-up is for when you have one result you suspect is an outlier, and you want to show it (out of an n of say 3-4). It’s getting a better sense of the distribution of the value you’re measuring and the experimental variation.
    When you’re sampling, this seems like the right and proper thing to do.
    However, a similar effect is the reason why you never show your boss a “positive” result with an n=1. Cause you’re going to have to do that experiment over and over until you definitively prove the initial result to be an outlier.


  18. qaz Says:

    The real question is whether @tom ever stops experiments after the first cohort if the data isn’t trending towards statistical significance. That would be a problem.

    PS. There are statistical ways to do this correctly, but it’s very complicated and I’m pretty sure most neuroscientists don’t know how to do it. Basically, you have to preallocate the significance portions of your distributions. If you do any stopping tests, you have to increase the significance thresholds for your later data.

    It is perfectly fine to do a pilot experiment with n=1. It is perfectly fine to use that one animal as the decision criterion to do an experiment or not. But then you need to start over with a new cohort of animals and NOT include the pilot animal in your statistics.
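A rough simulation may make the stopping problem concrete. The sketch below is an illustration only, not anything from the thread: it assumes two groups drawn from the same distribution (so every "hit" is a false positive), a z-test with known SD of 1 instead of a t-test, one interim look at half the planned sample, and made-up names (`z_test_p`, `simulate`). The "split" rule simply tests at 0.025 on each look, a crude Bonferroni-style stand-in for the more careful threshold allocation described above.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming known SD = 1
    and equal group sizes (a simplification; a real analysis would use a t-test)."""
    n = len(a)
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

def simulate(n_sims=20000, half=10, alpha=0.05, seed=1):
    """Return false-positive rates under a true null for three analysis rules."""
    rng = random.Random(seed)
    fixed = peek = split = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: there is no real effect.
        a = [rng.gauss(0, 1) for _ in range(2 * half)]
        b = [rng.gauss(0, 1) for _ in range(2 * half)]
        p_half = z_test_p(a[:half], b[:half])   # look after the first cohort
        p_full = z_test_p(a, b)                 # look after the full sample
        fixed += p_full < alpha                          # test once, at the planned N
        peek += p_half < alpha or p_full < alpha         # stop early if it "works"
        split += p_half < alpha / 2 or p_full < alpha / 2  # stricter threshold per look
    return fixed / n_sims, peek / n_sims, split / n_sims

fixed_rate, peek_rate, split_rate = simulate()
print(f"fixed N: {fixed_rate:.3f}  peek at half: {peek_rate:.3f}  split alpha: {split_rate:.3f}")
```

Under these assumptions the fixed-N rule should land near the nominal 5%, the peeking rule noticeably above it, and the split-threshold rule back at or below 5%.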


  19. neuromusic Says:

    In my little corner of science (and I presume many others), I need to assess the quality of the data before choosing whether to include said data into my actual analyses.

    I don’t mean “quality” w/r/t the main effect under inquiry, but rather, “do I have something under observation here to observe or not?”

    Specifically, I have electrophysiological recordings from which I need to identify and isolate neurons spiking. I need to judge the quality of these neurons based initially on my own observations and later on more detailed metrics. This is all done without any consideration of “the statistics”.

    I don’t think I’ve ever used the term “N-up” but I can see it being used in this way, especially when piloting a new method or prep or technology… a new electrode design, a new imaging tech, a new calcium indicator. Run the pilot. Is the technology working? Good. Time to N-up.

    If you are studying cheetah behavior, you better not accidentally include a water buffalo in your analysis. It’s totally OK to look through the eyepiece to make sure you’ve got a cheetah on the other side of your camera.


  20. tom Says:

    @drugmonkey. I see your point. I will be more mindful of this.


  21. drugmonkey Says:

    me too tom!


  22. bacillus Says:

    Regarding bad stats, how come people who publish in CNS don’t have to correct their data for multiple comparisons? It seems like every other biology paper in this triumvirate runs multiple test groups against one negative control group and declares any results with a p value <0.05 as significant. It's often quite easy to generate the raw data from figures (e.g. survival curves) then re-run the stats, and more often than not find them wanting.


  23. drugmonkey Says:

    I don’t know how that could be since CNS publish only the best science after the most rigorous review process.


  24. jmz4 Says:

    You mean like 5 independent t-tests instead of an ANOVA?
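For a sense of scale, here is a small sketch of how the chance of at least one false positive grows with the number of uncorrected tests. It assumes the tests are independent, which is only an approximation when several groups share one control, and the function name `familywise_error` is invented for this example. The Bonferroni line is one simple correction, not necessarily the one a given journal policy asks for.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests,
    each run at the given per-test alpha."""
    return 1.0 - (1.0 - alpha) ** k

k = 5
uncorrected = familywise_error(k)            # each test at alpha = 0.05
bonferroni = familywise_error(k, 0.05 / k)   # each test at alpha / k

print(f"{k} tests at 0.05 each: {uncorrected:.3f}")      # 0.226
print(f"{k} tests at 0.05/{k} each: {bonferroni:.3f}")   # 0.049
```

So five uncorrected comparisons carry roughly a 23% chance of at least one spurious "significant" result, under the independence assumption.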


  25. Dave Says:

    I use it to demonstrate to students that an interesting result with 2 mice in each group is meaningless statistically. I have them repeat the experiment in more mice, and typically the data even out and the differences are gone. Occasionally something will hold, but rarely. That’s science.

    We see this a lot in mice when looking at their physiology, and especially when running things like luminex ELISAs on blood markers etc. Very noisy.


  26. Dave Says:

    You still hear people say that multiple comparison correction is too ‘conservative’. That’s total bullshit if done correctly.

    I recently caught a colleague using one-tailed Ts in a pre and post treatment group, but two tailed Ts in the pre and post placebo group. I had to fight the senior author to get the data removed from the paper (it was not significant when analyzed correctly).

    I have no doubt this happens everywhere.


  27. djbarker Says:

    A few other points:

    1) Using the definition of confirmation bias being proposed here, secondary experiments meant to converge on a single explanation could also be considered as confirmation bias. Most of these experiments are distinctly designed to test an idea you believe to be true based on a prior experiment. Often, these are based on results that could not have been predicted to allocate animals ahead of time. So, are we replicating or acting on our bias? Where is the line?

    2) If we follow the ‘saint-like’ path, set our animals ahead of time and never “N up”, our results will stand and forever be replicated, right? Wrong. This is one of the major arguments against NHST. At high enough sample sizes, any small difference will be significant. So what’s worse, the person who chooses to N-up to reveal a large effect or the person who plans for small effects and has enough animals to detect almost anything?

    3) In an age where scientists are pushing the limits of technology, there are so many practical reasons why even a person who did everything by the book would need to “n-up”. Many experiments require a few ‘perfect’ animals where the virus expresses, the catheter works, the cannula placement is correct, learning occurs, the equipment is functioning… Does this filtering count towards your initial N? Where do you draw the line for when to stop filtering? Tell the data gods your plans…

    The only way to handle these things is to provide good training and rely on good, honest science. There are bad reasons to add subjects, bad ways to analyze the data (despite perfect a priori planning), and bad ways to design follow up experiments. Accordingly, there are appropriate ways to “N-up”. The beauty of science is that it self-corrects (eventually).
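Point 2 above, that a small enough difference becomes "significant" once N is large, can be sketched numerically. This is a simplified illustration with assumptions of its own: a two-group z-test with known SD, an observed standardized difference fixed at exactly 0.05, and a helper name (`p_value`) made up for the example.

```python
import math

def p_value(d, n):
    """Two-sided p-value for an observed standardized mean difference d
    with n subjects per group, assuming known SD (z-test)."""
    z = d * math.sqrt(n / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same tiny difference, increasing sample size per group.
for n in (100, 1000, 10000):
    print(n, round(p_value(0.05, n), 4))
```

The difference never changes in size here; only the sample size does, which is why an effect-size estimate is worth reporting alongside the p-value.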


  28. drugmonkey Says:

    “It costs too much…”. “It is too hard….”

    Valid excuses for piss poor science?

    Who determines the cost or effort threshold?


  29. drugmonkey Says:

    Also, djbarker, the confirmation bias lies in deciding not to complete negative-looking exp and to keep running other twists on the experiment until it is guaranteed to produce the result you want. With all the wrinkles ironed out.


  30. djbarker Says:

    The argument was not that cost or effort are the problem. I can see where I was unclear, though. There is a gap between the theory and practice. The textbook approach of power analysis and a priori planning does not account for modern attrition, nor teach a scientist how to. In many cases, the factors that cause attrition are extraneous variables that must be accounted for.

    If you plan perfectly, and have lots of attrition, you are left with a decision to add N or fail to reject a false null.
    If you plan for attrition, and have none, you run into overpower problems, especially with NHST.

    As you echoed in the second response, the decisions are what matter. Good training begets good decisions.


  31. drugmonkey Says:

    So does the influence of good community standards.

    …I wonder how we promulgate those?


  32. Draino Says:

    Whenever someone suggests we just increase our N to see if it becomes statistically significant, my response is to try a different kind of experiment to address the same question. Usually you can come up with something independent to support or disprove the idea. No need to “N-up” by repeating an experiment that was borderline informative to begin with.


  33. bacillus Says:

    @jmz4. Yes this is what I mean. It’s even worse when they compare not only all the test groups to the one control, but also every test group against each other, and still don’t correct for multiple comparisons. C/N/S have fancy policies on the use of stats, but you’ll see them broken in every single edition of each of them. Given the terrible grasp of stats most biomedical researchers seem to have, I’m happy to put this down to ignorance by editors, authors and reviewers alike. Personally, with small n type experiments I’d run a t-test, ANOVA or whatnot then a Cohen’s d. If both say significant I’ll go ahead with the big expt. Last big expt I ran used 50 two year old mice that I obtained at 8 weeks of age. Each of them was worth at least a grand, but I knew I needed a big n to begin work on the phenomenon of interest. Namely, why did 8/50 of these mice react badly to vaccination. Half the time it’s due to co-morbidities we don’t discover until necropsy at the end of the expt which leaves us with 4 “good” mice. OTOH, co-morbidities are the norm in human immunosenescence, but how the hell can you ensure that some of your non-reactors have the same co-morbidities as your reactors for matching purposes?

