British Journal of Pharmacology issues new experimental design standards

December 23, 2015

The BJP has decided to require that manuscripts submitted for publication adhere to certain experimental design standards. The formulation can be found in Curtis et al., 2015.

Curtis MJ, Bond RA, Spina D, Ahluwalia A, Alexander SP, Giembycz MA, Gilchrist A, Hoyer D, Insel PA, Izzo AA, Lawrence AJ, MacEwan DJ, Moon LD, Wonnacott S, Weston AH, McGrath JC. Experimental design and analysis and their reporting: new guidance for publication in BJP. Br J Pharmacol. 2015 Jul;172(14):3461-71. doi: 10.1111/bph.12856 [PubMed]

Some of this continues the “huh?” response of this behavioral pharmacologist who publishes in a fair number of similar journals. In other words, YHN is astonished this stuff is not just a default part of the editorial decision making at BJP in the first place. The items that jump out at me include the following (paraphrased):

2. You should shoot for a group size of N=5 or above and if you have fewer you need to do some explaining.
3. Groups smaller than 20 should be of equal size, and any deviation from equal sample sizes needs to be explained, particularly exclusions or unintended loss of subjects.
4. Subjects should be randomized to groups and treatment order should be randomized.
6.-8. Normalization and transformation should be well justified and follow acceptable practices (e.g., you can’t compare a treatment group to the normalization control that now has no variance because of this process).
9. Don’t confuse analytical replicates with experimental replicates in conducting analysis.

Again, these are the “no duh!” issues in my world. Sticky peer review issues quite often revolve around people trying to get away with violating one or other of these things. At the very least reviewers want justification in the paper, which is a constant theme in these BJP principles.

The first item is a pain in the butt but not much more than make-work.

1. Experimental design should be subjected to ‘a priori power analysis’… the latter requires an a priori sample size calculation that should be included in Methods and should include alpha, power and effect size.

Of course, the trouble with power analysis is that it depends intimately on the source of your estimates for effect size- generally pilot or prior experiments. But you can select basically whatever you want as your assumption of effect size to demonstrate a range of sample sizes as acceptable. Also, you can select whatever level of power you like, within reasonable bounds along the continuum from “Good” to “Overwhelming”. I don’t think there are very clear and consistent guidelines here.
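To put numbers on how elastic this is, here is a minimal sketch (mine, not anything in the BJP guidance) using Python and statsmodels; the Cohen's d values and power levels are invented stand-ins for whatever your pilot data might suggest. Depending on what you plug in, the "required" group size runs from about a dozen to over a hundred.

```python
# Hypothetical a priori sample size calculation for a two-group comparison.
# The assumed effect sizes (Cohen's d) and power levels are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for d in (0.5, 0.8, 1.2):          # assumed effect size, e.g. from a pilot study
    for power in (0.80, 0.95):     # "Good" vs. something closer to "Overwhelming"
        n = analysis.solve_power(effect_size=d, alpha=0.05, power=power,
                                 ratio=1.0, alternative='two-sided')
        print(f"d={d}, power={power}: ~{n:.0f} subjects per group")
```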

The fifth one is also going to be tricky, in my view.

Assignment of subjects/preparations to groups, data recording and data analysis should be blinded to the operator and analyst unless a valid scientific justification is provided for not doing so. If it is impossible to blind the operator, for technical reasons, the data analysis can and should be blinded.

I just don’t see how this is practical with a limited number of people running experiments in a laboratory. There are places this is acutely important- such as when human judgement/scoring measures are the essential data. Sure. And we could all stand to do with a reminder to blind a little more and a little more completely. But this has disaster written all over it. Some peers doing essentially the same assay are going to disagree over what is necessary and “impossible” and what is valid scientific justification.

The next one is a big win for YHN. I endorse this. I find the practice of reporting any p value other than your lowest threshold to be intellectually dishonest*.


10. When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups.

I’m going to be very interested to see how the community of BJP accepts* this.

Finally, a curiosity.

11. After analysis of variance post hoc tests may be run only if F achieves the necessary level of statistical significance (i.e. P < 0.05) and there is no significant variance inhomogeneity.

People run post-hocs after a failure to find a significant main effect on the ANOVA? Seriously? Or are we talking about whether one should run all possible comparison post-hocs in the absence of an interaction? (seriously, when is the last time you saw a marginal-mean post-hoc used?) And isn’t this just going to herald the return of the pre-planned comparison strategy**?
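For concreteness, here is my reading of that gatekeeping sequence as a minimal Python sketch (my code and invented data, not anything BJP prescribes): check variance homogeneity, run the omnibus ANOVA, and only then run the post hoc comparisons.

```python
# Sketch of item 11 as a workflow: Levene's test for variance homogeneity,
# omnibus one-way ANOVA, and Tukey post hocs only if both gates are passed.
# Group names and data are invented for illustration.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
groups = {name: rng.normal(loc=mu, scale=1.0, size=8)
          for name, mu in {"vehicle": 0.0, "low dose": 0.5, "high dose": 1.5}.items()}

lev_stat, lev_p = stats.levene(*groups.values())   # variance homogeneity check
f_stat, f_p = stats.f_oneway(*groups.values())     # omnibus ANOVA

if lev_p > 0.05 and f_p < 0.05:
    values = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))   # post hocs allowed
else:
    print("No post hocs: omnibus F not significant or variances unequal.")
```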

Anyway I guess I’m saying Kudos to BJP for putting down their marker on these design and reporting issues. Sure I thought many of these were already the necessary standards. But clearly there are a lot of people skirting around many of these in publications, specifically in BJP***. This new requirement will stiffen the spine of reviewers and editors alike.

__
*N.b. I gave up my personal jihad on this many years ago after getting exactly zero traction in my scientific community. I.e., I had constant fights with reviewers over why my p values were all “suspiciously” p<0.05 and no backup from editors when I tried to slip this concept into reviews.

**I think this is possibly a good thing.

***A little birdy who should know claimed that at least one AE resigned or was booted because they were not down with all of these new requirements.

39 Responses to “British Journal of Pharmacology issues new experimental design standards”

  1. Ola Says:

    #3 ain’t gonna be easy. Say you’re comparing 3 different treatments in WT and KO mice, 8 groups total (including untreated controls). Say the experiments run one animal, one condition, per day. What are the chances for a prospective N of say 8, that you have the exact right number of animals on the right days, all spread out over 64 consecutive experimental days (3 months in reality)?

    The more likely scenario is you’re gonna end up with 6 in some groups, 10 in others, and a whole mush of 7s and 9s too. Some days the KO colony won’t be breeding properly. Other days test drug #2 won’t dissolve so you do #3 instead. Then the student doing the work goes on medical leave and you want to get the paper out before year’s end to avoid being scooped, so you just have to write it up as-is, perfect equal Ns be damned!

    Oh, and the single post-doc doing this is gonna be blinded to all the groups, because of course they’re never the person doing the breeding and the genotyping of the mice, and the mice never have any visible phenotype that allows you to tell if they’re KO just by looking. We all have mega labs and can afford separate people to do each task in the data pipeline.

    This is just more fucking glam mag wannabes chasing big papers from the glam labs that can afford this BS, meanwhile excluding bread and butter scientists who cannot compete. This is no different than CNS demanding 20+ supplementary figures. Yes, better data integrity is a good thing to shoot for, but imposing rules like this ignores the grim reality of life at the bench.

    For human clinical stuff, it’s a different matter (NCT registry and all that fluff), but for a basic science journal this shows a shocking degree of naïveté.

  2. Draino Says:

    “10. When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups."

    What is wrong with presentation of multiple levels of significance, when some comparisons are more significant than others? Too many asterisks is too much icing on the cake? I must be missing something here.

  3. drugmonkey Says:

    It falsely implies a result is less likely to have occurred by chance. Since you’ve already said your threshold is 0.05, anything else is intellectually dishonest.

  4. Draino Says:

    Is it intellectually dishonest only for certain kinds of studies? Because here I am editing some figure legends, comparing the weight of certain organs between mutant and control, and I’m saying * P<0.05 for some organs and *** P<0.001 for other organs that are really f**kd up. Do you disagree with that approach?

  5. The Other Dave Says:

    Are you kidding me? Those dumb rules belie a misunderstanding of statistics almost as bad as the screw-ups that forced the rules in the first place. If BJP really wants to fix the problem, it can tell the editors to stop biasing for papers with ‘significant’ results, and tell authors to report actual P values.

    P-value cutoffs (like 0.05) are artifacts from pre-computer days when the numbers had to be looked up in tables. Editors and reviewers need to make authors report the actual computed P-value. No more asterisks. No more ‘P>0.05 therefore it’s not real’ bullshit. This trend is already picking up at other journals.

    And we also all need to remember that statistical significance does not necessarily equal biological significance. I have seen CNS papers touting ‘statistically significant’ differences that amount to a fraction of a single molecule!

  6. qaz Says:

    DM – that’s not correct. A p-value is a statement of the likelihood of a value as occurring by chance. A farther out event has a higher unlikelihood. The likelihood of seeing something with a p-value of 0.001 is ten times less than seeing something with a p-value of 0.01. Go back to the original math.

    One should report the ACTUAL p-value. (Go back and reread Fisher on p-values.) That way the reader can judge whether xe wants to believe something at a 0.05 threshold or a 0.001 threshold or whatever.

    What a p-value reports is the likelihood that you would see value x given a specific distribution. That’s all. Anything else (including whether that distribution has anything to do with your data at all) is a different question.

  7. David Condon Says:

    I’m not so sure such rules are a great idea. I can readily think of perfectly valid studies that violate those rules. For instance a preliminary safety study on a new drug might have no need for a separate control group. I’d say maybe those are study guidelines for computing average effects, but some studies aren’t trying to compute the average effect of an experiment.

  8. drugmonkey Says:

    qaz- that is bullshit and you know it. We do *not* use p values as descriptive statistics. We use them as a decision criterion. A cutoff. A threshold.

    They *could* be used solely as descriptive stats, as you seem to be saying, but they are not. It is dishonest to conflate these uses in papers by reporting anything other than whether it passed the decision criterion or not.

  9. qaz Says:

    DM – the fact that p-values are misused does not change the fact that they *are* descriptive statistics. I use them as descriptive statistics in my papers, and I insist that they get used as descriptive statistics in the papers I review(*), and (as far as I can tell) many (unfortunately, not all) of my colleagues in my field do as well.

    * Among other things, this means showing the whole distribution and doing the correct statistics on that distribution. It also means knowing what p less than 0.0000001 really means. (That your value is far outside your distribution, not that your experiment was correct!)

    It is incorrect to use p-values as a decision criterion. (Yes, I know that many people do this. Nevertheless, it is wrong, and should be stopped.) It provides a false sense of being “done” and having “solved a puzzle”. The correct way to do science is to report the data, including the correct descriptive statistics.

    The misuse of p-values as decision criteria plays into the reproducibility problem because we are using p less than x as a decision criterion. First, we should expect x percent of the times p less than x to be false positives. Second, because there are so many reasons an experiment can produce a negative result (like bad reagents), we don’t publish negative results, and we have a biased scientific literature. Instead, we should publish the complete descriptive statistics (including allowing things with p greater than x to be published as well), and we will find that the likelihood of reproducing a result is going to be related(**) to x.

    ** I’m sure it won’t be perfectly related. There are still lots of non-statistical ways for experiments to go wrong (bad reagents, bad design, bad practice, incorrect blinding, etc.) Nevertheless, at least we will have stopped one problem (the error of the decision criterion).

  10. The Other Dave Says:

    qaz: Those last couple posts are gems.

    DM: qaz is correct. His last post was a particularly nice summary of the situation. Everyone should understand and agree with every sentence that he wrote.

  11. The Other Dave Says:

    …but to avoid confusion, qaz didn’t literally mean “That your value is far outside your distribution…” He meant ‘far into the teeny weeny littlest part of the tail of your distribution’. e.g. ‘definitely possible but very very very rare’.

  12. ecologist Says:

    Wow, the statistical naivete of the journal’s new rules is pretty astonishing. I will mention a couple of points. One is that p-values are simply a mess. The flavor of applied statistics that developed, especially in biology, in the 20th century was a horrible mashup of incompatible theories of Fisher and of Neyman-Pearson. There *is* a way to use statistics as decision criteria, but that is not what we do in publications; we use the results of experiments as scientific evidence about our hypotheses. There is a great discussion of this, with direct application to clinical research, to be found in Royall (1997, Statistical Evidence: A Likelihood Paradigm, Chapman and Hall Monographs in Applied Statistics and Probability). Roughly speaking, trying to “fix” p-values the way that BJP is trying to do is simply not appropriate.

    Second issue is that they don’t address any of the ways that have been developed that *do* address experiments as evidence. They don’t, for example, give any guidance on the use of information criteria (AIC and the like). Why not?

    Most glaring, though, is that the guidelines seem to imagine that the be-all and end-all of scientific studies is a t-test or, perhaps, an analysis of variance. That seems pretty outdated. In many fields, the actual goal is not a p-value, it is an estimate of something. I don’t know anything about pharmacology, but I expected that estimation of functions (e.g., dose-response, or survival, or rates of some sort) would be a major activity. So, I went to the BJP, and sure enough, in the few open-access papers in the most recent issue, the results were exactly that. So it would seem that the p-value-centric standards are missing a lot of what pharmacological researchers actually do.

    I’m (genuinely) curious — are statistical null hypothesis tests really so central to this field, even though some significant fraction of the publications in the journal do something else?

  13. qaz Says:

    @TOD: He meant ‘far into the teeny weeny littlest part of the tail of your distribution’. e.g. ‘definitely possible but very very very rare’.

    Yes, TOD, that’s better said. Thanks.

    @ecologist – It’s not that statistical null hypothesis tests are so central to this field, but that most of the field doesn’t know how to use anything else. About every 2-3 years, there are a host of editorial commentaries in journals like JNeurophys and eLife that beg authors not to use statistical null hypothesis tests, and to use confidence intervals or Bayesian statistics or effect sizes or “anything else just not p-values”. (We’re about due for the next round…)

    Also, I suspect that part of the problem is that there are no trivial-to-use statistical packages for using more complex mathematics. In my experience, those more complex mathematics have to be tailored to the question at hand (*), and require quite a bit of mathematical sophistication to get an answer. It’s a lot easier to plug the numbers into some off-the-shelf application. Moreover, the simplicity of these packages provides the illusion of doing statistics correctly.

    * Null hypothesis tests also have to be tailored to the question at hand, since they only tell you how far out into the teeny-weeny-weeds of the distribution you are. Not whether that distribution is meaningful.

    By the way, we should remember that the whole p less than 0.05 cutoff is 100 year old mathematical technology. A lot of other fields have progressed in those 100 years. Shouldn’t we be using more advanced methods?

  14. The Other Dave Says:

    This discussion reminds me of a PhD defense about a year ago, where I turned out to be the harsh committee member (I made the student re-do all the statistics in the entire thesis, because they were all wrong. It shouldn’t have been a surprise. I even asked him: “Don’t you remember at your prelims when I told you that your planned statistics were not appropriate?” He said yes, but did what he did anyway because “it’s what everyone does”).

    Anyway, point of the story is that a couple other committee members were as befuddled as the student. One of them asked “just tell him what the correct statistical test is, so he can use it.”

    Ugh…

    That’s when I went into an hour long (seriously!) explanation that there is no such thing as a ‘correct’ statistical test for any particular situation. There are just lots of different options, with different plusses and minuses. Just like anything else in science, you need to understand the tools you are using and use them properly for what they were designed to do. Honestly, half his conclusions had low P values but didn’t even make sense. His results were not even internally consistent…

  15. drugmonkey Says:

    So dishonest qaz. You admit that I am right about how p values are actually used in science, acknowledge that it is your personal view that the world of science “shouldn’t” behave as it does and so therefore it is a-okay to pull this decision-criterion jujitsu.

    Continuing to go on about the math of a p value is beside the point. The question at hand is how they are *used*. And on this you appear to be agreeing with me.

  16. drugmonkey Says:

    TOD- so not surprised you are That Committee Member.

  17. The Other Dave Says:

    DM: You know what you sound like? You sound like the guy who holds on to the metal shaft of a screwdriver and uses the handle to pound screws into things. When told that you’re using the tool wrong, you say “It’s for putting in screws, right? Am I not doing that? Why are you quibbling?”

    Willful ignorance is not a virtue.

    I would enjoy serving on panel with qaz. People like him restore my faith.

  18. JustThisGuyYouKnow Says:

    Isn’t the point of a journal setting standards to improve the use of statistics in their field? If pharmacology uses p-values this wrongly, then shouldn’t the journal set a standard that gets away from this misapplication rather than cementing it (and also hiding useful information — there is nothing false about the fact that a lower p-value means a result is less likely to have arisen if the null hypothesis were true)?

  19. jmz4 Says:

    “Also, I suspect that part of the problem is that there are no trivial-to-use statistical packages for using more complex mathematics.”
    -There are quite a lot of them online these days; if you look around you’ll find some pretty great stuff. Most of them even have tutorials and explanations of the math behind it. I generally settle for knowing why and how they differ from each other.

    And, of course, R has every test you could ever dream of, and the tutorials on how to run it. So lack of access isn’t a great excuse for improper statistics.

    I’d suspect that biology’s aversion to statistics stems from disinclination (most of us are bad at or ignorant of math) and a desire to publish results. As Qaz/TOD point out, there’s a huge difference between statistical significance and biological relevance, but somehow the cult of the .05 p value has such disgusting sway over biologists’ minds. This notion that a .05 value is the be-all and end-all for a piece of data’s significance is lunacy, but it’s a mindset that is abused by the unscrupulous and the dullards to advance their publication record.
    Personally, I generally just look at the error bars and effect sizes for most things to decide if I’m convinced. Big-time n-omics data, though, yeah, you need statistics and multiple comparison correction and all that fancy stuff, or you can delude yourself very easily.

    As for fraud generally, making people post full, or at least expanded, image series would go a long way towards reducing it.

    @TOD
    Shame on you, making PhD students learn things.

  20. drugmonkey Says:

    I generally just look at the error bars and effect sizes for most things to decide if I’m convinced.

    Right, and we’re supposed to do anything but laugh at any other opinion on stats you care to offer?

  21. qaz Says:

    DM – How is it dishonest to try to get scientists to do science correctly? I would *hope* that explicit journal-publication rules about required analyses would be based on doing things properly, not doing things popularly.

    jmz4 – The people we are talking about do not understand enough programming to use R. They don’t even understand concepts of things like variables or functions. They are using programs like GraphPad Prism or SigmaStat or Excel. One lab I tried to collaborate with sent their data out to a statistician to have the statistician do an ANOVA on it. (They complained to me that they had wasted hundreds of dollars on a non-significant result, because I had casually said “you could just run a quick ANOVA to see what’s going on” thinking this was a five-minute process in R or Matlab.) Another lab was typing their data into Excel and sometimes wrote the words “one”, “two”, “three” rather than the numbers! These were both extremely well-published labs in journals like BrJPharm.

    PS. It would be an interesting study, and I’m not sure how I’d set it up, but I bet that jmz4’s inter-ocular trauma test (“It looks different.”) is more likely to produce replicable results than the statistical p less than 0.05 decision criterion currently used.

  22. drugmonkey Says:

    The error bar eyeball test fails with repeated measures designs and I have the dumbass reviewer comments to prove it.

    For the home viewer, get out a piece of graph paper and sketch out a weight loss trial in which the product causes a 5 pound loss in each subject. Make sure to sample everything in the meat of the (US) distribution- from a 100 pound woman to a 250 pound man in your dataset.

    Plot the means and SDs. Eyeball it.
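    For the graph-paper-averse, the same exercise as a quick Python sketch (my invented numbers; a little noise is added around the 5 pound loss so the paired test is well defined): the group SDs dwarf the effect, so the eyeballed error bars overlap almost completely, while the repeated measures (paired) test picks it up immediately.

    ```python
    # Simulated weight loss trial: every subject loses roughly 5 lb.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    before = rng.uniform(100, 250, size=20)              # baseline weights, lb
    after = before - 5 + rng.normal(0, 0.5, size=20)     # ~5 lb loss per subject

    print(f"before: mean {before.mean():.1f}, SD {before.std(ddof=1):.1f}")
    print(f"after:  mean {after.mean():.1f}, SD {after.std(ddof=1):.1f}")
    # Group SDs of roughly 40 lb swamp a 5 lb shift; plotted error bars overlap almost entirely.

    t, p = stats.ttest_rel(before, after)                # paired, within-subject test
    print(f"paired t-test: t = {t:.1f}, p = {p:.2g}")
    ```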

  23. drugmonkey Says:

    Older post related to the eyeball technique of inferential stats:

    Do Appearances Indicate Anything About the Quality of the Data?

  24. drugmonkey Says:

    The repeated measures eyeball issue is a subplot of this post:

    IACUC 101: Satisfying the erroneous inference by eyeball technique

  25. dsks Says:

    The term “significance” refers to whether a difference meets the arbitrary threshold chosen by the investigator for rejection of the null hypothesis; it’s a straightforward either/or call; it did or it didn’t, there is no continuum. In this sense, BJP is correct that it is misleading to attempt to speak in terms of grades of “significance” (after all, if one accepts that P=0.01 is “more significant” than P=0.049, then by extension one must, for consistency, accept that P=0.051 is “almost significant”).

    This does not preclude stating P values, btw, and I’m not sure the BJP is ruling out this practice. They are simply asking investigators to draw their conclusions in the conventional manner based on a simple up/down decision on whether the null hypothesis has been rejected. If investigators want to include P values as an indicator of the relative reliability of observed differences (or lack thereof), okay, but the current convention in the life sciences* (for good or ill) is that only the threshold matters for purposes of hypothesis testing.

    * whether this could be improved – perhaps even throwing out either/or hypothesis testing and talking about probability as a continuum – is not what this particular fight is about, imho. BJP is merely trying to improve adherence to the currently accepted convention. A full on overhaul of how we interpret data in the life sciences is going to require a lot more coordination than can be achieved by a single scientific society.

  26. qaz Says:

    Yeah, DM, there are lots of ways to do bad science. Bad experimental design is one of them. Using p less than 0.05 as an arbitrary cut-off is another.

    @dsks – You can define your words (“significant”) any way you please, but trying to argue that p=0.051 means you didn’t find an effect while p=0.049 means you did shows a deep misunderstanding of what p-values mean. (Yes, I know that’s the way it is usually used. It doesn’t change the fact that it is wrong.) We need to remember that p-values are reporting a measurement of relationships between samples. We are trying to interpret that as having an implication for the rest of the world, which we hope our samples are somehow reflective of. This means that p-values are effectively *estimates* of validity of the rejection of the null hypothesis. We should not expect that the next experiment is going to find exactly the same p-value, any more than we expect that the next experiment is going to find exactly the same values. What we really want to do is estimate the distributions of the things we’re measuring and talk about that. (But that’s the advanced modern statistics that are hard to implement and hard to report.)

    I will put my bets on well-designed experiments reporting the relationships between distributions over poorly-designed experiments that find arbitrary cutoffs of p=0.05.

  27. The Other Dave Says:

    DM: The whole point of figures is to create something that people CAN eyeball. Figures are devices for visual communication. If yours don’t communicate your point accurately and clearly, then you are doing figures wrong.

    I think the major problem is actually an *overreliance* on statistical testing. Too many people are relying on mindless (and tragically also often miscalculated) statistical spew and not enough on common sense. A P value <0.00000001 is less convincing to me then five methodologically different approaches that all showed the predicted effect. This is more convincing to me because I know that the result doesn't depend on a particular method. It suggests that it is repeatable. It suggests that there probably aren't twenty other unreported experiments that showed no effect (this is a HUGE problem, because you can get P<0.05 for anything if you only publish the 'one experiment that worked' out of twenty experiments, and I think many labs do pull this shit.) None of that confidence in the conclusions has anything to do with any P value. I actually love statistics, because I think it's a fascinating subject that gets to the heart of how we understand and precisely communicate reality. But it's not the goal. It's just one of many tools that can help us get to the goal.

    People need to learn to think again instead of relying on proxies, whether those proxies are impact factors, H-indices, or P values.

    So, yea, DM. Maybe I *am* THAT committee member sometimes. But I'd rather a student use no statistics at all than fuck them up. Don't show me ugly uninterpretable western blots and tell me that they support your conclusions. And don't show me fucked up statistical comparisons and tell me that those support your conclusions.

  28. jmz4 Says:

    “Plot the means and SDs. Eyeball it.”
    -I see your point, and I’ll admit those types of studies are not things I really come across often (I tried to read about mixed-effect models and gave up). But do people really present data this way? E.g. as just straight pounds lost, without any normalization for the underlying broad distribution of the starting population? E.g: Using the random number generator for excel to calculate placebo weight loss for the control group/arm (variance of +5 and -5), you end up with essentially zero average weight loss. So comparing that to the average 3.5% weight loss (as a percent of initial weight) for my hypothetical distribution (20 individuals ranging from 103 to 250 lbs), with a st. dev of .8%, this looks fine by eyeball. -3% (+/-.8) vs. .3%(+/-2.2%). Graphing it, it doesn’t really fail the eyeball test. Error bars are still outside of each other.

    And most people report SEM for their error bars anyway, even when it isn’t really appropriate, at least as far as my understanding of the distinction.

    “Right, and we’re supposed to do anything but laugh at any other opinion on stats you care to offer?”
    Ouch.
    I should have qualified that statement with “for papers in my field”, since I do go along with p-values and statistics for things I have no experience in, but…
    If these statistics were done in a more meaningful way, I would pay more attention to them (as I do for the good omics labs, which have statistical and computational backgrounds).
    As it is, the p<.05 threshold you endorse is usually meant to just proclaim data as important without any real attempts to justify their statistics or experimental design. An example is people who cite a 5% difference in lifespan as significant by Kaplan-Meier estimates (log-rank test) when they only measure in intervals of 10-15% (mostly in worms so 1 and 2-3 days, respectively). The experimental design doesn't really justify the statistics being used, but they don't care as long as they get a p<.05.

    We've all read hundreds or thousands of papers in our fields, papers which use many of the same analyses and presentations of data. You should have a good knowledge of how much variance assays have, what are large effect sizes, what are small ones. Relying on this experience is a much better means of assessment than relying on a (with the current state of the art) completely arbitrary threshold for significance.

    All this is before you take into account that the fetishization of the .05 significance level leads to people hacking their data to fit under it (e.g. the Masicampo and Lalande effect).

    As to qaz's question. Can we think of an example where, in a well controlled experiment, things look significant by eye but are meaningless noise? I can only think of ontogeny analysis, which has its own more rigorous set of statistics anyway.

  29. drugmonkey Says:

    I didn’t “endorse” any threshold. Do try to address my points and not the thing you and qaz prefer it to be about.

  30. The Other Dave Says:

    “As to qaz’s question. Can we think of an example where, in a well controlled experiment, things look significant by eye but are meaningless noise?”

    This happens quite often; how often is what the P value is supposed to represent.

  31. MoBio Says:

    @TOD

    “A P value <0.00000001 is less convincing to me than five methodologically different approaches that all showed the predicted effect.”

    Yes independent, convergent lines of evidence.

    In the absence of 'five approaches' I typically will have key experiments performed independently by collaborator(s) and by another postdoc in my lab. When small molecules are involved they are aliquoted as coded compounds. When behavior is involved we will always replicate with an independent cohort.

  32. qaz Says:

    @TOD – There are lots of cases where experiments look significant by eye but are not significant, particularly with low n. This is the key to statistics: they measure the unlikelihood of seeing some result. Correct statistics with effect sizes, p-values, and confidence intervals are all useful tools to help you understand if what appears different is really different.

  33. The Other Dave Says:

    “…help you understand if what appears different is really different”.

    …assuming certain things about the sampling and underlying distribution from which the data was sampled.

  34. drugmonkey Says:

    In the olden days converging evidence and replication with independent samples was what the community of scientists did to advance understanding. Is there any particular reason to insist that has to happen within one “lab”? Is this not a very big driver of mega-labification, the corresponding lack of opportunity and $$ for noobs and a slowing of advances?

  35. qaz Says:

    Converging evidence within a lab is definitely good. Converging evidence across labs is better.

    But I do agree that there has been a willingness to insist that papers include converging evidence. This is part of the issue where reviewers feel they are allowed to insist on additional experiments to check a point. I think that the insistence on converging evidence and replication within lab is closely tied to the insistence that papers be complete stories, that papers need to report discoveries that are perfectly replicable (rather than being experimental observations or theoretical syntheses). It is closely tied to the current complaint that “a large subset of experiments aren’t replicable” (rather than seeing the field as a whole working on discovering the underlying truth). I suspect it is also related to grants being harder to get, and thus that being wrong in one’s interpretation is believed to be the kiss of death.

    And, I suspect that it is part of why people like the idea of (mis)using p-values as decision-thresholds of whether to believe a result or not.

    I would say that converging evidence within a lab is good, converging evidence across labs is better, and that we should see science as much more of an integrated whole rather than a bunch of decision-problems.

    PS. DM – Aren’t the following statements endorsing a threshold? “I find the practice of reporting any p value other than your lowest threshold to be intellectually dishonest*.” “* I gave up my personal jihad on this many years ago ….” “We do *not* use p values as descriptive statistics. We use them as a decision criterion. A cutoff. A threshold.”

  36. drugmonkey Says:

    No, qaz, it is not an endorsement of p<0.05, which was the charge. But since you ask, no, it isn't endorsing a threshold at all. It is a description.

    The drawback to within-lab replication (especially for a single paper) is that once you have the main result, you are now seeking to defend it, not to test it or to describe the limits to the generality. This, in my view, is more likely to result in shenanigans than is the between-lab replication attempt.

  37. qaz Says:

    Do you think the within-lab replication is a cause or a consequence of the “seeking to defend” phenomenon? That is, do you think that people do more within-lab replication because they really can’t afford to be wrong? Or do you think people are more interested in defending their territory because all of the “replication” is done within-lab, before publication? I wonder how much this is all a consequence of the changes in journal practice we’ve discussed (publication in GlamourMagz as primary report and consequential Glamourization of society journals)?

  38. drugmonkey Says:

    The causes lie elsewhere. It is coming from the desire of well funded labs to narrow the playing field such that only similarly elite labs can play. Many labs might be able to beat you to that finding in zebrafish. But if you have to show it in flies, mice and CRISPR a rat also? Narrows the competition for those Nature papers, doesn’t it?

    A side effect of this is the intense pressure for all the models to produce congruent data.

    People have always been keen to defend a finding they believe in. I argue that when multiple labs are arguing back and forth on a single sub-question that we get the quickest and most reliable advance. Nothing like proving the other lab wrong to support the publication of negative data or sharply limiting caveats.

  39. jmz4 Says:

    “It is coming from the desire of well funded labs to narrow the playing field such that only similarly elite labs can play.”
    -Honest question, as my experience is limited. Do you think this is really a conscious decision by these PIs, or is it more that their frame of reference for “good science” subconsciously follows their own pecuniary situation?

    “Nothing like proving the other lab wrong to support the publication of negative data or sharply limiting caveats.”
    -I agree completely with this paradigm; however, among my peer group, I’ve often found that failure to reproduce another lab’s work is met with frustration and self-doubt. At the PI level, data which contradicts published work is often ignored or rationalized away. It seems like it’s becoming (increasingly?) rare for publications to call out or dispute another publication, unless it is a really niche subject. And they’re almost never published in the same tier of journal. It would help if the gainsaying paper were linked to the first one via Pubmed (the way they currently do for Commentaries).
