BJP issues new policy on SABV

September 4, 2019

The British Journal of Pharmacology has been issuing a barrage of initiatives over the past few years that are intended to address numerous issues of scientific meta-concern including reproducibility, reliability and transparency of methods. The latest is an Editorial on how they will address current concerns about including sex as a biological variable.

Docherty et al. 2019 Sex: A change in our guidelines to authors to ensure that this is no longer an ignored experimental variable. https://doi.org/10.1111/bph.14761 [link]

I’ll skip over the blah-blah about why. This audience is up to speed on SABV issues. The critical parts are what they plan to do about it, with respect to future manuscripts submitted to their journal. tldr: They are going to shake the finger but fall woefully short of heavy threats or of prioritizing manuscripts that do a good job of inclusion.

From Section 4 BJP Policy: The British Journal of Pharmacology has decided to rectify this neglect of sex as a research variable, and we recommend that all future studies published in this journal should acknowledge consideration of the issue of sex. In the ideal scenario for in vivo studies, both sexes will be included in the experimental design. However, if the researcher’s view is that sex or gender is not relevant to the experimental question, then a statement providing a rationale for this view will be required.

Right? Already we see immense weaseling. What rationales will be acceptable? Will those rationales be applied consistently for all submissions? Or will this be yet another frustrating feature for authors, in which our manuscripts appear to be rejected on grounds that other published papers apparently get away with?

We acknowledge that the economics of investigating the influence of sex on experimental outcomes will be difficult until research grant‐funding agencies insist that researchers adapt their experimental designs, in order to accommodate sex as an experimental variable and provide the necessary resources. In the meantime, manuscripts based on studies that have used only one sex or gender will continue to be published in BJP. However, we will require authors to include a statement to justify a decision to study only one sex or gender.

Oh, a statement. You know, the NIH has (sort of, weaselly) “insisted”. But as we know, the research workforce is fighting back, insisting that we don’t have the “necessary resources” and, several years into this policy, researchers are blithely presenting data at conferences with no mention of addressing SABV.

Overall sex differences and, more importantly, interactions between experimental interventions and sex (i.e., the effect of the intervention differs in the two sexes) cannot be inferred if males and females are studied in separate time frames.

Absolutely totally false. False, false, false. This has come up in more than one of my recent reviews and it is completely and utterly, hypocritically wrong. Why? Several reasons. First of all, in my fields of study it is exceptionally rare that large, multi-group, multi-sub-study designs (in a single sex) are conducted this way. It is resource intensive and generally unworkable. Many, many, many studies include comparisons across groups that were not run at the same time in some sort of cohort-balancing design. And whaddaya know, those studies often replicate, with all sorts of variation across labs, not just across time within lab. In fact this is a strength. Second, in my fields of study, we refer to prior literature all the time in our Discussion sections to draw parallels and contrasts. In essentially zero cases do the authors simply throw up their hands and say “well since nobody has run studies at the same time and place as ours there is nothing worth saying about that prior literature”. You would be rightfully laughed out of town.

Third concern: It’s my old saw about “too many notes”. Critique without an actual reason is bullshit. In this case you have to say why you think the factor you don’t happen to like for Experimental Design 101 reasons (running studies in series instead of parallel) has contributed to the difference. If one of my peer labs says they used more or less the same methods this month compared to last year compared to five years ago…wherein lies the source of non-sex-related variance which explains why the female group self-administered more cocaine compared with the male groups run before, after and in between, all of which did the same thing? And why are we so insistent about this for SABV and not for the series of studies in males that reference each other?

In conscious animal experiments, a potential confounder is that the response of interest might be affected by the close proximity of an animal of the opposite sex. We have no specific recommendation on how to deal with this, and it should be borne in mind that this situation will replicate their “real world.” We ask authors merely to consider whether or not males and females should be physically separated, to ensure that sight and smell are not an issue that could confound the results, and to report on how this was addressed when carrying out the study. Obviously, it would not be advisable to house males and females in different rooms because that would undermine the need for the animals to be exposed to the same environmental factors in a properly controlled experiment.

NO SHIT SHERLOCK!

Look, there are tradeoffs in this SABV business when it comes to behavior studies, and no doubt others. We have many sources of potential variance that could be misinterpreted as a relatively pure sex difference. We cannot address them all in each and every design. We can’t. You would have to run groups that were housed together, and not, in rooms together and not, at times similar and apart AND MULTIPLY THAT AGAINST EACH AND EVERY TREATMENT CONDITION YOU HAVE PLANNED FOR THE “REAL” STUDY.
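
Just to put made-up numbers on how fast that blows up, here is a back-of-the-envelope sketch; every count below is invented purely for illustration.

```python
# Hypothetical back-of-the-envelope: what "control for everything" actually
# costs in animals. All numbers below are made up for illustration.
sexes = 2        # male, female
housing = 2      # same-sex vs. mixed-sex housed
room = 2         # same room vs. separate rooms
timing = 2       # run concurrently vs. in separate time frames
treatments = 4   # e.g., vehicle plus three dose conditions
n_per_cell = 8   # a typical behavioral group size

cells = sexes * housing * room * timing * treatments
total_animals = cells * n_per_cell
print(f"{cells} cells x {n_per_cell} per cell = {total_animals} animals")
# 64 cells x 8 per cell = 512 animals -- for ONE "real" study design
```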

Unless the objective of the study is specifically to investigate drug‐induced responses at specific stages of the oestrous cycle, we shall not require authors to record or report this information in this journal. This is not least because procedures to determine oestrous status are moderately stressful and an interaction between the stress response and stage of the oestrous cycle could affect the experimental outcome. However, authors should be aware that the stage of the oestrous cycle may affect response to drugs particularly in behavioural studies, as reported for actions of cocaine in rats and mice (Calipari et al., 2017; Nicolas et al., 2019).

Well done. Except why cite papers where there are oestrous differences without similarly citing cases where there are no oestrous differences? It sets up a bias that has the potential to undercut the more correct way they start Section 5.5.

My concern with all of this is not the general support for SABV. I like that. I am concerned first that it will be toothless, in the sense that studies which include SABV will not be prioritized and some, not all, authors will be allowed to get away with thin rationales. This is not unique to BJP; I suspect the NIH is failing hard at this as well. And without incentives (easier acceptance of manuscripts, better grant odds) or punishments (auto rejects, grant triages), behavior won’t change, because the other incentives (faster movement on “real” effects and designs) will dominate.

There’s an interesting issue of Pharmacology, Biochemistry & Behavior that is focused on Reproducibility of animal models for neuropsychiatric diseases.

Reading through the articles I am struck by how this effort is like throwing a bucket of water on a barn blaze. You might think that a “reproducibility” paper would answer a lot of questions. They do. But they also raise more and more and more questions. Take the following from Richetto et al., Effects of light and dark phase testing on the investigation of behavioural paradigms in mice: Relevance for behavioural neuroscience.

This is a pretty duck-soup behavioral pharmacology assay: inject an experimental subject with drugs and see what happens. In this case, inject a mouse with amphetamine and see how much it runs around. (Nice feature – N=6 male and 6 female per group, SABV orthodox, yo!)

Figure 5a from Richetto et al., 2019

Ok, so there’s a light cycle effect. I’ve blogged about that before (2015; 2018), and there was also a light cycle effect on cFos in Nucleus Accumbens and midbrain which did not interact with an effect of amphetamine treatment. So. Whether or not light cycle affects replication in this narrow instance depends on whether you are interested in quantitative differences or relative differences. The behavioral curves are more or less the same to a first approximation. But what’s missing here? Threshold, for one thing. This is a single dose of amphetamine (2.5 mg/kg, i.p.). What happens at lower or higher doses? At some point you fail to distinguish a drug effect from vehicle…ooops. Where’s the vehicle control? An injection prior to the amphetamine, characterized for only 20 minutes. Where are the mRNA controls coming from? Wait, that whole experiment was 30 min after the saline or drug injection. What about later time-points (the 25th time bin) when the behavioral difference emerged?

Crap, back to the behavioral control. Why not run the saline injections in parallel all the way out to the 25th time bin? Because what you would find is that in the light cycle the animals basically go to sleep. Wait, they are testing in the dark, right? …back to the paper. OMG, it doesn’t really directly say, and all we have to go on is testing in the dark versus light phase of the vivarium cycle. Another factor, gaaaah. Testing in the dark or light versus the circadian cycle.

The point is not to ask why they didn’t test absolutely everything but to point out that even a fairly effortful “replication” study of an exceptionally simple phenomenon gets complicated in a huge hurry.

Light still matters

July 2, 2018

In the midst of all this hoopla about reliability, repeatability, the replication crisis and whatnot, the Editorial Board of the Journal of Neuroscience has launched an effort to recommend best practices. The first one was about electrophysiology. To give you a flavor:

There is a long tradition in neurophysiology of using the number of neurons recorded as the sample size (“n”) in statistical calculations. In many cases, the sample of recorded neurons comes from a small number of animals, yet many statistical analyses make the explicit assumption that a sample constitutes independent observations. When multiple neurons are recorded from a single animal, however, either sequentially with a single electrode or simultaneously with multiple electrodes, each neuron’s activity may not, in fact, be independent of the others. Thus, it is important for researchers to account for variability across subjects in data analyses.

I emphasize the “long tradition” part because clearly the Editorial Board does not mean for this effort to just nibble around the edges. It is going straight at some long-used practices that they think need to change.
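
If you want to see what “accounting for variability across subjects” might look like in practice, here is one minimal sketch: a mixed-effects model with animal as a random effect, instead of treating every neuron as an independent n. This is not the editorial’s prescribed method, and the data file and column names are hypothetical.

```python
# Sketch: neurons are nested within animals, so model animal as a random effect
# rather than treating every neuron as an independent observation.
# The file and column names ("rate", "condition", "animal") are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("spike_rates.csv")  # one row per recorded neuron (hypothetical)

# The "long tradition": every neuron counts as an independent n.
naive = smf.ols("rate ~ condition", data=df).fit()

# Mixed model: a random intercept per animal absorbs between-animal variability.
mixed = smf.mixedlm("rate ~ condition", data=df, groups=df["animal"]).fit()

print(naive.summary())
print(mixed.summary())
```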

There was a long and very good twitter thread from someone which dealt in part with unreliability relating to when one chooses to conduct behavioral tasks in rodents with respect to their daily light cycle. As a reminder, rodents are nocturnal and are most active (aka “awake”) in the dark. Humans, as a reminder, are not. So, as you might imagine, there is a lot of rodent research (including behavioral research) that fails to grasp this difference and simply runs the rats in their light cycle. Also known as their inactive part of the day. Aka. “asleep”.

I am being totally honest when I say that the response has been astonishing to me. The pushback!

It’s totally surprising that we not only got a lot of “it doesn’t matter” responses but also a lot of implication that running in the light cycle is actually better (without quite saying so directly). I’m not going to run down everything but players include @svmahler, @StephenMaren, @aradprof, @DennisEckmeier, @jdavidjentsch, and @sleepwakeEEG.

There are just too many ludicrous things being said to characterize them all. But one species of argument is “it doesn’t matter [for my endpoint]”. The last part is implied. But early in this thread I posted a link to my prior post which discusses two of my favorite papers on this topic. Scheving et al., 1968 showed a fourfold difference in mortality rate after a single dose of amphetamine depending on when it was administered. Roberts and colleagues showed that cocaine self-administration changes all across the day in a very nice circadian pattern. I also noted a paper I had discussed very indirectly in a post on contradicting your own stuff. Halberstadt and colleagues (2012) played around with some variables in a very old model from the Geyer lab and found that time of day interacted with other factors to change results in a rat locomotor assay. I mean c’mon, how many thousands of papers use locomotor assays to assess psychomotor stimulant drugs?

There’s some downshifting and muttering in the tweet discussion along the lines of “well if it doesn’t matter who cares”, but nobody has started posting published papers showing where light cycle doesn’t matter for their assays (as a main factor or as an interaction). Yet. I’m sure it is just an oversight. Interestingly, the tone of this seems to be arguing that it is ridiculous to expect people to do their rat assays in reverse light unless it is proven (I guess by someone else?) that it changes results.

This, my friends, is very much front and center in the “reproducibility crisis” that isn’t. Let us return to the above comment at J Neuro about “long traditions”. Do you know how hard it is to fight long traditions in scientific subareas? Sure you do. Trying to get funded, or publish resulting studies, that deal with the seemingly minor choices that have been made for a long time is very difficult. Boring and incremental. Some of these things will come out to be negative, i.e., it won’t matter what light cycle is used. Good luck publishing those! It’s no coincidence that the aforementioned Halberstadt paper is published in a very modest journal. So we end up with a somewhat random assortment of some people doing their work in the animals’ inactive period and some in the active period. Rarely is there a direct comparison (i.e., within lab). So who knows what contribution that is….until you try to replicate it yourself. Wasting time and money and adding potential interactions…..very frustrating.

So yes, we would like to know it all, kind of like we’d like to know everything in male and female animals. But we don’t. The people getting all angsty over their choice to run rodents in the light next tried the ploy of backing and filling with a “can’t we all get along” type of approach that harmonizes with this sentiment. They aren’t wrong, exactly. But let us return to the J Neuro Editorial effort on best practices. There IS a best option here, if we are not going to do it all. There’s a slope in your choice of default versus checking the other. And for behavioral experiments that are not explicitly looking at sleepy rats or mice, the best option is running in their active cycle.

There is lots of fun ridiculousness in the thread. I particularly enjoyed the argument that because rats might be exposed briefly to light in the course of trying to do reverse-cycle experiments, we should just default to light cycle running. Right? Like if you walk from the light into a darkened room you suddenly fall asleep? Hell no. And if you are awakened by a light in your face in the middle of the night you are suddenly as awake as if it were broad noon? HAHAHHAHAA! I love it.

Enjoy the threads. Click on those tweeter links above and read the arguments.

__
Roberts DC, Brebner K, Vincler M, Lynch WJ. Patterns of cocaine self-administration in rats produced by various access conditions under a discrete trials procedure. Drug Alcohol Depend. 2002 Aug 1;67(3):291-9. [PubMed]

Scheving LE, Vedral DF, Pauly JE. Daily circadian rhythm in rats to D-amphetamine sulphate: effect of blinding and continuous illumination on the rhythm. Nature. 1968 Aug 10;219(5154):621-2. [PubMed]

Self-plagiarism

June 8, 2018

A journal has recently retracted an article for self-plagiarism.

Just going by the titles, this appears to be a case where review or theory material is published over and over in multiple venues.

I may have complained on the blog once or twice about people in my fields of interest that publish review after thinly updated review year after year.

I’ve seen one or two people use this strategy, in addition to a high rate of primary research articles, to blanket the world with their theoretical orientations.

I’ve seen a small cottage industry do the “more reviews than data articles” strategy for decades in an attempt to budge the needle on a therapeutic modality that shows promise but lacks full financial support from, e.g., the NIH.

I still don’t believe “self-plagiarism” is a thing. To me plagiarism is stealing someone else’s ideas or work and passing them off as one’s own. When art critics see themes from prior work being perfected or included or echoed in the masterpiece, do they scream “plagiarism”? No. But if someone else does it, that is viewed as copying. And lesser. I see academic theoretical and even interpretive work in this vein*.

To my mind the publishing industry has a financial interest in this conflation because they are interested in novel contributions that will presumably garner attention and citations. Work that is duplicative may be seen as lesser because it divides up citation to the core ideas across multiple reviews. Given how the scientific publishing industry leeches off content providers, my sympathies are…..limited.

The complaint from within the house of science, I suspect, derives from a position of publishing fairness? That some dude shouldn’t benefit from constantly recycling the same arguments over and over? I’m sort of sympathetic to this.

But I think it is a mistake to give in to the slippery slope of letting the publishing industry establish this concept of “self-plagiarism”. The risks for normal science pubs that repeat methods are too high. The risks for “replication crisis” solutions are too high: after all, a substantial replication study would require duplicative Introductory and interpretive comment, would it not?

__

*although “copying” is perhaps unfair and inaccurate when it comes to the incremental building of scientific knowledge as a collaborative endeavor.

A new Op/Ed in PNAS takes up the reproducibility crisis.

A. D Redish, E. Kummerfeld, R. L. Morris, A. Love (2018) “Opinion: Reproducibility failures are essential to scientific inquiry” PNAS 115(20):5042-5046. [Journal Site]

Takeaway quote from the Abstract

Most importantly, these proposed policy changes ignore a core feature of the process of scientific inquiry that occurs after reproducibility failures: the integration of conflicting observations and ideas into a coherent theory.

As you will see, they had me at:

In many of these cases, what have been called “failures to replicate” are actually failures to generalize across what researchers hoped were inconsequential changes in background assumptions or experimental conditions

(Oh, wait, they cited me! AHAHAA, of course I like this thing!)

Seriously though, this is good stuff. Go read. Bookmark to forward to anyone who starts in on how there is a reproducibility “crisis”.

The Wikipedia entry on Generalization reads:

A generalization (or generalisation) is the formulation of general concepts from specific instances by abstracting common properties.

This is a very clean description of what many scientists think that they are about. I certainly do. I think that we are trying to use our experiments as specific instances from which to identify concepts and phenomena that have common properties with other situations not currently being tested. Thus our results should, we hope, generalize as predictions of what will happen in other situations.

Usually situations related to human health and behavior.

A recent paper by Voelkl and colleagues talks about this but totally borks the framing and terminology. They continually misuse “reproducibility” when they really mean to refer to generalization. And this harms science.

First, a quick overview. What Voelkl et al. present is a study built on meta-analysis of published studies. This technique pools a host of different studies which use approximately the same methods to address approximately the same question. The outcome of such a meta-analysis can tell us if a given qualitative interpretation is more likely to be true than not (think of it as a box score of the outcomes, weighted by some qualities of the specific studies) and estimate the effect size (the distance of the mean effect relative to the variation expected; Cohen’s d is the version most comprehensible to me).
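
For anyone rusty on the formula, here is Cohen’s d in a few lines. The example numbers are invented and have nothing to do with Voelkl et al.

```python
# Cohen's d: difference in group means scaled by the pooled standard deviation.
import numpy as np

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Invented example data: treated vs. control values of some endpoint.
treated = np.array([42.0, 38.0, 51.0, 47.0, 44.0, 40.0])
control = np.array([33.0, 29.0, 36.0, 31.0, 35.0, 30.0])
print(f"Cohen's d = {cohens_d(treated, control):.2f}")
```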

As you can imagine, in a typical meta-analysis the studies vary quite a bit in detail. Perhaps it is the strain of rat being used. Or the sex. Or the light cycle the studies were run in. Perhaps it is the time of year or the humidity of the building. And most frequently there is variation in the scientists who are doing the investigating.

Meta-analysis is a test of generalization!

This is critical.

The big point in the Voelkl paper is that individual papers which include data sets on the same topic from multiple labs are more closely correlated with the meta-analytic result. As the authors put it in the Discussion:

Using simulated sampling, we compared the outcomes of single- and multi-laboratory studies, using the same overall number of animals, in terms of their accuracy of effect size estimates (pc) and FNR. For these simulations, we chose to use a large sample of published data from preclinical studies to guarantee that the results reflect real-life conditions. We found that pc increased substantially with the number of participating laboratories, without causing a need for larger sample sizes. This demonstrates that using more representative study samples through multi-laboratory designs improves the external validity and reproducibility of preclinical animal research.

Well, no shit Sherlock. A multi-laboratory study is already a test of generalization. It says that the same qualitative interpretation can be drawn from the study regardless of variation in laboratory, personnel and probably some other key variables. Since this is also what the meta-analysis is testing, it is no surprise whatever that this would be the result.
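
Here is a toy simulation of that logic, to make the mechanics concrete. This is not Voelkl et al.’s actual method; the lab-to-lab spread, group sizes and effect size are all invented.

```python
# Toy simulation: same total N per arm, drawn either from one lab or spread
# across several labs. Each lab has its own idiosyncratic treatment effect;
# only the multi-lab design averages over that between-lab variation.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 1.0   # population-average treatment effect, in SD units
lab_sd = 0.5        # spread of lab-specific idiosyncrasies in that effect
total_n = 24        # animals per arm, held constant across designs
n_sims = 2000

def mean_abs_error(n_labs):
    errors = []
    for _ in range(n_sims):
        # Each lab's true effect deviates from the population average.
        lab_effects = rng.normal(true_effect, lab_sd, n_labs)
        per_lab = total_n // n_labs
        diffs = []
        for eff in lab_effects:
            ctrl = rng.normal(0.0, 1.0, per_lab)
            trt = rng.normal(eff, 1.0, per_lab)
            diffs.append(trt.mean() - ctrl.mean())
        errors.append(abs(np.mean(diffs) - true_effect))
    return np.mean(errors)

for labs in (1, 2, 4, 8):
    print(f"{labs} lab(s): mean absolute error of effect estimate = {mean_abs_error(labs):.3f}")
```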

But. These authors use “reproducibility”. The Wikipedia entry on this topic is a disaster which conflates several key issues together, most pertinently generalization, reproducibility and replicability. It starts out okay:

Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers.

Absolutely. Reproducibility is indeed the ability to reach the same conclusion (inferences) based on doing everything just like the other researchers did it. Great. It then immediately goes off the rails:

A related concept is replicability, meaning the ability to independently achieve non identical conclusions that are at least similar, when differences in sampling, research procedures and data analysis methods may exist.

What? That sounds more like a flexible version of reproducibility. If I had to try to parse out a difference for replicability, I might observe that the term “replicates” gives us a clue, as it does further down in the Wikipedia entry, which now conflates the term repeatable with replicable.

The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. … Although they are often confused, there is an important distinction between replicates and an independent repetition of an experiment. Replicates are performed within an experiment.

Seriously, who is editing this thing? Replicable now equals repeatable, which means asking whether all the subjects in your sample are doing the same thing, more or less. I can get behind this needing a separate term, but can we just pick one please? And not confuse that with the issue of whether the scientific result (“inference”) can be reproduced or will generalize?

Back to reproducibility.

A particular experimentally obtained value is said to be reproducible if there is a high degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people

See how they’ve immediately diverged? This may or may not be generalization, depending on what you call “replicate specimens”. To most eyes this means a whole different experiment, which is for sure a test of generalization. Well, the entry immediately makes clear that the intent is to conflate:

in science, a very well reproduced result is one that can be confirmed using as many different experimental setups as possible and as many lines of evidence as possible

The line about “as many different setups as possible” is the essence of generalization. And if that isn’t enough confusion, this sentence brings in converging evidence, which is another concept entirely!

Back to Voelkl et al.:

our results suggest that eliminating these and other risks of bias (e.g., low statistical power, analytical flexibility) is not sufficient to guarantee reproducibility; the results will remain idiosyncratic to the specific laboratory conditions unless these conditions are varied.

“Idiosyncratic” here means reproducible. It means that if you keep the conditions identical, you should be able to repeat the experiment over and over and come up with the same approximate finding (“inference”). This finding can be endlessly reproducible, be built on experiments that are highly replicable within the samples and still fail to generalize beyond the idiosyncratic way that a given lab chooses to run the experiment.

So why do I say this failure to be clear about what we mean harms science?

Well, we are deep in the midst of much furor about a “reproducibility crisis” in science. There isn’t one. Or at least if there is one, it has not been demonstrated clearly. The low-grade annoyance of writing and reviewing the NIH grant section on Rigor is not a huge deal (at least it hasn’t been for me so far). But it is yet another thing for people to use to beat up grants, possibly for no good reason. On the other end of the scale, this will eventually be grist for conservative Congress Critters trying to cut investment in research. Somewhere in between lies the goal of the BigPharma voices promoting the lie so as to further offload their research and development costs onto the public purse.

The more immediate problem is that if we are not clear about what we mean in this discussion, our solutions will never solve anything, and may even hurt. I believe that to some extent people are indeed accusing science of having a reproducibility problem. Meaning, one assumes, that significant amounts of published work come to inferences that cannot be sustained if the experiments are done in exactly the same way. The solution for this, one deduces, can only be that each lab must perform many replicate experiments to provide improved confidence in reproducibility prior to publishing. “Make those sloppy bastards repeat it six times and I won’t have to work so hard to figure out how to get my experiment working”, goes the thinking. I guess.

One interpretation of what Voelkl and colleagues are saying is that this won’t help at all.

Besides known differences between the studies included in our analysis, such as the species or strain of animals (i.e., genotype) or reported differences in animal husbandry and experimental procedures, sources of variation included also many unknown and unknowable differences, such as the influence of the experimenter [38,39] or the microbiome [40], as well as subtle differences in visual, olfactory, and auditory stimulation. All those factors might affect treatment effects. Multi-laboratory designs are ideal to account for all of these sources of between-laboratory variation and should therefore replace standardized single-laboratory studies as the gold standard for late-phase preclinical trials

If we don’t do work in a way that can test how well a conclusion generalizes across these issues, we will never solve the real problem. We will not know the limits of said generalization (it is not one thing, btw), the key experimental factors and the irrelevant detail. Instead we will continue to promote a collection of arbitrary and highly constrained experimental parameters and talk as if surely our results will generalize to a treatment medication for humans in rapid order.

In point of fact working to improve reproducibility (as we all do!) may be directly opposed to improving generalization and thereby compromise translation to helping improve human health.

And despite where people in science are pointing the finger of blame (i.e., the reproducibility of inferences that we can make using precisely the same approaches), they are really motivated and angered by the lack of generalization.

Seriously, listen to what the scientists who are eager to be puppeted by Big Pharma have to say. Listen to their supposed examples that show “the problem is real”. Look at what makes them really mad. Ask about their attempts to perform experiments related to the ones in the published literature that anger them so much. You will be more likely to conclude that they are not in fact miffed about directly reproducing a result. More often it is a failure to generalize beyond the original experimental conditions.

__
Voelkl B, Vogt L, Sena ES, Würbel H (2018) Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol 16(2): e2003693. https://doi.org/10.1371/journal.pbio.2003693

If you believe there is a “replication and reproducibility crisis” in science, you have limited proposal options.

Multiple independent replications of each phenomenon in a paper. Not just n-up, but triplicate or more full-power repeats.

Are you demanding this? Are you doing this in your own research?

Or, we create some way to encourage and enhance direct replications and simple tests of generalization from other labs or in subsequent papers.

I favor this one.

As it happens, I have had multiple instances in the past few years which address this, as both an author and as a reviewer.

Editors and reviewers for decidedly modest JIF journals are overtly and explicitly saying replications and simple tests of generalization of a finding should not be published.

I can’t stress this enough. We’re not talking “this isn’t cool enough for this 2.5 JIF journal”. These are opinions that such things do not merit publication at all.

Findings that fail to replicate a prior finding (that is actually poorly supported) *simultaneously* take heat for not getting the same result.

Direct replication is too incremental and refutation / modification is too…..doubtful?

As my longer term Readers know, I tend to think this is just the way science works. If you keep at it your manuscript will find a home eventually. It is a PIA but it is not damning of the entire enterprise.

But if there is any validity to the reproducibility claims and you keep braying on about it…I want to know a lot more about how your reviewing behavior matches your fine talk.

I’m also looking forward to NIH grant review in about 3-4 years. We should be expecting the experienced PI to have some specific and concrete examples about their rigor.

Even if it is “this preprint shows our replication attempts, even if idiotic reviewers prevented them from being published”.

How many times do we see the publication of a group effect in an animal model that is really just a failure to replicate? Or a failure to completely replicate?

How many of those sex-differences, age-differences or strain-differences have been subjected to replication?

A report by Begley and Ellis, published in 2012, was hugely influential in fueling current interest and dismay about the lack of reproducibility in research. In their original report the authors claimed that the scientists of Amgen had been unable to replicate 47 of 53 studies.

Over the past decade, before pursuing a particular line of research, scientists (including C.G.B.) in the haematology and oncology department at the biotechnology firm Amgen in Thousand Oaks, California, tried to confirm published findings related to that work. Fifty-three papers were deemed ‘landmark’ studies (see ‘Reproducibility of research findings’). It was acknowledged from the outset that some of the data might not hold up, because papers were deliberately selected that described something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Nevertheless, scientific findings were confirmed in only 6 (11%) cases. Even knowing the limitations of preclinical research, this was a shocking result.

Despite the limitations identified by the authors themselves, this report has taken on a life of truthy citation, as if most or all biomedical science reports cannot be replicated.

I have remarked a time or two that this is ridiculous on the grounds the authors themselves recognize, i.e., a company trying to skim the very latest and greatest results for intellectual property and drug development purposes is not reflective of how science works. Also on the grounds that until we know exactly which studies and what they mean by “failed to replicate” and how hard they worked at it, there is no point in treating this as an actual result.

At first, the authors refused to say which studies or results were meant by this original population of 53.

Now we have the data! They have reported their findings! Nature announces breathlessly that Biotech giant publishes failures to confirm high-profile science.

Awesome. Right?

Well, they published three of them, anyway. Three. Out of fifty-three alleged attempts.

Are you freaking kidding me, Nature? And you promote this like we’re all cool now? We can trust their original allegation that 47/53 studies were unreplicable?

These are the data that have turned ALL OF NIH UPSIDE DOWN WITH NEW POLICY FOR GRANT SUBMISSION!

Christ what a disaster.

I look forward to hearing from experts in the respective fields these three papers inhabit. I want to know how surprising it is to them that these forms of replication failure occurred. I want to know the quality of the replication attempts and the nature of the “failure”: was it actually a failure, or was it a failure to generalize in the way that would be necessary for a drug company’s goals? Etc.

Oh and Amgen? I want to see the remaining 50 attempts, including the positive replications.
__

Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012 Mar 28;483(7391):531-3. doi: 10.1038/483531a.

The BJP has decided to require that manuscripts submitted for publication adhere to certain experimental design standards. The formulation can be found in Curtis et al., 2015.

Curtis MJ, Bond RA, Spina D, Ahluwalia A, Alexander SP, Giembycz MA, Gilchrist A, Hoyer D, Insel PA, Izzo AA, Lawrence AJ, MacEwan DJ, Moon LD, Wonnacott S, Weston AH, McGrath JC. Experimental design and analysis and their reporting: new guidance for publication in BJP. Br J Pharmacol. 2015 Jul;172(14):3461-71. doi: 10.1111/bph.12856 [PubMed]

Some of this continues the “huh?” response of this behavioral pharmacologist, who publishes in a fair number of similar journals. In other words, YHN is astonished this stuff is not just a default part of the editorial decision-making at BJP in the first place. The items that jump out at me include the following (paraphrased):

2. You should shoot for a group size of N=5 or above and if you have fewer you need to do some explaining.
3. Groups smaller than 20 should be of equal size, and any variation from equal sample sizes needs to be explained, particularly exclusions or unintended loss of subjects.
4. Subjects should be randomized to groups and treatment order should be randomized (see the sketch after this list).
6.-8. Normalization and transformation should be well justified and follow acceptable practices (e.g., you can’t compare a treatment group to the normalization control that now has no variance because of this process).
9. Don’t confuse analytical replicates with experimental replicates in conducting analysis.
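
As promised above, a minimal sketch of what satisfying item 4 could look like in practice. The subject IDs, group labels and treatment names are all made up.

```python
# Minimal sketch of item 4: randomize subjects to groups and randomize
# treatment order. Subject IDs and treatment labels are hypothetical.
import random

random.seed(42)  # record the seed so the allocation itself is reproducible

subjects = [f"rat{i:02d}" for i in range(1, 21)]   # 20 subjects
random.shuffle(subjects)
groups = {"vehicle": subjects[:10], "drug": subjects[10:]}

treatment_order = ["baseline", "low_dose", "high_dose"]
per_subject_order = {s: random.sample(treatment_order, k=len(treatment_order))
                     for s in subjects}

print(groups)
print(per_subject_order["rat01"])
```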

Again, these are the “no duh!” issues in my world. Sticky peer review issues quite often revolve around people trying to get away with violating one or another of these things. At the very least reviewers want justification in the paper, which is a constant theme in these BJP principles.

The first item is a pain in the butt but not much more than make-work.

1. Experimental design should be subjected to ‘a priori power analysis’….latter requires an a priori sample size calculation that should be included in Methods and should include alpha, power and effect size.

Of course, the trouble with power analysis is that it depends intimately on the source of your estimates for effect size, generally pilot or prior experiments. But you can select basically whatever you want as your assumption of effect size to demonstrate a range of sample sizes as acceptable. Also, you can select whatever level of power you like, within reasonable bounds along the continuum from “Good” to “Overwhelming”. I don’t think there are very clear and consistent guidelines here.
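
To make that wiggle room concrete, here is what an a priori sample size calculation might look like: a sketch assuming a simple two-group t-test design, with placeholder effect sizes (which is exactly where the flexibility lives).

```python
# A priori sample size for a two-group comparison, of the kind item 1 requires.
# The candidate effect sizes are placeholders -- swap in whatever your pilot
# data suggest, which is precisely the wiggle room discussed above.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.5, 0.8, 1.2):   # plausible-looking effect size assumptions
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative="two-sided")
    print(f"d = {d}: about {n:.1f} subjects per group")
# The answer swings from roughly 64 to roughly 12 per group depending on the assumption.
```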

The fifth one is also going to be tricky, in my view.

Assignment of subjects/preparations to groups, data recording and data analysis should be blinded to the operator and analyst unless a valid scientific justification is provided for not doing so. If it is impossible to blind the operator, for technical reasons, the data analysis can and should be blinded.

I just don’t see how this is practical with a limited number of people running experiments in a laboratory. There are places this is acutely important, such as when human judgement/scoring measures are the essential data. Sure. And we could all stand to do with a reminder to blind a little more and a little more completely. But this has disaster written all over it. Some peers doing essentially the same assay are going to disagree over what is necessary and “impossible” and what is a valid scientific justification.

The next one is a big win for YHN. I endorse this. I find the practice of reporting any p value other than your lowest threshold to be intellectually dishonest*.


10. When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups.

I’m going to be very interested to see how the community of BJP accepts* this.

Finally, a curiosity.

11. After analysis of variance, post hoc tests may be run only if F achieves the necessary level of statistical significance (i.e. P < 0.05) and there is no significant variance inhomogeneity.

People run post-hocs after a failure to find a significant main effect on the ANOVA? Seriously? Or are we talking about whether one should run all possible comparison post-hocs in the absence of an interaction? (seriously, when is the last time you saw a marginal-mean post-hoc used?) And isn’t this just going to herald the return of the pre-planned comparison strategy**?
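
For what it’s worth, the gate item 11 describes amounts to something like the following sketch. The data are invented, and Levene’s test stands in here for whatever variance-homogeneity check the authors intend.

```python
# Item 11 as a gate: run the omnibus ANOVA first, and only proceed to post hoc
# pairwise comparisons if F is significant and variances look homogeneous.
# Data and group labels are invented.
import numpy as np
from scipy.stats import f_oneway, levene
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
vehicle = rng.normal(10, 2, 8)
low = rng.normal(12, 2, 8)
high = rng.normal(15, 2, 8)

lev_stat, lev_p = levene(vehicle, low, high)   # homogeneity of variance check
f_stat, p = f_oneway(vehicle, low, high)       # omnibus one-way ANOVA

if p < 0.05 and lev_p >= 0.05:
    values = np.concatenate([vehicle, low, high])
    labels = ["vehicle"] * 8 + ["low"] * 8 + ["high"] * 8
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))
else:
    print(f"Omnibus F n.s. (p = {p:.3f}) or variances unequal; no post hocs.")
```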

Anyway, I guess I’m saying kudos to BJP for putting down their marker on these design and reporting issues. Sure, I thought many of these were already the necessary standards. But clearly there are a lot of people skirting around many of these in publications, specifically in BJP***. This new requirement will stiffen the spine of reviewers and editors alike.

__
*N.b. I gave up my personal jihad on this many years ago after getting exactly zero traction in my scientific community. I.e., I had constant fights with reviewers over why my p values were all “suspiciously” p<0.05, and no backup from editors when I tried to slip this concept into reviews.

**I think this is possibly a good thing.

***A little birdy who should know claimed that at least one AE resigned or was booted because they were not down with all of these new requirements.

Thought of the day

December 5, 2014

One thing that always cracks me up about manuscript review is the pose struck* by some reviewers that we cannot possibly interpret data or studies that are not perfect.

There is a certain type of reviewer that takes the stance* that we cannot in any way compare treatment conditions if there is anything about the study that violates some sort of perfect, Experimental Design 101 framing, even if there is no reason whatsoever to suspect a contaminating variable. Even if, and this is more hilarious, there are reasons in the data themselves to think that there is no effect of some nuisance variable.

I’m just always thinking….

The very essence of real science is comparing data across different studies, papers, paradigms, laboratories, etc and trying to come up with a coherent picture of what might be a fairly invariant truth about the system under investigation.

If the studies that you wish to compare are in the same paper, sure, you’d prefer to see less in the way of nuisance variation than you expect when making cross-paper comparisons. I get that. But still….some people.

Note: this in some way relates to the alleged “replication crisis” of science.
__
*having nothing to go on but their willingness to act like the manuscript is entirely uninterpretable and therefore unpublishable, I have to assume that some of them actually mean it. Otherwise they would just say “it would be better if…”. Right?

Ok, ok, I have no actual data on this. But if I had to pick one thing in substance abuse science that has been most replicated it is this.

If you surgically implant a group of rats with intravenous catheters, hook them up to a pump which can deliver small infusions of saline adulterated with cocaine HCl and make these infusions contingent upon the rat pressing a lever…

Rats will intravenously self-administer (IVSA) cocaine.

This has been replicated ad nauseum.

If you want to pass a fairly low bar to demonstrate you can do a behavioral study with accepted relevance to drug abuse, you conduct a cocaine IVSA study [Wikipedia] in rats. Period.

And yet. There are sooooo many ways to screw it up and fail to replicate the expected finding.

Note that I say “expected finding” because we must include significant quantitative changes along with the qualitative ones.

Off the top of my head, the types of factors that can reduce your “effect” to a null effect, or change the outcome to the extent that even a statistically significant result isn’t really the effect you are looking for, etc.:

  • Catheter diameter or length
  • Cocaine dose available in each infusion
  • Rate of infusion/concentration of drug
  • Sex of the rats
  • Age of rats
  • Strain of the rats
  • Vendor source (of the same nominal strain)
  • Time of day in which rats are run (not just light/dark* either)
  • Food restriction status
  • Time of last food availability
  • Pair vs single housing
  • “Enrichment” that is called-for in default guidelines for laboratory animal care and needs special exception under protocol to prevent.
  • Experimenter choice of smelly personal care products
  • Dirty/clean labcoat (I kid you not)
  • Handling of the rats on arrival from vendor
  • Fire-alarm
  • Cage-change day
  • Minor rat illness
  • Location of operant box in the room (floor vs ceiling, near door or away)
  • Ambient temperature of vivarium or test room
  • Schedule- weekends off? seven days a week?
  • Schedule- 1 hr? 2hr? 6 hr? access sessions
  • Schedule- are reinforcer deliveries contingent upon one lever press? five? does the requirement progressively increase with each successive infusion?
  • Animal loss from the study for various reasons

As you might expect, these factors interact with each other in the real world of conducting science. Some factors you can eliminate, some you have to work around and some you just have to accept as contributions to variability. Your choices depend, in many ways, on your scientific goals beyond merely establishing the IVSA of cocaine.

Up to this point I’m in seeming agreement with that anti-replication yahoo, am I not? Jason Mitchell definitely agrees with me that there are a multitude of ways to come up with a null result.

I am not agreeing with his larger point. In fact, quite the contrary.

The point I am making is that we only know this stuff because of attempts to replicate! Many of these attempts were null and/or might be viewed as a failure to replicate some study that existed prior to the discovery that Factor X was actually pretty important.

Replication attempts taught the field more about the model, which allowed investigators of diverse interests to learn more about cocaine abuse and, indeed, drug abuse generally.

The heavy lifting in discovering the variables and outcomes related to rat IVSA of cocaine took place long before I entered graduate school. Consequently, I really can’t speak to whether investigators felt that their integrity was impugned when another study seemed to question their own work. I can’t speak to how many “failure to replicate” studies were discussed at conferences and less formal interactions. But given what I do know about science, I am confident that there was a little bit of everything. Probably some accusations of faking data popped up now and again. Some investigators no doubt were considered generally incompetent and others were revered (sometimes unjustifiably). No doubt. Some failures to replicate were based on ignorance or incompetence…and some were valid findings which altered the way the field looked upon prior results.

Ultimately the result was a good one. The rat IVSA model of cocaine use has proved useful to understand the neurobiology of addiction.

The incremental, halting, back and forth methodological steps along the path of scientific exploration were necessary for lasting advance. Such processes continue to be necessary in many, many other aspects of science.

Replication is not an insult. It is not worthless or a-scientific.

Replication is the very lifeblood of science.

__
*rats are nocturnal. Check out how many studies**, including behavioral ones, are run in the light cycle of the animal.

**yes to this very day, although they are certainly less common now