Generalization, not “reproducibility”
February 26, 2018
The Wikipedia entry on Generalization reads:
A generalization (or generalisation) is the formulation of general concepts from specific instances by abstracting common properties.
This is a very clean description of what many scientists think that they are about. I certainly do. I think that we are trying to use our experiments as specific instances from which to identify concepts and phenomena that have common properties with other situations not currently being tested. Thus our results should, we hope, generalize as predictions of what will happen in other situations.
Usually situations related to human health and behavior.
A recent paper by Voelkl and colleagues talks about this but totally borks the framing and terminology. They continually misuse “reproducibility” when they really mean to refer to generalization. And this harms science.
First, a quick overview. What Voelkl et al. present is a meta-analysis of published studies. This technique pools a host of different studies that use approximately the same methods to address approximately the same question. The outcome of such a meta-analysis can tell us whether a given qualitative interpretation is more likely to be true than not (think of it as a box score of the outcomes, weighted by certain qualities of the specific studies) and can estimate the effect size (the distance of the mean effect relative to the expected variation; Cohen’s d is the version most comprehensible to me).
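For the curious, here is a minimal sketch of what that effect size amounts to in code; the function and the control/treated numbers are my own illustration and have nothing to do with the Voelkl data set.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical control vs. treated measurements, purely for illustration.
control = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
treated = [11.0, 10.7, 11.3, 10.9, 11.1, 10.8]
print(f"Cohen's d = {cohens_d(control, treated):.2f}")
```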
As you can imagine, in a typical meta-analysis the studies vary quite a bit in detail. Perhaps it is the strain of rat being used. Or the sex. Or the light cycle the studies were run in. Perhaps it is the time of year or the humidity of the building. And most frequently there is variation in the scientists who are doing the investigating.
Meta-analysis is a test of generalization!
This is critical.
The big point in the Voelkl paper is that individual papers which include data sets on the same topic from multiple labs are more closely correlated with the meta-analytic result. As the authors put it in the Discussion:
Using simulated sampling, we compared the outcomes of single- and multi-laboratory studies, using the same overall number of animals, in terms of their accuracy of effect size estimates (pc) and FNR. For these simulations, we chose to use a large sample of published data from preclinical studies to guarantee that the results reflect real-life conditions. We found that pc increased substantially with the number of participating laboratories, without causing a need for larger sample sizes. This demonstrates that using more representative study samples through multi-laboratory designs improves the external validity and reproducibility of preclinical animal research.
Well, no shit Sherlock. A multi-laboratory study is already a test of generalization. It says that the same qualitative interpretation can be drawn from the study regardless of variation in laboratory, personnel and probably some other key variables. Since this is also what the meta-analysis is testing, it is no surprise whatever that this would be the result.
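To make the point concrete, here is a toy simulation of my own (it is not the authors’ actual analysis, and the variance numbers are pure assumptions): each lab gets an idiosyncratic offset on top of a true treatment effect, and we compare spending the same total number of animals in one lab versus spreading them across several.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 1.0   # the effect we are trying to estimate
lab_sd = 0.5        # between-lab (idiosyncratic) variation -- assumed value
noise_sd = 1.0      # within-lab, animal-to-animal variation -- assumed value
n_total = 24        # same total number of animals in every design

def mean_abs_error(n_labs, n_sims=10_000):
    """Average error of the effect estimate when n_total animals are split across n_labs."""
    per_lab = n_total // n_labs
    errors = []
    for _ in range(n_sims):
        lab_offsets = rng.normal(0, lab_sd, n_labs)   # each lab's quirk
        lab_estimates = [true_effect + off + rng.normal(0, noise_sd, per_lab).mean()
                         for off in lab_offsets]
        errors.append(abs(np.mean(lab_estimates) - true_effect))
    return np.mean(errors)

for k in (1, 2, 4, 8):
    print(f"{k} lab(s): mean absolute error of the effect estimate = {mean_abs_error(k):.3f}")
```

The single-lab design inherits that one lab’s quirk in full; spreading the same total N across labs averages the quirks out, which is the whole point of the multi-laboratory design.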
But. These authors use “reproducibility”. The Wikipedia entry on this topic is a disaster which conflates several key issues together, most pertinently generalization, reproducibility and replicability. It starts out okay:
Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers.
Absolutely. Reproducibility is indeed the ability to reach the same conclusion (inferences) based on doing everything just like the other researchers did it. Great. It then immediately goes off the rails:
A related concept is replicability, meaning the ability to independently achieve non identical conclusions that are at least similar, when differences in sampling, research procedures and data analysis methods may exist.
What? That sounds more like a flexible version of reproducibility. If I had to parse out a distinct meaning for replicability, I might observe that the term “replicates” gives us a clue. As it does further down in the Wikipedia entry, which now conflates the term repeatable with replicable.
The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. … Although they are often confused, there is an important distinction between replicates and an independent repetition of an experiment. Replicates are performed within an experiment.
Seriously, who is editing this thing? Replicable now equals repeatable, which asks whether all the subjects in your sample are doing more or less the same thing. I can get behind this needing a separate term, but can we just pick one, please? And not confuse it with the issue of whether the scientific result (“inference”) can be reproduced or will generalize?
Back to reproducibility.
A particular experimentally obtained value is said to be reproducible if there is a high degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people
See how they’ve immediately diverged? This may or may not be generalization, depending on what you call “replicate specimens”. To most eyes this means a whole different experiment, which is for sure a test of generalization. Well, the entry immediately makes clear that the intent is to conflate:
in science, a very well reproduced result is one that can be confirmed using as many different experimental setups as possible and as many lines of evidence as possible
The line about “as many different setups as possible” is the essence of generalization. And if that isn’t enough confusion, this sentence brings in converging evidence, which is another concept entirely!
Back to Voelkl et al.:
our results suggest that eliminating these and other risks of bias (e.g., low statistical power, analytical flexibility) is not sufficient to guarantee reproducibility; the results will remain idiosyncratic to the specific laboratory conditions unless these conditions are varied.
“Idiosyncratic” here means reproducible. It means that if you keep the conditions identical, you should be able to repeat the experiment over and over and come up with the same approximate finding (“inference”). This finding can be endlessly reproducible, be built on experiments that are highly replicable within the samples and still fail to generalize beyond the idiosyncratic way that a given lab chooses to run the experiment.
So why do I say this failure to be clear about what we mean harms science?
Well, we are deep in the midst of much furor about a “reproducibility crisis” in science. There isn’t one. Or at least if there is one, it has not been demonstrated clearly. The low-grade annoyance of writing and reviewing the NIH grant section on Rigor is not a huge deal (at least it hasn’t been for me so far). But it is yet another thing for people to beat up grants over, possibly for no good reason. On the other end of the scale, this will eventually be grist for conservative Congress Critters trying to cut investment in research. Somewhere in between lies the goal of the BigPharma voices promoting the lie so as to further offload their research and development costs onto the public purse.
The more immediate problem is that if we are not clear about what we mean in this discussion, our solutions will never solve anything, and may even hurt. I believe that to some extent people are indeed accusing science of having a reproducibility problem. Meaning, one assumes, that significant amounts of published work come to inferences that cannot be sustained if the experiments are done in exactly the same way. The solution for this, one deduces, can only be that each lab must perform many replicate experiments to provide improved confidence in reproducibility prior to publishing. “Make those sloppy bastards repeat it six times and I won’t have to work so hard to figure out how to get my experiment working,” goes the thinking, I guess.
One interpretation of what Voelkl and colleagues are saying is that this won’t help at all.
Besides known differences between the studies included in our analysis, such as the species or strain of animals (i.e., genotype) or reported differences in animal husbandry and experimental procedures, sources of variation included also many unknown and unknowable differences, such as the influence of the experimenter [38,39] or the microbiome [40], as well as subtle differences in visual, olfactory, and auditory stimulation. All those factors might affect treatment effects. Multi-laboratory designs are ideal to account for all of these sources of between-laboratory variation and should therefore replace standardized single-laboratory studies as the gold standard for late-phase preclinical trials
If we don’t do work in a way that can test how well a conclusion generalizes across these issues, we will never solve the real problem. We will not know the limits of said generalization (it is not one thing, btw), the key experimental factors and the irrelevant detail. Instead we will continue to promote a collection of arbitrary and highly constrained experimental parameters and talk as if surely our results will generalize to a treatment medication for humans in rapid order.
In point of fact, working to improve reproducibility (as we all do!) may be directly opposed to improving generalization, and may thereby compromise translation into improved human health.
And despite where people in science are pointing the finger of blame (i.e., the reproducibility of inferences that we can make using precisely the same approaches), they are really motivated and angered by the lack of generalization.
Seriously, listen to what the scientists who are eager to be puppeted by Big Pharma have to say. Listen to their supposed examples that show “the problem is real”. Look at what makes them really mad. Ask about their attempts to perform experiments related to the ones in the published literature that anger them so much. You will be more likely to conclude that they are not in fact miffed about directly reproducing a result. More often it is a failure to generalize beyond the original experimental conditions.
__
Voelkl B, Vogt L, Sena ES, Würbel H (2018) Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol 16(2): e2003693. https://doi.org/10.1371/journal.pbio.2003693
February 27, 2018 at 12:56 am
Welcome back DM! Your long verbose post indicates that you are refreshed! Good on you!
February 27, 2018 at 7:21 am
Well said, DM!
The problem with the “reproducibility crisis” discussion is that it is working on the wrong timeline. What matters is exactly the generalizability of a topic. That takes a decade or more to determine, because different labs have to explore the space of the topic. “I see phenomenon X.” “I don’t.” The correct response is “why not?”, not “it’s not reproducible”. Determining what factors underlie a phenomenon is exactly what science is about, which is why science is so powerful a tool for society: it’s those factors that provide the knobs for engineering to control.
A paper is one element of a chain. Too many journals and too many scientists view the goal of a paper as a “complete story”. A paper is never a complete story. We recently had a paper rejected from a GlamourMag because it was the first paper on a topic, not the last. Our results set the stage for several R01s worth of work, and we were basically told (after months of arguing) that it wasn’t going to fly because we didn’t have the final answer.
I am very concerned that this whole “reproducibility crisis” is providing ammunition for the anti-science crowd to diminish the population’s trust in science. That would be truly devastating. Sure, whether chocolate is good for you or not might be changing from week to week, but we’ve had 30 years to figure out that climate change is real.
February 27, 2018 at 5:38 pm
I can’t speak to how reproducibility is being interpreted in the realm of grant review, or in your discipline.
I can agree that, insofar as Voelkl et al. are concerned about idiosyncratic results that are stable across replicate experiments in a single lab, that is properly a question of generalization, and that’s not a crisis, that’s just science.
The “reproducibility crisis” that I know of is caused by misusing statistical methods to do noise-mining. Run an experiment, ideally one with weak theory. Don’t use too large a sample size; that would be wasteful of resources. Between-subject experimental designs are easiest, so do that. Then look at the results and if something has a p-value below the threshold, write it up — look at that huge effect size, it’s earth shattering — and publish it, ideally along with a press release to maximize impact on the field. Otherwise run an experiment …
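To see how well that recipe works, here is a quick toy simulation (the sample sizes are assumed for illustration, not taken from any particular paper): the true effect is exactly zero, yet the “significant” experiments that get written up carry impressively large effect sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_experiments = 10, 1000
published = []

for _ in range(n_experiments):
    # Both groups are drawn from the SAME distribution: the true effect is zero.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:  # "write it up" only when the threshold is crossed
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published.append(abs(a.mean() - b.mean()) / pooled_sd)

print(f"'Significant' experiments: {len(published)} of {n_experiments}")
print(f"Mean published |Cohen's d|, despite a true effect of zero: {np.mean(published):.2f}")
```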
Lots of published work looks like that. The “chocolate is good for you” papers tend to look like that. “Priming” papers from psychology also tend to look like that. Some biomarker papers look like that.
Disciplines with stronger theory have less of this. If you’re in one of those disciplines, more power to you. If PIs and reviewers in your discipline truly understand the statistical methods they’re using, so noise is rarely published, more power to y’all.
But when people are doing noise mining, the problem isn’t that the results don’t generalize. The same lab may never repeat the experiment. Sometimes they might repeat it. And lo and behold if you “upgrade” your coding, discard outliers (this time, but not last time), and introduce a new mediator, yeah the results hold in the repeated experiment. And maybe for the third repeat there’s another new mediator…
Serious scientists — maybe even the vast majority of scientists in some disciplines — are diligent to rule out noise. They start with an exploratory experiment, then they follow up with some confirmation experiments where they aren’t adjusting things midstream. The experiment connects with pre-existing theory that really constrains the interpretation of the experiment. It motivates new theory that really will constrain future experiments. Only then do they make, and publicize, big claims.
Many people who call themselves scientists don’t do that. Maybe they think they’re doing that but they’re not, because they don’t understand the assumptions behind p-values. Maybe they think no one does that. Maybe it takes too long and it’ll get in the way of their publication record.
That does decrease trust in science. As it should. And it’s pervasive in some disciplines, which is a crisis and we should fix it. This is called the reproducibility crisis.
February 28, 2018 at 8:45 am
There is no reproducibility crisis. The crisis is in journalism that is reporting science that is untested and unready and in Pharma that is expecting that one experiment of drug X on mouse strain Y means that drug X can be used in humans. Real science takes decades to explore the space of a result. (That’s why science is about exploration and discovery!) People need to stop thinking of papers as the final result. They are one step in a long journey.
If the lab never follows up on the experiment, if no one builds on it, then how is it having any effect in the literature? Science is a building. You put your brick in the wall. Someone else puts a brick on top of yours. If their brick falls down because yours was unstable, you have to rebuild it. If you put your brick out in never-never-land and no one builds on it, why do we care about it?
The problem is that engineers (e.g. pharma) are trying to cross the river when someone threw one brick into the water. Wait until we’ve got a working bridge. Then we can cross the river. The major breakthroughs that science has seen in the last couple of centuries have typically taken 30 years to go from discovery to implementation (e.g. CCD cameras, genetic testing, NMR, deep learning, statins, etc.). This is why basic (fundamental) science is funded by the federal government, not by industry (q.v. Xerox PARC and the development of the personal computer).
March 30, 2018 at 4:38 pm
[…] In the previous issue of the Newsletter, we have already highlighted the recent paper by Bernhard Voelkl et al. from University of Bern. This publication has apparently caught the attention of many of our colleagues and triggered diverse feedback. We would like to mention one particular commentary published by ‘DrugMonkey’. […]
May 16, 2018 at 2:39 pm
[…] wait, they cited me! AHAHAA, of course I like this […]