Pier and colleagues recently published a study purporting to address the reliability of the NIH peer review process. From the summary:

We replicated the NIH peer-review process to examine the qualitative and quantitative judgments of different reviewers examining the same grant application. We found no agreement among reviewers in evaluating the same application. These findings highlight the subjectivity in reviewers’ evaluations of grant applications and underscore the difficulty in comparing the evaluations of different applications from different reviewers—which is how peer review actually unfolds.

emphasis added.

This thing is a crock and yet it has been bandied about on the Twitts as if it is the most awesome thing ever. “Aha!” cry the disgruntled applicants, “This proves that NIH peer review is horrible, terrible, no good, very bad and needs to be torn down entirely. Oh, and it also proves that it is a super criminal crime that some of my applications have gone unfunded, wah.”

A smaller set of voices expressed perplexed confusion. “Weird,” we say, “but probably our greatest impression from serving on panels is that there is great agreement in review, when you consider the process as a whole.”

So, why is the study irretrievably flawed? In broad strokes it is quite simple.
Restriction of the range. Take a look at the first figure. Does it show any correlation of scores? Any fair view would say no. Aha! Whatever is being represented on the x-axis about these points does not predict anything about what is being represented on the y-axis.

This is the mistake being made by Pier and colleagues. They constructed four peer-review panels and had them review the same population of 25 grants. The trick is that, of these, 16 were already funded by the NCI and the remaining 9 were prior, unfunded versions of grants that were later funded by the NCI.

In short, the study selects proposals from a very limited range of the applications being reviewed by the NIH. This figure shows the rest of the data from the above example. When you look at it like this, any fair eye concludes that whatever is being represented by the x value of these points predicts something about the y value. Anyone with the barest understanding of distributions and correlations gets this. Anyone with the most basic understanding grasps that a distribution does not have to show perfect correspondence for there to be a predictive relationship between two variables.
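If you want the statistical point stripped of all NIH specifics, here is a minimal simulated sketch. Every number in it is invented for illustration; the “quality” variable, the noise levels and the 12% cut are my assumptions, not anything taken from the Pier data.

```python
# Toy demonstration of restriction of range (all values invented; not the Pier data).
# Two independent "reviews" of the same applications are simulated as noisy
# readings of the same underlying quality.
import numpy as np

rng = np.random.default_rng(42)
n_apps = 2000

quality = rng.normal(50, 10, n_apps)            # hypothetical underlying merit
review_a = quality + rng.normal(0, 8, n_apps)   # first panel's score
review_b = quality + rng.normal(0, 8, n_apps)   # second, independent panel's score

full_r = np.corrcoef(review_a, review_b)[0, 1]

# Keep only the applications in roughly the top 12% on the first review,
# mimicking a payline-sized slice of funded grants.
funded = review_a >= np.percentile(review_a, 88)
funded_r = np.corrcoef(review_a[funded], review_b[funded])[0, 1]

print(f"agreement across the full range of applications: r = {full_r:.2f}")
print(f"agreement within the funded slice only:          r = {funded_r:.2f}")
```

Same simulated reviewers, same noise, same underlying quality; the only thing that changes is the slice of the distribution you allow yourself to look at, and the apparent agreement within that slice is a fraction of what it is over the full range.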

So. The authors’ claims are bogus. Ridiculously so. They did not “replicate” the peer review because they did not include a full range of scores/outcomes but instead picked the narrowest slice of the funded awards. I don’t have time to dig up historical data, but the current funding plan for NCI calls for a 10%ile payline. You can amuse yourself with the NIH success rate data here; the very first spreadsheet I clicked on gave a success rate of 12.5% for NCI R01s.

No “agreement”. “Subjectivity”. Well of course not. We expect there to be variation in the subjective evaluation of grants. Oh yes, “subjective”. Anyone who pretends this process is “objective” is an idiot. Underinformed. Willfully in denial. Review by humans is a “subjective” process by its very definition. That is what it means.

The only debate here is how much variability we expect there to be. How much precision do we expect in the process?

The most fervent defenders of the general reliability of the NIH grant peer review process almost invariably will acknowledge that the precision of the system is not high. That the “top-[insert favored value of 2-3 times the current paylines]” scoring grants are all worthy of funding and have very little objective space between them.

Yet we still seem to see this disgruntled-applicant phenotype, responding with raucous applause to a crock-of-crap conclusion like that of Pier and colleagues, who seems to feel that somehow it is possible to have a grant evaluation system that is perfect. One that returns the exact same score for a given proposal each and every time*. I just don’t understand these people.
__
Elizabeth L. Pier, Markus Brauer, Amarette Filut, Anna Kaatz, Joshua Raclaw, Mitchell J. Nathan, Cecilia E. Ford and Molly Carnes, Low agreement among reviewers evaluating the same NIH grant applications. 2018, PNAS: published ahead of print March 5, 2018, https://doi.org/10.1073/pnas.1714379115

*And we’re not even getting into the fact that science moves forward and that what is cool today is not necessarily anywhere near as cool tomorrow.

Shorthand

April 22, 2016

Storyboard
Pretty data
N-up
Prove the hypothesis
Representative image
Trend for significance

Different subcultures of science may use certain phrases that send people in other traditions into paroxysms of critique.

Mostly it is because such phrasing can sound like bad science. As if the person using it doesn’t understand how dangerous and horrible their thinking is. 

We’ve gone a few rounds over storyboarding and representative images in the past. 

Today’s topic is “n-up”, which is deployed, I surmise, after examining a few results, replicates or subjects that look promising for what the lab would prefer to be so. It raises my hackles. To me, it smells like a recipe for confirmation bias and false alarms.

Apparently this is normal phrasing for other people and merely indicates the pilot study is complete? 

How do you use the phrase?

via comment from A Salty Scientist:

Gnosis:

When you search for papers on PubMed, it usually gives the results in chronological order so many new but irrelevant papers are on the top. When you search papers on Google Scholar, it usually gives results ranked by citations, so will miss the newest exciting finding. Students in my lab recently made a very simple but useful tool Gnosis. It ranks all the PubMed hits by (Impact Factor of the journal + Year), so you get the newest and most important papers first.

Emphasis added, as if I need to. You see, relevant and important papers are indexed by the journal impact factor. Of course.
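For the record, here is a toy sketch of the ranking rule as literally stated. The records are made up and this is not the actual Gnosis code.

```python
# Toy illustration of the stated ranking rule: score = journal impact factor + year.
# The records below are invented; this is not the Gnosis implementation.
records = [
    {"title": "Older glamour-journal paper", "journal_if": 30.0, "year": 1998},
    {"title": "Brand new modest-journal paper", "journal_if": 3.0, "year": 2018},
    {"title": "Recent glamour-journal paper", "journal_if": 35.0, "year": 2016},
]

def gnosis_style_score(rec):
    # Literal reading of the described formula.
    return rec["journal_if"] + rec["year"]

for rec in sorted(records, key=gnosis_style_score, reverse=True):
    print(f'{gnosis_style_score(rec):7.1f}  {rec["title"]}')
```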

A recent HuffPo piece on that rather flagrant bit of cover trolling from TIME magazine irritates me.

One still occasionally gets whinging from some corner or other about not being able to run Analysis of Variance statistical procedures (ANOVA) because the data didn’t pass a test of normality. I.e., a test of whether they appear to fit a normal distribution.

Paper reviewers, trainees, colleagues…this can come from any corner. It betrays a grad-school-class level of understanding of what statistical analysis of data is supposed to do, but not a grasp of what it is doing for us at a fundamental level within the conduct of science.

Your stock response should be “the ANOVA is robust against violations of normality, move along.”

I note that the company GraphPad, which makes the Prism statistical/curve fitting package beloved of behavioral pharmacologists, has a tidy FAQ answer.

The extract version:

A population has a distribution that may be Gaussian or not. A sample of data cannot be Gaussian or not Gaussian. That term can only apply to the entire population of values from which the data were sampled…In almost all cases, we can be sure that the data were not sampled from an ideal Gaussian distribution… an ideal Gaussian distribution includes some very low negative numbers and some superhigh positive values…When collecting data, there are constraints on the possible values…Other variables can…have physical or physiological limits that don’t allow super large values… plenty of simulations have shown that these tests work well even when the population is only approximately Gaussian…It is hard to define what “close enough” means, and the normality tests were not designed with this in mind.
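If you want to see it for yourself, here is a minimal simulation of my own (not GraphPad's): draw three groups from the same clearly non-Gaussian population over and over, run a one-way ANOVA each time and count how often you get p < 0.05. The group size and the exponential distribution are arbitrary choices for the sketch.

```python
# Sketch: Type I error rate of one-way ANOVA when the population is skewed
# (exponential), i.e., decidedly non-Gaussian. Parameters are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, false_alarms = 5000, 12, 0

for _ in range(n_sims):
    # Three groups sampled from the SAME skewed population (no real difference).
    g1, g2, g3 = (rng.exponential(scale=10.0, size=n_per_group) for _ in range(3))
    f_stat, p = stats.f_oneway(g1, g2, g3)
    false_alarms += p < 0.05

# With a badly misbehaving test this would drift far from the nominal 5%.
print(f"observed Type I error rate: {false_alarms / n_sims:.3f}")
```

It should land near the nominal rate, which is the practical meaning of “robust against violations of normality.”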

I have a trainee running a study in which she is examining the effects of methamphetamine on Bunny Hopping using the established open field to hedgerow assay. The primary dependent variable is escape latency from stimulus onset to crossing the plane of the hedge.

She is examining the effects of a locomotor stimulant dose of methamphetamine derived from her pilot dose-response study versus vehicle in groups of Bunnies which have been trained for six weeks in our BunnyConditioning Model and age-matched sedentary Bunnies. (The conditioning training consists of various sprint, long run, horizontal hop and vertical leap modules.)

So we have four groups of Bunnies as follows:
1. Conditioned, Vehicle
2. Conditioned, Meth
3. Sedentary, Vehicle
4. Sedentary, Meth

The trainee is actually a collaborating trainee and so these data involve the analytic input of multiple PIs in addition to the trainee’s opinion. We are having a slight disagreement over the proper analysis technique, so I thought I would turn to the brilliant DM readers.
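To give the discussion something concrete to chew on, here is one candidate approach, sketched with fabricated numbers: a 2 (Conditioning) x 2 (Drug) between-subjects ANOVA on escape latency. The column names, group sizes and latency values are all invented for illustration, not our actual data or necessarily the analysis we will settle on.

```python
# Sketch of a 2 (Conditioning) x 2 (Drug) between-subjects ANOVA on escape
# latency. The data frame is fabricated purely to show the setup.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n_per_cell = 8
cells = [("Conditioned", "Vehicle"), ("Conditioned", "Meth"),
         ("Sedentary", "Vehicle"), ("Sedentary", "Meth")]

rows = []
for conditioning, drug in cells:
    latencies = rng.normal(loc=12.0, scale=3.0, size=n_per_cell)  # seconds, made up
    rows += [{"conditioning": conditioning, "drug": drug, "latency": lat}
             for lat in latencies]
df = pd.DataFrame(rows)

# Main effects of conditioning and drug, plus their interaction.
model = ols("latency ~ C(conditioning) * C(drug)", data=df).fit()
print(anova_lm(model, typ=2))
```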

I stumbled back onto something I’ve been meaning to get to. It touches on the ethical use of animals in research, the oversight process for animal research, and the way we think about scientific inference.

 

Now, as has been discussed here and there in the animal use discussions, one of the central tenets of the review process is that scientists attempt to reduce the number of animals wherever possible. Meaning that, without compromising the scientific outcome, the minimum number of subjects required should be used. No more.

[Figure: physioprofitin error bars, captioned “run more subjects..”]

We accept more or less as bedrock that a result is real if it meets the appropriate statistical test at the standard of p < 0.05. Meaning that if you sampled the set of numbers you have in hand 100 times from the same underlying population, fewer than five times would you get the result you did by chance. From which you conclude it is likely that the populations are in fact different.
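A quick sketch of what that convention buys you (a toy simulation, with arbitrary numbers): draw two samples from the same population over and over and count how often a t-test declares a "difference."

```python
# Sketch: two samples drawn from the SAME population, repeated many times.
# Roughly 5% of comparisons come up "significant" at p < 0.05 by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_per_group, hits = 10000, 10, 0

for _ in range(n_sims):
    a = rng.normal(100, 15, n_per_group)
    b = rng.normal(100, 15, n_per_group)  # same underlying population as a
    hits += stats.ttest_ind(a, b).pvalue < 0.05

print(f"'significant' results from identical populations: {hits / n_sims:.3f}")
```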

 

There is an unfortunate tendency in science, however, to believe that if your statistical test returns p < 0.01, this result is better. Somehow more significant, more reliable or more…real. On the part of the experimenter, on the part of his supervising lab head, on the part of paper reviewers and on the part of readers. Particularly the journal club variety.

False.
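One way to see it (a toy simulation of my own; the effect size, group sizes and number of replicates are arbitrary): give yourself a perfectly real effect, replicate the same modestly powered experiment twenty times, and watch what the p-value does.

```python
# Sketch: the same true effect, the same design, replicated repeatedly.
# The p-value bounces all over the place; a single p < 0.01 is not a badge
# of a "more real" effect. Parameters are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_replicates, n_per_group = 20, 12
true_effect = 0.8  # a genuine standardized difference between populations

for i in range(n_replicates):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    p = stats.ttest_ind(control, treated).pvalue
    print(f"replicate {i + 1:2d}: p = {p:.3f}")
```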


So you’ve just completed your last assays on physioprofitin signaling in the Namnezian complex. Lo and behold it is qaz-mediated, just like you suspected and the beccans are off the freaking chart. woot! PiT/PlS ratios are within relevant physiological ranges and still this work of art, your labor of love, came through with the experimental goods.
With a hope and a prayer you run your stats….and YES! p < 0.01!!!!!!!!!!!!!!!
What is the correct way to report your big Result?

The statistical analysis ____________ qaz-mediated upregulation of physioprofitin in the Namnezian complex.

Andrew Gelman, professor of statistics at Columbia University, has joined the Scienceblogs as author of Applied Statistics. One look at his professional page and you will see why I am so happy to have him here:

Andrew has done research on a wide range of topics, including: …[snip]… methods in surveys, experimental design, statistical inference,

I think I am going to enjoy having this guy around. If you want to know what you are in for, visit his old blog for a taste.

In case you have been living under a rock (yet inexplicably reading this blog) Usain Bolt has now run both the 100 meter and 200 meter track events faster than anyone ever. The margin of improvement in the 100 meter event (which occurred earlier) was sufficient to start the sports world abuzz. Naturally, sports fans are willing to talk endlessly about the most absurd minutia and implications of such an event in terms pedestrian and embarrassingly overwrought.
YHN is no different.


“What we need are one-handed scientists!” – Sen Edmund Muskie

The title is a question that is most frequently asked by a parent or close relative of an individual (typically male) who is in the early adult years. Said parent is clearly distressed by the career choices made/not made by their son, grandson or nephew who (they have finally acknowledged to themselves) smokes a lot of dope. Has for years and shows no signs of quitting.
The pot smoker has, of course, turned out to be a disappointment to their relatives in one way or another, typically vocationally. And they ask me, almost pleadingly, frequently with a tinge of self-flagellation, to confirm their suspicion that the pot smoking is at the root of junior’s lack of gumption.
I have to tell them that nobody can satisfactorily answer this question for them. Not me, not science. Not with any real certainty, anyway.


I often think to myself that if we could effectively teach one single concept to all citizens, get them to really understand it and apply it to life, we would all be better off. It has relevance for so many facets of our public and private decision making; ofttimes ignorance of this concept makes for a dismal political or personal outcome.
Brazillion Thoughts has an English-language translation of a post originally written by Karl at Ecce Medicus.
The translation reads in part:

Many times, in my practice, I am required to explain some statistical concepts to my patients in order to make them avoid some frequent pitfalls. The most common concept I explain is what is “normal” in lab exams. Let’s suppose someone invents a new lab test to measure the glucose in the blood. How would we determine what are the normal values for this test?
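I will not spoil where Karl takes it, but the usual convention is easy to sketch: measure a bunch of apparently healthy people and call the central 95% “normal.” The glucose numbers below are simulated and purely illustrative, not from any real reference population.

```python
# Sketch of the standard "reference interval" convention for lab values:
# the central 95% of an apparently healthy population is labeled "normal".
# The glucose values are simulated, for illustration only.
import numpy as np

rng = np.random.default_rng(11)
healthy_glucose = rng.normal(90, 8, 5000)  # mg/dL, made-up healthy population

low, high = np.percentile(healthy_glucose, [2.5, 97.5])
flagged = np.mean((healthy_glucose < low) | (healthy_glucose > high))

print(f"reference interval: {low:.0f}-{high:.0f} mg/dL")
# By construction, about 1 in 20 perfectly healthy people falls outside "normal".
print(f"healthy people flagged as 'abnormal': {flagged:.1%}")
```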