Pier and colleagues published a study purporting to evaluate the reliability of NIH-style peer review of grant applications. Related work that appears to be from the same study was published by this group in 2017.

From the supplement to the 2018 paper, we note that the reviewer demographics were 62% Asian and 38% white, with zero Black or Hispanic reviewers. I don’t know how that matches the panels that handle NCI applications but I would expect some minimal Black/Hispanic representation and a lot lower Asian representation to match my review panel experiences. The panels were also 24% female, which seems to match with my memory of NIH stats for review running under 1/3 women.

There were 17% of reviewers at assistant professor rank. This is definitely a divergence from CSR practice. The only data I saw right around the time of Scarpa’s great Purge of Assistant Professors suggested a peak of 10% of reviewers. Given the way ad hoc / empaneled reviewer loads work, I think we can conclude that way fewer than 10% of reviews were coming from Assistant Professors. As you know, we are now a decade past the start of the purge and these numbers have to be lower. So the panel demographics are not similar.

N.b., the 2017 paper says they surveyed the reviewers on similarity to genuine NIH review experience but I can’t find anywhere it states the amount of review experience for the subjects. Similarly, while they all had to have been awarded at least one R01, we don’t know anything about their experiences as applicants. Might be relevant. A missed opportunity would seem to be the chance to test the effect of reviewer demographics in the 2017 paper, which covers more about the process of review, calibration of scoring, agreement after discussion, etc.

The paper(s) also say that they tried to de-identify the applicants.

All applications were deidentified, meaning the names of the PIs, any co-investigators, and any other research personnel were replaced with pseudonyms. We selected pseudonyms using public databases of names that preserved the original gender, nationality, and relative frequency across national populations of the original names. All identifying information, including institutional addresses, email addresses, phone numbers, and hand-written signatures were similarly anonymized and re-identified as well.

I am still looking but I cannot find any reference to any attempt of the authors to validate whether the blinding worked. Which is in and of itself a fascinating question. But for the purposes of the “replication” of NIH peer review we must recognize that Investigator and Environment are two of five formally co-equal scoring criteria. We know that the NIH data show poor correlation of Investigator and Environment criterion scores with overall voted impact score (Approach and Significance are the better predictors), but these are still scoring criteria. How can this study attempt to delete two of these and then purport to be replicating the process? It is like they intentionally set out to throw noise into the system.

I don’t think the review panels triaged any of the 25 proposals. The vast majority of NIH review involves triage of the bottom ~half of the assigned proposals. Reviewers know this when they are doing their preliminary reading and scoring.

Pier and colleagues recently published a study purporting to address the reliability of the NIH peer review process. From the summary:

We replicated the NIH peer-review process to examine the qualitative and quantitative judgments of different reviewers examining the same grant application. We found no agreement among reviewers in evaluating the same application. These findings highlight the subjectivity in reviewers’ evaluations of grant applications and underscore the difficulty in comparing the evaluations of different applications from different reviewers—which is how peer review actually unfolds.

emphasis added.

This thing is a crock and yet it has been bandied about on the Twitts as if it is the most awesome thing ever. “Aha!” cry the disgruntled applicants, “This proves that NIH peer review is horrible, terrible, no good, very bad and needs to be torn down entirely. Oh, and it also proves that it is a super criminal crime that some of my applications have gone unfunded, wah.”

A smaller set of voices expressed perplexed confusion. “Weird”, we say, “but probably our greatest impression from serving on panels is that there is great agreement of review, when you consider the process as a whole.”

So, why is the study irretrievably flawed? In broad strokes it is quite simple.
Restriction of the range. Take a look at the first figure. Does it show any correlation of scores? Any fair view would say no. Aha! Whatever is being represented on the x-axis about these points does not predict anything about what is being represented on the y-axis.

This is the mistake being made by Pier and colleagues. They have constructed four peer-review panels and had them review the same population of 25 grants. The trick is that, of these, 16 were already funded by the NCI and the remaining 9 were prior unfunded versions of grants that were funded by the NCI.

In short, the study selects proposals from a very limited range of the applications being reviewed by the NIH. This figure shows the rest of the data from the above example. When you look at it like this, any fair eye concludes that whatever is being represented by the x value about these points predicts something about the y value. Anyone with the barest of understanding of distributions and correlations gets this. Anyone with the most basic understanding grasps that a distribution does not have to have perfect correspondence for there to be a predictive relationship between two variables.
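For anyone who wants to see the range-restriction effect without the figures, here is a minimal simulation sketch. To be clear, everything in it is my assumption, not the study's data or method: each application has a latent "merit" drawn from a standard normal, two hypothetical panels each score it as merit plus independent noise, and "restriction" means keeping only the top 10% of applications by merit. The names `simulate` and `pearson` and all parameter values are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand (stdlib only)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_apps=20000, noise_sd=1.0, top_fraction=0.10, seed=0):
    """Return (full-range r, restricted-range r) for two noisy panels.

    Assumptions (hypothetical, for illustration only): latent merit is
    standard normal, and each panel's score is merit plus independent
    Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    merit = [rng.gauss(0, 1) for _ in range(n_apps)]
    panel_a = [m + rng.gauss(0, noise_sd) for m in merit]
    panel_b = [m + rng.gauss(0, noise_sd) for m in merit]

    # Agreement between the two panels across the full range of applications.
    full_r = pearson(panel_a, panel_b)

    # Keep only the top slice by latent merit, a stand-in for
    # "applications that were good enough to get funded".
    cutoff = sorted(merit, reverse=True)[int(n_apps * top_fraction) - 1]
    keep = [i for i, m in enumerate(merit) if m >= cutoff]
    restricted_r = pearson([panel_a[i] for i in keep],
                           [panel_b[i] for i in keep])
    return full_r, restricted_r


full_r, restricted_r = simulate()
print(f"full range r = {full_r:.2f}, top-slice r = {restricted_r:.2f}")
```

Under these assumptions the two panels agree moderately across the whole pool, and most of that agreement vanishes inside the top slice, even though nothing about the reviewers changed. Selecting on the original study section's noisy score instead of latent merit would be closer to reality, but the qualitative point should be the same.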

So. The authors’ claims are bogus. Ridiculously so. They did not “replicate” the peer review because they did not include a full range of scores/outcomes but instead picked the narrowest slice of the funded awards. I don’t have time to dig up historical data but the current funding plan for NCI calls for a 10%ile payline. You can amuse yourself with the NIH success rate data here; the very first spreadsheet I clicked on gave a success rate of 12.5% for NCI R01s.

No “agreement”. “Subjectivity”. Well of course not. We expect there to be variation in the subjective evaluation of grants. Oh yes, “subjective”. Anyone that pretends this process is “objective” is an idiot. Underinformed. Willfully in denial. Review by humans is a “subjective” process by its very definition. That is what it means.

The only debate here is how much variability we expect there to be. How much precision do we expect in the process?

The most fervent defenders of the general reliability of the NIH grant peer review process almost invariably will acknowledge that the precision of the system is not high. That the “top-[insert favored value of 2-3 times the current paylines]” scoring grants are all worthy of funding and have very little objective space between them.

Yet we still seem to see this disgruntled applicant phenotype, responding with raucous applause to a crock of crap conclusion like that of Pier and colleagues, that seems to feel that somehow it is possible to have a grant evaluation system that is perfect. That returns the exact same score for a given proposal each and every time*. I just don’t understand these people.
Elizabeth L. Pier, Markus Brauer, Amarette Filut, Anna Kaatz, Joshua Raclaw, Mitchell J. Nathan, Cecilia E. Ford and Molly Carnes, Low agreement among reviewers evaluating the same NIH grant applications. 2018, PNAS: published ahead of print March 5, 2018, https://doi.org/10.1073/pnas.1714379115

*And we’re not even getting into the fact that science moves forward and that what is cool today is not necessarily anywhere near as cool tomorrow.