I often write blog comments about NIH grant review matters that exist in an uncomfortable tension between what NIH wants us to do on study section and what I see as our professional obligation to try to make NIH do what we want it to do. Part of the peer contribution to determining what grants should be funded is inevitably doing things that, you guessed it, determine what grants should be funded.

NIH could not be any clearer that they do not want their peer reviewers determining what grants should be funded on an acute and tactical basis. We evaluate and they decide. I point this out with regularity on this blog. I defend this with regularity on this blog.

But there are many ways in which I think we extramural academic scientists also contribute to what Program should decide, including by being very intentional about how we evaluate the merit of proposals within NIH’s review structure and desired approach.

As we face yet another contraction of paylines and implicit paylines*, I believe this tension will become increasingly acute over the next few to several rounds of review. NCI famously threatened a 4 percentile payline for the end of FY2025. I don’t know if that came to pass, but it would roughly halve the high single-digit paylines of many recent Fiscal Years. Multi-year funding commitments inevitably mean success rates (and implicit paylines) are going to be lower in FY2025, and likely for a few more years to come, even if the budget remains where it is. As paylines and virtual paylines sink lower into the single digits, the way peer review hands out those percentile ranks gets increasingly important.

A two-percentile shift from 3%ile to 5%ile had no impact at even the stingiest ICs a year or two ago. Now it may mean the difference between funded and not funded.

So what is the problem with nice study sections? Sure, everyone knows that percentiles are there to account for hardass sections that give everything a 30 or worse and easy sections that hand out 20s like tic tacs. But there is a more technical feature of percentile calculation that is not apparent to everyone, and it interacts with the kind of niceness in study sections that reflects being nice to other reviewers and other proposals.

And this is the two-sided coin of “consensus”.

When NIH adopted the 1-9 scoring scheme, there was a subtext that they actually intended to produce a lot of tied scores at 20, 30 and 40. This was a highly predictable result of study section consensus, and the data show it arrived as predicted.

[Updated to add 11/25/2025: I located a depiction of the scores here, on an NIH/CSR webpage, “A Pilot Study of Half-Point Increments in Scoring”. The takeaway interpretation of their study was that adding the half-point increments did not affect score spreading. It says absolutely nothing about percentile calculation. Which I chalk up to the usual way NIH goes about evaluating their changes in policy, i.e., misdirection to support their policy moves rather than seeking true illumination.]

Reviewers nominate pre-Discussion and post-Discussion scores in whole numbers. The study section is mostly expected to vote within the range of the post-Discussion scores. Understandably, it is often the case that reviewers “reach consensus” and all give a 2. Or a 3. Or, less frequently, even a 4. Even a 1.

So let’s take a theoretical study section that reviews 100 R01 applications in their three rounds of review. Let’s say they “reach consensus” on seven of them at an impact score of 2, which translates into a voted score of 20. I can’t recall a single study section I’ve been on since the new scoring system was put in place that didn’t have at least one application getting a 2-2-2 post-discussion, and mostly it has been more than one. Let’s also say, for didactic purposes, that there are three applications in the year that score better than 20, even if only a single person voted a 1**.

NIH calculates the percentile as the average rank of tied-score applications. So the first three get 1%ile, 2%ile and 3%ile, but the seven apps tied at 20, which take up ranks 4-10, all get 7%ile. The 4th-ranked grant that should have been a 4%ile has now shot way over the payline. For those ICs that use a soft payline and robust pickup behavior, there is also a problem for the 11th-ranked proposal. All of a sudden the 10-11 split is not 10%ile to 11%ile but 7%ile to 11%ile, creating a perceived quality distinction that is not at all warranted.
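To make the arithmetic concrete, here is a minimal sketch of that averaging rule in Python. This is just a toy model of the calculation as described in this post, not NIH’s actual code, and the ranks and counts are the hypothetical ones from the example above.

```python
def tied_block_percentile(first_rank, last_rank, n_apps):
    """Percentile shared by a block of tied-score applications:
    the average of their ranks, as a percentage of all applications
    reviewed (per the model described in this post)."""
    avg_rank = (first_rank + last_rank) / 2
    return 100 * avg_rank / n_apps

# Seven applications tied at 20, occupying ranks 4 through 10 of 100:
print(tied_block_percentile(4, 10, 100))   # 7.0 -> all seven get 7%ile
# The untied application at rank 11 of 100:
print(tied_block_percentile(11, 11, 100))  # 11.0 -> hence the 7-to-11 jump
```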

To take another example, suppose only the 4th through 7th proposals are tied at 20? No biggie, the average rank is 5.5, pretty close to the perception of a 4%ile, right? Nope. The NIH rounds up. So that now becomes a 6%ile instead of a 4%ile score.

It gets worse. Suppose that instead of 100 proposals, the study section reviews somewhat fewer? Say, 90. Well, now that 5.5 average rank calculates to 6.1%ile, so no difference, right? NOPE!

When I say the NIH rounds up, I mean they ROUND UP!

Anything over the whole number goes UP. A 6.1%ile is rounded up to 7%ile. And honestly I do not know how strict this is, but I assume anything over the whole number is rounded up. So even 99 apps in the denominator may have this impact.
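Extending the sketch above with the round-up makes both examples drop out directly. Again, this assumes the rule as described here (average rank, then round any fraction up to the next whole percentile); I am not claiming this is the exact production formula.

```python
from math import ceil

def tied_block_percentile(first_rank, last_rank, n_apps):
    """Average rank of the tied block as a percentage of all
    applications, rounded UP to the next whole percentile."""
    avg_rank = (first_rank + last_rank) / 2
    return ceil(100 * avg_rank / n_apps)

# Ranks 4 through 7 tied at 20, out of 100 applications:
print(tied_block_percentile(4, 7, 100))  # 5.5 -> 6%ile
# The same tie in a section that reviewed only 90 applications:
print(tied_block_percentile(4, 7, 90))   # 6.11... -> 7%ile
```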

I remind you this was intentional. At least some of the plan for this scoring system was to try to force Program Officers to stop relying on nonsensically small differences in score/percentile and apply their brains to pick and choose which of the tied-score grants were going to get funded. Or maybe we should phrase this as allowing them to ignore the tyranny of false distinction.

The result of this is to diminish the impact of study sections in deciding which grants were going to be funded and which were not. Study sections which are “nice”, reaching consensus after discussion and voting in a bunch of tied scores, are diminishing their own impact. Maybe that proposal that should have been ranked fourth in their section gets a 7%ile instead of a 4%ile, dooming its chances for funding.

Depending on what other study sections are doing, they could even be all but guaranteeing that their section ends up with fewer funded proposals than other sections, simply because they are “nice”.

Thereby screwing over their own people.

Which isn’t actually very nice.


*As always, I remind you to look at the NIH Data Book page on funding by percentiles and start using the dialogs for FY and IC to see what’s what in past funding climates.

**Many years ago when I was appointed to a study section, a new Chair declared that we “must” reserve our very best scores for superlative applications that were on the order of the best we’d ever reviewed, lifetime. I may have blogged about how silly and statistically ignorant that was; I can’t recall. This sentiment, I’m here to tell you, is still with us. Under this approach to review, anything below a 20 would be very rare.
