Clustering under the new NIH scoring system, just as predicted.
April 1, 2010
Score clustering, that is.
From the recent NIAID Funding bulletin (h/t: writedit), we have confirmation of what everyone expected when the new scoring system was announced for NIH grant review.
As a brief reminder, under the old system NIH grant reviewers assigned scores from 1.0 to 5.0 in decimal increments. The average of the scores assigned by the panel members (for any applications that were discussed at the meeting, roughly the top 40-50%) was multiplied by 100 to give the old application priority score range of 100 (the best) to ~250 (assuming about half were scored). The new system changed to an integer assignment system of 1 to 9.
Prior experience affirms that when the three assigned reviewers were in fairly tight agreement about a score under the prior system, the range might be from 1.2 to 1.5. A more typically ambivalent (but still “pretty good app”) range might have been from 1.3 to 1.8. Add on the effect of ~20 other panel members and you are looking at score outcomes that are reasonably distributed between, say, 120-150 or 130-180. Lots of chances for numerical differences between closely-scored applications.
The new system poses the chance that a lot of “ranges” for the application are going to be 1-2 or 2-3 and, in some emerging experiences, a whole lot more applications where the three assigned reviewers agree on a single number. Now, if that is the case and nobody from the panel votes outside the range (which they do not frequently do), you are going to end up with a lot of tied 20 and 30 priority scores, since the panel average is now multiplied by ten. That was the prediction anyway.
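To make the arithmetic concrete, here is a minimal sketch of the two scoring schemes as described above. The five-member panels, the application labels and every vote are made up for illustration, and the function names are mine, not anything NIH uses.

```python
from collections import Counter

def old_priority_score(votes):
    """Old system: votes of 1.0 to 5.0 in 0.1 steps; panel mean times 100."""
    return round(sum(votes) / len(votes) * 100)

def new_impact_score(votes):
    """New system: integer votes of 1 to 9; panel mean times 10."""
    return round(sum(votes) / len(votes) * 10)

# Hypothetical five-member panels voting on four closely matched applications.
old_votes = {
    "A": [1.2, 1.2, 1.3, 1.3, 1.4],
    "B": [1.2, 1.3, 1.3, 1.4, 1.4],
    "C": [1.3, 1.3, 1.4, 1.4, 1.5],
    "D": [1.3, 1.4, 1.4, 1.5, 1.5],
}
# The same applications under integer voting: nearly everyone lands on 2.
new_votes = {
    "A": [2, 2, 2, 2, 2],
    "B": [2, 2, 2, 2, 2],
    "C": [2, 2, 2, 2, 2],
    "D": [2, 2, 2, 2, 3],
}

old_scores = {app: old_priority_score(v) for app, v in old_votes.items()}
new_scores = {app: new_impact_score(v) for app, v in new_votes.items()}

print(old_scores)                    # four distinct priority scores
print(new_scores)                    # three applications tied at 20
print(Counter(new_scores.values()))  # how many applications share each score
```

With these invented votes the old system separates all four applications (128, 132, 138, 142), while the new one leaves three of them tied at 20 and the fourth at 22.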
NIAID has data from one study section that verifies the prediction.
The bulletin points out the obvious:
In one study section, overall impact scores are clustered at 20, which translates to the 9 percentile. That study section has 8 ties, all of which get a percentile of 9. Because of that clustering, 21, the next score, skips a point to a percentile of 11, which is above the payline. Hence the distortion.
In contrast, when few applications score at a given level, several scores can get the same percentile. For example in the same study section, overall impact scores of 13 to 15 all translate to the 2 percentile.
Lately, we have seen a disproportionate number of overall impact scores at 10, 20, and 30. In the first two review cycles of this fiscal year, approximately 3 percent of applications reviewed by CSR received a score of 20.
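The percentile skip the bulletin describes is easy to reproduce with a toy calculation. The sketch below assumes a mid-rank percentile rule (100 times the tie group's mean rank minus 0.5, divided by the number of applications, rounded up). That rule is my assumption for illustration and may not match how NIH actually computes percentiles, and the base of 100 applications is invented.

```python
from collections import Counter

def midrank_percentiles(scores):
    """Map each distinct score to a percentile using the mid-rank of its tie group.

    Assumed rule (illustration only): 100 * (mean rank - 0.5) / N, rounded up.
    Lower scores are better, so rank 1 goes to the lowest score.
    """
    n = len(scores)
    counts = Counter(scores)
    result = {}
    first = 1  # rank of the first application in the current tie group
    for score in sorted(counts):
        last = first + counts[score] - 1
        # mean rank = (first + last) / 2, so 100 * (mean rank - 0.5) / n
        # equals (first + last - 1) * 100 / (2 * n); use integer ceiling division
        numerator = (first + last - 1) * 100
        result[score] = -(-numerator // (2 * n))
        first = last + 1
    return result

# Hypothetical base of 100 applications: eight tied at 20, singles at nearby scores.
scores = [10, 13, 15, 17, 18, 19] + [20] * 8 + [21, 22] + [50] * 84
pct = midrank_percentiles(scores)

for s in (19, 20, 21):
    print(s, pct[s])
```

Under these assumptions a score of 19 lands at the 6th percentile, the eight tied 20s all share the 10th, and a 21 jumps to the 15th. One point of score costs five points of percentile purely because of the cluster sitting at 20.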
I love the tone of that NIAID bulletin. “blur the payline picture”. “distortion”. “the uncertainty we face”. HAHAHAHAAHA. Why, you might almost get the idea that they are as mystified as the applicant. And on our side, don’t you see? What a joke. It cannot be repeated frequently enough.
The NIH knew this going in.
The new scoring system may produce more applications with identical scores (“tie” scores). Thus, other important factors, such as mission relevance and portfolio balance, will be considered in making funding decisions when grant applications are considered essentially equivalent on overall impact, based on reviewer ratings.
More importantly, read this 1996 report on NIH grant scoring. Page 2 of the document proper (page 9 of the pdf) has this bullet point which I find highly illustrative.
Scores are generated (by reviewers), calculated (by computer algorithms) and used (by program staff) as if they represented a higher degree of reliability and precision than they actually have…. Program staff are then put in a position of and held hostage to making and defending funding decisions based on these small and likely meaningless mathematical differences.
The NIH wanted and expected a lot more identical scores. Which inevitably come with massive percentile jumps across single allowable priority score increments. It was built in. The natural result is a lot of kvetching from applicants who see a 1-pt difference in priority score and a big-ol, and highly funding-relevant, jump in percentile between their application and that of their colleague down the hall (I think their 1-pt to 2%ile example here is optimistic, personally).
Beyond the general discontent this creates among my fellow PIs who are not thinking this whole thing through, I have to say I’m in favor of this approach and the outcome. I feel that in the past POs were all too willing to act as if they believed, and likely did actually believe, that “small and likely meaningless mathematical differences” in score were producing bedrock quality distinctions. I felt that this allowed them to take the easy way out when it came to sorting through applications for funding. Easy to ignore bunny hopper bias which resulted in 10X of the same-ol, same-ol projects being funded. Easy to ignore career-stage bias. Easy to think that if HawtNewStuff really was all that great, of course it would get good scores. Etc.
I like that the POs are going to have to look at the tied applications and really think about them. It doesn’t guarantee a given PO will come to the decision that you or I would, but at least they have to make a decision based on something related to the content of the application.