Clustering under the new NIH scoring system, just as predicted.

April 1, 2010

Score clustering, that is.
From the recent NIAID Funding bulletin (h/t: writedit), we have confirmation of what everyone expected when the new scoring system was announced for NIH grant review.
As a brief reminder, scores used by NIH grant reviewers under the old system ranged from 1.0 to 5.0 in 0.1 increments. The average of the scores assigned by the panel members (for any application that was discussed at the meeting, roughly the top 40-50%) was multiplied by 100 to give the old application priority score range of 100 (the best) to ~250 (assuming about half were scored). The new system changed to an integer assignment system of 1 to 9.
Prior experience affirms that when the three assigned reviewers were in fairly tight agreement about a score under the prior system, the range might run from 1.2 to 1.5. A more typically ambivalent (but still “pretty good app”) range might have been from 1.3 to 1.8. Add on the effect of ~20 other panel members and you are looking at score outcomes that are reasonably distributed between, say, 120-150 or 130-180. Lots of chances for numerical differences between closely-scored applications.
The new system raises the chance that a lot of “ranges” for an application are going to be 1-2 or 2-3 and, in some emerging experiences, a whole lot more applications where the three assigned reviewers agree on a single number. Now, if that is the case and nobody from the panel votes outside the range (which they rarely do), you are going to end up with a lot of tied 20 and 30 priority scores. That was the prediction, anyway.
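A toy simulation makes the arithmetic concrete. Everything below is made up for illustration, not NIH or NIAID data: assume each application has some latent quality, the panel scores it with a bit of noise, and the final score is the panel average appropriately scaled and rounded. Under the old decimal scale, distinct scores abound; under the new integer scale, the same panel piles up on a handful of tied scores.

```python
# Toy simulation, not NIH data. Assumptions are invented for illustration:
# each application has a latent "quality" on the old 1-5 scale, 23 panel
# members (3 assigned + ~20 others) score it with a little noise, and the
# final priority score is the scaled panel average, rounded.
import random
from collections import Counter

random.seed(1)
PANEL = 23

def old_system_score(quality):
    """Old system: votes from 1.0 to 5.0 in 0.1 steps; average times 100."""
    votes = [round(min(5.0, max(1.0, random.gauss(quality, 0.15))), 1)
             for _ in range(PANEL)]
    return round(sum(votes) / len(votes) * 100)

def new_system_score(quality):
    """New system: integer votes 1-9 (same latent quality, rescaled); average times 10."""
    q9 = 1 + (quality - 1) * 2                      # map the 1-5 scale onto 1-9
    votes = [min(9, max(1, round(random.gauss(q9, 0.3)))) for _ in range(PANEL)]
    return round(sum(votes) / len(votes) * 10)

qualities = [random.uniform(1.2, 2.5) for _ in range(100)]   # the discussed half
for name, scorer in (("old", old_system_score), ("new", new_system_score)):
    scores = [scorer(q) for q in qualities]
    tied = sum(n for n in Counter(scores).values() if n > 1)
    print(f"{name} system: {len(set(scores))} distinct scores; "
          f"{tied} of 100 applications share a score with at least one other")
```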
NIAID has data from one study section that verifies the prediction.

[Figure: NIAID chart showing the clustering of overall impact scores in one study section]

The bulletin points out the obvious:

In one study section, overall impact scores are clustered at 20, which translates to the 9 percentile. That study section has 8 ties, all of which get a percentile of 9. Because of that clustering, 21, the next score, skips a point to a percentile of 11, which is above the payline. Hence the distortion.
In contrast, when few applications score at a given level, several scores can get the same percentile. For example in the same study section, overall impact scores of 13 to 15 all translate to the 2 percentile.
Lately, we have seen a disproportionate number of overall impact scores at 10, 20, and 30. In the first two review cycles of this fiscal year, approximately 3 percent of applications reviewed by CSR received a score of 20.

I love the tone of that NIAID bulletin. “blur the payline picture”. “distortion”. “the uncertainty we face”. HAHAHAHAAHA. Why, you might almost get the idea that they are as mystified as the applicant, and on our side, don’t you see? What a joke. It cannot be repeated frequently enough.

This is a feature, not a bug.

The NIH knew this going in.

The new scoring system may produce more applications with identical scores (“tie” scores). Thus, other important factors, such as mission relevance and portfolio balance, will be considered in making funding decisions when grant applications are considered essentially equivalent on overall impact, based on reviewer ratings.

More importantly, read this 1996 report on NIH grant scoring. Page 2 of the document proper (page 9 of the pdf) has this bullet point which I find highly illustrative.


Scores are generated (by reviewers), calculated (by computer algorithms) and used (by program staff) as if they represented a higher degree of reliability and precision than they actually have…. Program staff are then put in a position of and held hostage to making and defending funding decisions based on these small and likely meaningless mathematical differences.

Emphasis added.
The NIH wanted and expected a lot more identical scores, which inevitably come with massive percentile jumps across single allowable priority score increments. It was built in. The natural result is a lot of kvetching from applicants who see a 1-pt difference in priority score and a big-ol’, and highly funding-relevant, jump in percentile between their application and that of their colleague down the hall (I think their 1-pt to 2%ile example here is optimistic, personally).
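If you want to see how the jump falls out of the percentiling arithmetic, here is a deliberately simplified sketch. The percentile rule below is my approximation, not NIH’s official formula, and the score pool is invented, shaped only so the numbers come out close to the NIAID example (13-15 at the 2nd percentile, a block of ties at 20 landing on 9, and 21 skipping to 11).

```python
# A simplified percentile rule (my approximation, not NIH's official formula)
# applied to a made-up pool of scores, shaped so the output roughly mirrors
# the NIAID example quoted above.
def percentile(score, pool):
    # Lower score = better. Count applications scoring strictly better,
    # plus half of those tied at this score, as a share of the whole pool.
    better = sum(s < score for s in pool)
    tied = sum(s == score for s in pool)
    return round(100 * (better + 0.5 * tied) / len(pool))

pool = ([10] * 2 + [11] + [12] + [13] * 2 + [14] + [15]
        + [s for s in range(16, 20) for _ in range(4)]   # a thin 16-19 band
        + [20] * 8 + [21] * 3                            # the tied cluster
        + [s for s in range(22, 91) for _ in range(4)])  # everything worse

for s in (13, 14, 15, 20, 21):
    print(f"score {s}: percentile {percentile(s, pool)}")
```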
Beyond this general discontent effect on my fellow PIs who are not thinking this whole thing through, I have to say I’m in favor of this approach and the outcome. I feel that in the past POs were all too willing to act as if they believed, and likely did actually believe, that “small and likely meaningless mathematical differences” in score were producing bedrock quality distinctions. I felt that this allowed them to take the easy way out when it came to sorting through applications for funding. Easy to ignore bunny hopper bias which resulted in 10X of the same-ol, same-ol projects being funded. Easy to ignore career-stage bias. Easy to think that if HawtNewStuff really was all that great, of course it would get good scores. Etc.
I like that the POs are going to have to look at the tied applications and really think about them. It doesn’t guarantee a given PO will come to the decision that you or I would, but at least they have to make a decision based on something related to the content of the application.

18 Responses to “Clustering under the new NIH scoring system, just as predicted.”

  1. Lab Lemming Says:

    Then why use numerical scores at all?


  2. Rob Says:

Back when the Medicinal Chemistry study section was “merged” with the Bioorganic and Natural Products study section into SBCA and SBCB (Synthetic and Biological Chemistry), we had a chance to rebase scores so that the median score was 3.0. This worked very well for 2-3 meetings, where really good grants got scores below 1.5 but fundable grants (12-15 percentile) were nearer 2.0. After about three meetings, the scores went right back to where they were before the reorganization, clustered as you note around 1.7-1.8. It’s human nature. With apologies to Garrison Keillor, all the grants were above average.


  3. qaz Says:

    Lab Lemming #1 is right. The new system still gives the illusion of precision and now it gives the illusion of clustering. But the math is now such that clustering is a consequence of random study section personalities. If all three reviewers say “2”, then most of the study section will say “2” and the score will be 20. If one reviewer is a *#$@ and says “5”, then the score will be 21 and the proposal will fall out of the cluster.
    At this point, I don’t know what NIH wanted. If NIH wanted more clustering, so that program would have more power (like it does at NSF), then they should have switched to a non-scored system.
    If NIH wanted to fix all those biases, they should do it explicitly, not through this backhanded mess with this new scoring system. For example, career bias should have been handled by separating funding pools from the start, not by mucking with the scoring system to create more clustering.
    BTW, there is absolutely no evidence that this clustering helps any of those problems (too much bunny-hopping, career biases). For all we know, the early stage investigators are getting 21s and the senior people are getting 20s. (I can easily imagine scenarios where that would occur – for example, imagine a scenario where someone is more likely to vote out of range on an early career proposal than a late one.)
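The arithmetic in qaz’s example checks out, assuming the overall impact score is simply the panel mean times ten, rounded to a whole number:

```python
# Checking the arithmetic in the comment above, assuming the overall impact
# score is simply the panel mean times ten, rounded to a whole number.
from statistics import mean

panel = [2] * 20                        # everyone agrees on a 2
print(round(mean(panel) * 10))          # -> 20

panel_with_outlier = [2] * 20 + [5]     # one reviewer votes far out of range
print(round(mean(panel_with_outlier) * 10))   # -> 21 (45/21 = 2.14, times 10)
```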


  4. qaz Says:

    Rob #2 – All the grants you saw were above average. Just as all the gymnasts at the Olympics are clustered near the top. The original scoring system had a complete range to make room for the occasional disaster. (I have seen a grant that got a 5.0. It deserved it. And it set the anchor to make me believe that all those 1.7-1.8 grants were placed exactly right.)


  5. Rob Says:

    @qaz
That was not my point, or at least not my intended point. Score compression happens despite attempts to prevent it. It will happen under the new system just as it did under the old system. I’ve been to study sections where half of the grants that were scored ranged from 1.7-1.9.


  6. DK Says:

    at least they have to make a decision based on something related to the content of the application.
More precisely, they will make a decision based on whatever they feel like. It may have something to do with the content, it may be a name that rings a bell, or maybe the mood they woke up in today. Might as well throw dice. Dice sound a lot more objective to me.


  7. DrugMonkey Says:

    More precisely, they will make a decision based on whatever they feel like.
    Oh grow up. Program staff are scientists and professionals. Yes, they are people subject to the foibles and failings common to the species. But they have a job to do and they do it. Just like anyone else.


  8. DK Says:

    Program staff are scientists and professionals.
That does not mean that they are objective.


  9. whimple Says:

    That does not mean that they are objective.
    But since nobody else is objective either, lack of objectivity isn’t much of a critique.


  10. Eli Rabett Says:

NIH panels may be too large, but NSF and NASA panels explicitly rank-order proposals.


  11. Paul Murray Says:

    “If NIH wanted to fix all those biases, they should do it explicitly, not through this backhanded mess with this new scoring system.”
    Seems to me that the goal is to spend the money in a worthwhile way. If one person drops out owing to a 21 instead of a 20 well – meh – there’s plenty of “20” projects that are well worth spending the money on. If this 21 project was so damn good, then it would have ranked better.
    But, yes: “OMG, teh num3er5!”. Some sort of preferential voting system might be better. There are several systems to choose from, each of which produces slightly different results when the differences between the candidates are marginal.


  12. qaz Says:

    Paul #11 – The problem is that the goal is to do more than “to spend money in a worthwhile way”. Study sections do fine at a tactical level: is grant X more likely to produce impact than grant Y? But study sections are terrible at strategy: are we starving the next generation of scientists? are we spending enough money on the right big questions?
    DM’s suggestion at the top post was that NIH’s goal was to cluster the scores so that Program could steer the money based on more strategic considerations. For example, issues of newbies getting shut-out, large labs getting over-funded, bunny-hopping study-sections keeping money from teh Translational research, etc.
    If all they wanted was to spend money on science, the old system was doin’ just fine.


  13. whimple Says:

    Study sections do fine at a tactical level: is grant X more likely to produce impact than grant Y?
    They do fine at this? Oh good. Is there any evidence to back this up, or do we all just feel this to be true? Just sayin’ 🙂


  14. Carolyn Says:

    Program staff do strive to be objective, but you had better believe there are political pressures on them also of various kinds. If they have the choice between funding application 1 from Obscure University in the Boondocks versus application 2 from Famous Ivy League University Whose Graduates Are High Up in Your Institute’s Programs and Whose Researchers are Very Aggressive and Who Are Frequently Consulted by Your Institute and Who Will Make a Lot of Phone Calls to Highly Placed Individuals if Their Grant Does Not Get Funded, then we pretty much can guess which one will fare better.


  15. travc Says:

    This may be a dumb question since I know little of the NIH grant process (my wife does NIH, I’ve only been involved with NSF, DARPA, ONR, and such).
    Why use the absolute score at all? Wouldn’t using the distribution be a lot simpler and more reliable? AKA normalization. Am I just completely confused how the system works?


  16. DrugMonkey Says:

    Am I just completely confused how the system works?
Perhaps. In the normal, run-of-the-mill review, a grant will be reviewed in a standing study section that meets three times per year. There’s a lot of variation in review load, but let’s say a panel reviews 100 R01 proposals per round.
A given application is assigned a priority score in the meeting. This score is then tossed into the entire pool of scores for all R01s evaluated in that panel, for the current round and for the two previous rounds (giving a rolling year of proposals). This total population of scores is then used to assign a percentile to the application. It is the percentile that then becomes the all-important ranking as Program seeks to determine what gets funded.
    So scores are normalized within study section and within a rolling one-year interval.
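In code, the pooling DM describes works roughly like this (a simplified sketch; the exact percentile arithmetic and bookkeeping here are assumptions, not the official CSR procedure):

```python
# Sketch of rolling-pool percentiling within one study section: a score is
# percentiled against everything the same panel scored in the current round
# plus the two previous rounds. The percentile rule is a simplification.
import random
from collections import deque

class StudySection:
    def __init__(self):
        self.rounds = deque(maxlen=3)        # rolling one-year window of rounds

    def add_round(self, scores):
        self.rounds.append(list(scores))     # oldest round drops off automatically

    def percentile(self, score):
        pool = [s for rnd in self.rounds for s in rnd]
        better = sum(s < score for s in pool)            # lower score = better
        tied = sum(s == score for s in pool)
        return round(100 * (better + 0.5 * tied) / len(pool))

# Hypothetical usage: three rounds of ~100 scored R01 applications each.
random.seed(0)
panel = StudySection()
for _ in range(3):
    panel.add_round(random.choices(range(10, 91), k=100))
print("a 20 in this panel lands at the", panel.percentile(20), "percentile")
```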


  17. Karl Says:

    I complained about the new scoring system from the beginning, but CSR doesn’t seem to care about reviewers’ opinions. While an important part of the argument for the new system was that reviewers are not able to score grants with much precision, I think a key consideration is: if you have 25 imprecise measurement devices, should you round their results and then average, or should you average and then round? The new system does the former, but the latter is clearly better.
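Karl’s round-first versus average-first point is easy to check with a toy simulation (the reviewer noise model below is invented for illustration): when reviewers’ underlying judgments differ by less than a full point, rounding each one to an integer before averaging throws away resolution that averaging first would keep.

```python
# Toy simulation of the round-then-average vs average-then-round question;
# the noise model is invented. 25 reviewers each perceive an application's
# quality on a continuous 1-9 scale with a bit of noise. Procedure A rounds
# each reviewer's score to an integer and then averages (what Karl says the
# new system does); procedure B averages the raw scores and then rounds.
import random
import statistics

random.seed(2)
N_REVIEWERS, NOISE_SD, TRIALS = 25, 0.25, 2000

def one_application(true_quality):
    raw = [random.gauss(true_quality, NOISE_SD) for _ in range(N_REVIEWERS)]
    round_then_avg = statistics.mean(min(9, max(1, round(r))) for r in raw)
    avg_then_round = round(statistics.mean(raw) * 10) / 10    # final 0.1 grid
    return round_then_avg, avg_then_round

sq_err_a, sq_err_b = [], []
for _ in range(TRIALS):
    q = random.uniform(2.0, 5.0)
    a, b = one_application(q)
    sq_err_a.append((a - q) ** 2)
    sq_err_b.append((b - q) ** 2)

print("RMSE, round then average:", round(statistics.mean(sq_err_a) ** 0.5, 3))
print("RMSE, average then round:", round(statistics.mean(sq_err_b) ** 0.5, 3))
# With reviewer disagreement smaller than the 1-point step, pre-rounding
# loses resolution; with much noisier reviewers the gap mostly disappears.
```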



  18. […] means that the top half of the proposals may have an overall score of three to four (out of nine). DrugMonkey blogged about the problems in score compression last year, so I refer you to that discussion. In a […]



  19. […] If you list a lot of weaknesses it will be hard to defend a "3", for example. In order to prevent score compression, you are encouraged to consider a 5 as average. We were all told that if we score something better […]


  20. Steve Says:

Nothing has changed. It’s ridiculous.


