NIH/CSR reverts to random-order grant review

September 13, 2019

A tweet from Potnia Theron alerts us to a change in the way the CSR study sections at NIH will review grants for this round and into the future. A tweet from our good blog friend @boehninglab confirms a similar story from the June rounds. I am very pleased.

When I first started reviewing NIH grants, the grants were ordered for discussion in what appeared to be clusters assigned to the same Program Officer, or at least to a given Program Branch or Division. This was back when a substantial number of the POs would attend the meeting in person to follow the discussion of the grants they might be interested in funding. Quite obviously, it would be most efficient and reasonable for an SRO to be able to tell a PO that their grants would all be discussed in, e.g., a contiguous two-hour interval instead of being scattered randomly across a two-day meeting.

Importantly, this meant that grants were not reviewed in any particular order with respect to pre-meeting scores and the grants slated for triage were ordered along with everything that was slated to be discussed.

When we shifted to reviewing grants in ascending order of preliminary score (i.e., best to worst) I noticed some things that were entirely predictable*. These things had a quelling effect on score movement through the discussion process, for various reasons. Now, I do say “noticed”. I have not seen any data from the CSR on this and would be very interested to see some before/after data for the prior change and for the current reversion. So I cannot assert any strong position that my perceptions are indeed valid.

This had the tendency to harden the very best scores. Which, btw, were the ones almost guaranteed to fund, since this came along during a time of fixed budgets and plummeting paylines. Still, the initial few projects were not as subject to…calibration…as they may have been before. When you are facing the first two proposals in the round, it’s easy for everyone to nod along with the reviewers who are throwing 2s and saying the grant is essentially perfect. When you get such a beast on day 2, when you’ve already battled through a range of issues, it’s more likely someone is going to say “yeah but whattabout….?”

It’s axiomatic that there is no such thing as an unassailable “perfect” grant proposal. Great scores arise not because the reviewers can find no flaws but because they have chosen to overlook or downplay flaws that might have been a critical point of discussion for another proposal. The way the NIH review works, there is no re-visitation of prior discussions just because someone realizes that the issue being used to beat the heck out of the current application also applied to the one discussed five grants ago that was entirely downplayed or ignored. This is why, fair or not, discussion tends to get more critical as the meeting goes on. So in the old semi-random order, apps that had good and bad preliminary scores were equally subject to this factor. In the score-ordered era, the apps with the best preliminary scores were spared this effect.

Another factor which contributed to this hardening of the preliminary score order is the “why bother?” factor. Reviewers are, after all, applicants, and they are sensitive to the perceived funding line as it pertains to the scores. They have some notion of whether the range of scores under current discussion means “this thing is going to fund unless the world explodes”, “this thing is going to be a strong maybe and is in the hunt for gray zone pickup” or “no way, no how is this going to fund unless there is some special back scratching going on”. And believe you me, they score accordingly despite constant admonishment to use the entire range and that reviewers do not make funding decisions™.

When I was first on study section the SRO sent out scoring distribution data for the prior several rounds and it was awesome to see. The score distribution would flatten out (aka cluster) right around the operative perceived score line at the time. The discussions would be particularly fierce around that line. But since an app at any given score range could appear throughout the meeting there was motivation to stay on target, right through to the last app discussed at times. With the ordered review, pretty much nothing was going to matter after lunch on the first day. Reviewers were not making distinctions that would be categorically relevant after that point. Why bother fighting over precisely which variety of unfundable score this app receives? So I argue that exhaustion was more likely to amplify score hardening.
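The hardening argument above can be put in the form of a toy simulation. To be clear, this is not CSR data: the 1–9 score scale is real, but the number of apps, the size of the “criticality drift” penalty, and the `simulate` helper are all invented for illustration. The sketch just shows that if later discussion slots attract more criticism, score ordering concentrates that effect on the worst preliminary scores and spares the best ones.

```python
import random

random.seed(0)

def simulate(order_by_score, n_apps=40, drift=0.02):
    """Return mean score movement for the best third of preliminary scores.

    Toy model: each app starts from a preliminary score (lower is better on
    the NIH 1-9 scale) and discussion adds a penalty that grows with the
    app's position in the meeting ("criticality drift"). All numbers invented.
    """
    prelim = [random.uniform(1.0, 9.0) for _ in range(n_apps)]
    if order_by_score:
        # ascending preliminary score: best apps discussed first
        order = sorted(range(n_apps), key=lambda i: prelim[i])
    else:
        # semi-random order: position unrelated to preliminary score
        order = random.sample(range(n_apps), n_apps)
    movement = {i: drift * pos for pos, i in enumerate(order)}
    best = sorted(range(n_apps), key=lambda i: prelim[i])[: n_apps // 3]
    return sum(movement[i] for i in best) / len(best)

print("top-third movement, score order: %.2f" % simulate(True))
print("top-third movement, random order: %.2f" % simulate(False))
```

Under score order the best third always occupies the earliest slots, so their average movement is small and fixed; under random order they absorb, on average, the same late-meeting drift as everyone else.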

I don’t have any data for that but I bet the CSR does if they would care to look.

These two factors hit the triage list in a double whammy.

To recap, anyone on the panel (and not in conflict) can request that a grant slated not to be discussed be raised for discussion. For any reason.

In the older way of doing things, the review order would include grants scheduled for triage; the Chair would come to one, note that it was triaged, and ask if anyone wanted to discuss it. Mostly, everyone would just enter ND on the form and go on to the next one. Sometimes, however, a person wanted to bring an application up out of triage and discuss it.

You can see that the psychology of pulling an application up would differ if it came up as, say, the third proposal on the first day versus being scheduled last in the meeting on day 2, when everyone is eager to rush to the airport.

In the score order way of doing things, this all came at the end. When the mind of the reviewer was already on early flights and had sat through many hours of “why are we discussing this one when it can’t possibly fund”. The pressure not to pull up any more grants for discussion was severe. My perception is that the odds of being pulled up for discussion went way, way, way down. I bet CSR has data on that. I’d like to see it.

I don’t have full details on whether the new review order policy will include triaged apps or be some sort of hybrid. But I hope it returns to scheduling the triaged apps right along with everything else so that they have a fairer chance of being pulled up for discussion.


*and perhaps even intentional. There were signs from Scarpa, the CSR Director at the time, that he was trying to reduce the number of in-person meetings (which are very expensive). If scores did not change much (and if the grants selected for funding did not change) between the pre-meeting average of three people and the eventual voted score, then meetings were not a good thing to have. Right? So a suspicious person like myself immediately suspected that the entire goal of reviewing grants in order of initial priority score was to “prove” that meetings added little value by taking structural steps to reduce score movement.

3 Responses to “NIH/CSR reverts to random-order grant review”

  1. qaz Says:

    In some of the study sections I have been on, they started with three or four applications to set the range. These were grants where the reviewers were in agreement, so they were likely to be good examples of a given score. Generally, one or two that were great, and one or two that were not. And then as we went through, we kept referring back to those first grants.

    However, this was much more common back when part of the goal of study section was to help people do better science. (Providing criticism of the experimental design. Much like people are arguing that the pre-registration of experiments allows peer-review of experimental design before running a lot of subjects. Back then, grants were 25 pages and included explicit details of reagents, methods, and the like. There was a lot less “they’ll figure it out” done then. [I like the goal of helping people do better science, but the criticism did often end up being silly in its detail.]) In these study sections, scores were meant as a message to both program AND the applicant. Now that scores are supposed to be just for program, there is a lot less care taken on the lower end of the scores, because who cares when it’s below fundability? (Yes, I know applicants use scores as information, but they’re not well-designed for that anymore. As evidenced by the nonlinearity and the whole triage process.)

    I think the real issue here is neither of these things, however. The real issue is the “muffin memo”, which prevents NIH from supplying food (or coffee!) to study sections. Which means that the longer you go, the hangrier the committee is. Which can have severe effects on an application’s score. By going in scoring order, the hardest grants to score (the ones on the gray-area cusp) are the ones in the mid-morning when everyone is hangriest.

    In my experience with random order, it doesn’t help people stay on point or awake. We still saw that people would rush through grants when they wanted to run to the airport. What this did was push grants to the extremes based on rushed judgements. It wasn’t that they got worse, but that nuance went away. What random order does is add noise to the later grants, which means some people just get unlucky.

    PS. On all the study sections I’ve been on recently, the decision to remove from triage is done first, before any reviews are started. They are done as a block. Certainly, people can later call a grant up from triage, but that is very rare. Generally, in my experience, grants are called up from triage when there is a disagreement between the three reviewers or when one reviewer is unsure of a complaint in their review (is this a real problem or am I being unfair? can I get the committee’s opinion?).


  2. drugmonkey Says:

    What random order does is add noise to the later grants, which means some people just get unlucky.

    Ummm, and some people just get lucky.

    My point is that if there are indeed factors of late-meeting rushing or low blood sugar or whatever, shouldn’t they affect everyone with equal probability? Doesn’t the score-order way harden the effects of three assigned reviewers? Personally I think that is a bad thing. The whole point of having a panel discussion and a panel vote is to make sure the assigned reviewers are not the sole determinant of grant merit/rank.

    Generally, in my experience, grants are called up from triage when there is a disagreement between the three reviewers or when one reviewer is unsure of a complaint in their review (is this a real problem or am I being unfair? can I get the committee’s opinion?).

    I agree and I think there should be more of it. And less of those lengthy discussions of grants where the pre-meeting range is only 1 point or 0.


  3. qaz Says:

    It’s an interesting question of whether these other effects that add noise (late meeting rush, low blood sugar) should affect all people equally. Given the theory that reviewers are providing signal plus noise, shouldn’t study section order take that signal into account?

    I would argue that if you are concerned that the reviewers are not the sole determinant of grant merit/rank that we should have a lot more than three reviewers per grant. I think that would be a more effective fix than changing grant order.
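    The more-reviewers point is just the familiar shrinkage of the standard error of a mean. A back-of-envelope sketch, assuming each reviewer reports the grant's true merit plus independent noise (the per-reviewer SD of 1.0 on the 1–9 scale is an invented number, purely illustrative):

```python
import math

reviewer_sd = 1.0  # assumed per-reviewer noise (SD), invented for illustration

def standard_error(n_reviewers, sd=reviewer_sd):
    # SE of the mean of n independent, equally noisy reviewer scores
    return sd / math.sqrt(n_reviewers)

for n in (3, 5, 10):
    print(f"{n:2d} reviewers -> SE of mean score ~ {standard_error(n):.2f}")
# 3 -> 0.58, 5 -> 0.45, 10 -> 0.32
```

    Going from three to ten reviewers roughly halves the noise in the panel's estimate, which is the sense in which adding reviewers could beat reshuffling the discussion order.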

    Re: triage – the question is whether the role of grant review is to determine fundability or to provide feedback to the applicant. If the role of grant review is to determine fundability, then there are very few triaged grants that are going to change scores enough to become fundable. If the goal is to communicate information to the applicant, then we shouldn’t have triage and scores need to be linear.

    Also, in the modern system I think there’s a lot of hidden scoring effects. What proportion of the study section agrees with the reviewer saying 2 and what proportion agrees with the reviewer saying 3 can change funding likelihoods a lot. Similarly, I’ve seen my share of grants where all three reviewers give a score (say 2) and the veterans on the study section say that it’s not consistent with their typical scoring and “vote out of range”. So I think discussion of cases where there’s little or no difference between reviewers can still show large changes after discussion.

