On SROs and child soccer coaches
November 25, 2025
As someone who once spent much of his life watching kid soccer games, from the herding-cats stage onward, I have many anecdotes. Anecdotes about why I should never coach kids’ sports, and about why the people who do coach kids are saints who deserve many thanks from the parents of their players. But one of my favorite vignettes was only fully fleshed out in retrospect. The coach of one of my kids’ teams would often exhort the players to “anticipate!”. The players were somewhere around late elementary school, perhaps early middle school, age at the time.
And indeed, as a spectator it was frustrating that some of the players, perhaps most, did not seem to be able to read the flow of the game. This is critical in passing sports such as soccer. Running to where the ball should be, instead of reacting to a pass after it was kicked, was apparently the main goal of the exhortation.
“Anticipate!“, the coach would shout.
Several years later I was chatting with one of the other parents and she said her player went a good year and a half on this team before finally telling her that they did not know what “anticipate” meant! The coach was not a dumb man, in fact quite the contrary. But he never stopped to think that perhaps he needed to be very specific about what his pleas for anticipation meant, and to coach the little minions in how to do the thing he expected.
I think the Scientific Review Officers of the Center for Scientific Review of the National Institutes of Health could perhaps learn something from this.
I located this CSR Pilot study when hunting for an illustration of the way that “consensus” on study sections can play havoc with percentile ranks. It led with an unmistakable depiction of score clustering, in this case induced by the whole-integer scoring system and the cultural pressure within study sections to “reach consensus”. Reviewers can only use whole integers from 1 to 9 (lower is better) and the mean is then multiplied by ten to get the voted overall impact score. Reviewers typically vote within the range recommended by the three assigned reviewers after the discussion, and thus if they “reach consensus” on 2s, 3s or 4s, there end up being a lot of 20, 30 and 40 voted scores. As depicted here in three rounds from 2016.
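For the uninitiated, a toy simulation makes the mechanism concrete. Every number below is made up and the consensus behavior is a caricature, but whole-integer votes cast within the assigned reviewers’ recommended range are enough to stack the voted scores at 20, 30 and 40:

```python
import random
from collections import Counter

random.seed(1)

def voted_score(panel_size=25):
    """Toy model of one application's voted overall impact score (all numbers made up)."""
    low = random.choice([2, 2, 3, 3, 3, 4])             # integer the assigned reviewers settle near
    high = low if random.random() < 0.7 else low + 1     # 70% of the time they fully "reach consensus"
    votes = [random.randint(low, high) for _ in range(panel_size)]  # panel votes within that range
    return round(10 * sum(votes) / panel_size)            # mean of the whole-integer votes, times ten

scores = [voted_score() for _ in range(500)]
print(sorted(Counter(scores).items()))   # big spikes at exactly 20, 30 and 40; little mass elsewhere
```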
I cannot remember ever having been on an NIH study section convened by the CSR that did not at one point or another experience the SRO urging reviewers to spread scores. It started with my very first ad hoc invitation, where the SRA (the job title at the time was Administrator; now it is Officer) circulated the score distributions from the past dozen or so rounds with a set of lengthy comments about why clustering scores around the perceived payline was bad. In those days the scoring was 1.0-5.0 (lower was better), and the voted average would be multiplied by 100 (in contrast with the current 1-9, multiplied-by-10 scheme). We’d be told that the ten 20-point bins from 100 to 299 should have an “ideal” distribution of equal numbers of proposals in each bin. Of course there were too few in the first two bins and 2-3 times the “ideal” number in the 160-179 and 180-199 bins, or something like that, in the SRA’s data report. And we’d be told how a score that would be a 20%ile in a flat distribution would creep up to a 25%ile due to score clustering/compression. We were shown that if one plotted the voted scores against the resulting percentile, the slope diverged from the ideal. The good grants got worse scores than they should have, with none being awarded perfect scores of 100 (remember, this is a percentile base, i.e. all three rounds). Somewhere above the 22%ile, scores were better than they should have been, and too many proposals were stacking up in that 180-190 zone.
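The percentile creep comes from ties. The sketch below is not the official NIH percentiling procedure, just the common convention of handing tied scores the percentile of their average rank, but it shows how a proposal that would sit around the 20%ile with spread scores gets dragged toward the 25%ile once it lands in a consensus clump:

```python
def percentiles(scores):
    """Assign each distinct score a shared percentile based on the average rank
    of the applications holding it (lower score = better; toy convention only)."""
    n = len(scores)
    ordered = sorted(scores)
    out = {}
    for s in set(scores):
        ranks = [i + 1 for i, v in enumerate(ordered) if v == s]
        out[s] = 100 * (sum(ranks) / len(ranks) - 0.5) / n
    return out

n = 100
spread = list(range(1, n + 1))                     # every application gets a unique score
# compressed: the applications ranked 15th through 35th all end up tied on one voted score
compressed = spread[:14] + [30] * 21 + spread[35:]

print(percentiles(spread)[20])       # 19.5 -- the 20th-best application with spread scores
print(percentiles(compressed)[30])   # 24.5 -- the same application once it is stuck in the tie
```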

I have never received such detailed instruction about score compression from any other SRO and I have no idea to what extent this was the SRO’s personal approach or general to the times. I was new.
Still, the SRO encouragement to spread scores has remained invariant right up to the present day.
I am wondering today if this consistent trend is in part because nobody tells reviewers how to spread scores. Perhaps we reviewers would like to do what we are asked but we just don’t know how to do it.
There are a lot of internal tendencies and assumptions about scoring. There are psychological factors about scoring, particularly acute because reviewers are recipients of reviews. It’s tough to receive a very bad score and tough to hand out worse* ones. There are cultural factors in the discussions of study sections that shape behavior**. And the above graphs have more features that tell the tale.
Reviewers tend to cling to “perfect” scores as if they are precious jewels they personally own and refuse to award them to proposals. Sure, we see the tied-1s uptick in the first graph, but the infrequency of anything below a 15 or so, maybe below a 20, is striking. Just like my depiction in the second graph of that older state of affairs. There is a huge drop-off in the first graph after voted scores of 50, and it is clear that review panels agreeing on 7s, 8s or 9s does not happen. Those terrible initial scores might get used, but only on proposals tossed into the triage bin (spoiler: the 9 is almost never seen and the 8 is quite rare); there is no reason a study section cannot decide to use the full range after triage. No reason they couldn’t have their 50%ile grants garnering 90s. But this would require some very explicit instruction. Anticipate!
I may have over-interpreted what the SRA was telling me in the lead-up to my first meeting. I decided that if I put the best grant in my pile at 1.0, the worst at 5.0, and used even intervals for the rest in between, this would be a way to spread the scores properly.
I came to believe*** everyone should adopt this strategy.
Use the full range for best and worst, and apply even intervals. As a scoring baseline. Then, if you really convinced yourself the best wasn’t really deserving, or the worst was better than garbage, adjust. Or if some of the in-between ones needed to be moved closer together, fine, do that. But the key was to start off highly spread and closely consider violations of spreading on…well, the merits.
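As a sketch of that baseline, on the old 1.0-5.0 scale and with a hypothetical pile of nine applications already ranked best to worst:

```python
def baseline_scores(ranked_pile):
    """Spread-first baseline on the old 1.0-5.0 scale: best application gets 1.0,
    worst gets 5.0, everything in between at even intervals (lower is better)."""
    n = len(ranked_pile)
    if n == 1:
        return {ranked_pile[0]: 1.0}
    step = (5.0 - 1.0) / (n - 1)
    return {app: round(1.0 + i * step, 1) for i, app in enumerate(ranked_pile)}

# Hypothetical pile of nine applications, already ranked best to worst.
pile = ["app_A", "app_B", "app_C", "app_D", "app_E", "app_F", "app_G", "app_H", "app_I"]
print(baseline_scores(pile))
# {'app_A': 1.0, 'app_B': 1.5, 'app_C': 2.0, ... 'app_I': 5.0} -- starting points only;
# individual scores then get adjusted up or down on the merits.
```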
I concluded my scheme was unlikely to give out that many 100s if everyone did it: how often would all three assigned reviewers agree, through discussion, that a proposal was the best in their piles, with everyone else concurring? Once or twice a year for the entire panel, I bet. And if reviewers did agree, what was the harm in using the full extent of the range by handing out “perfect” scores once or twice a year? It isn’t as if NIH ICs decide to give the applicant extra money just because they received a perfect score, instead of the same percentile with a sub-perfect score. The R37 MERIT extension may be reserved for low single-digit percentiles, but I’ve never heard that it required a perfect score at any IC.
The proposals in the middle of one person’s pile inevitably would have some variability across reviewers, but that was the fodder for discussion and panel voting. Just like usual. It just seemed as though the right-hand part of the Actual line above would be pulled straighter, into better agreement with the flat distribution, by radical score-interval spreading.
Maybe there are other concrete strategies that would help review panels to spread scores. My approach is but one suggestion. But this does require thinking about how and why panels devolve into certain patterns, and how to help reviewers overcome these motivations. They need permission to hand out perfect scores (when, bizarrely, CSR does the opposite, telling us to reserve the 1 for a bestest-lifetime-ever grant). They need permission and emotional support for handing out 7s, 8s and 9s to voted, non-ND, proposals. They need permission to do some sort of forced-choice ranking within their piles.
And above all else, they need feedback, using the actual voting data of the panel to show them what is actually being done with respect to scoring.
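That feedback would not need to be fancy. A minimal sketch, with invented voted scores on the current 10-90 scale, could simply bin the panel’s votes and print each bin’s count next to a flat “ideal”:

```python
from collections import Counter

# Invented voted overall impact scores (current 10-90 scale, lower is better) for one round.
voted = [20, 20, 21, 25, 30, 30, 30, 31, 33, 35, 38, 40, 40, 41, 45, 50, 52, 20, 30, 40]

bins = Counter((s // 10) * 10 for s in voted)    # 20 stands for the 20-29 bin, and so on
ideal = len(voted) / 9                           # a flat spread puts equal counts in each of the 9 bins

for lo in range(10, 100, 10):
    print(f"{lo}-{lo + 9}: {bins.get(lo, 0):2d} voted (flat ideal ~{ideal:.1f})")
```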
*Back in those days there were no criterion scores. An ND grant did not get any indication of whether it was a just-missed 289 or a 499 disaster. Still, the other members of the panel would know if some reviewer was giving out 5.0s.
**I would hear literal gasps around the table when I said my pre-discussion score was a 1.0.
***I gave up this approach after I left empaneled service. Because, say it with me now, peer review of grants is a communication. And when you are ad hoc, there is minimal time for a substantial number of people on the panel to understand the communication if you use an unfamiliar dialect.
