It’s Uninterpretable!

August 6, 2020

No, it isn’t.

One of my favorite species of manuscript reviewer comment is the claim that the data we are presenting are “uninterpretable”. Favorite as in the sort that gets the reaction where I can’t believe my colleagues in science are this unbelievably stupid and are not completely embarrassed to say any such thing, ever.

“Uninterpretable” is supposed to be some sort of easy-out Stock Critique, I do understand that. But it reveals either flagrant hypocrisy (i.e., the reviewers themselves would fall afoul of such a criticism with some frequency) or a serious, serious misunderstanding of how to do science.

Dr. Zen is the latest to run afoul of my opinion on this. He posted a Tweet:

and then made the mistake of bringing up the U word.

(his followup blog post is here)

Now, generally when I am laughing at a reviewer comment, it is not because they are using “uninterpretable” to complain about graphical design (although this occasionally comes into the mix). They usually mean they don’t like the design of the experiment(s) in some way and want the experiment conducted in some other way, or the data analyzed in some other way (graphical design issues come in here), OR, most frequently, they want a whole bunch of additional experiments.


“If the authors don’t do this then the data they are presenting are uninterpretable” – Reviewer #3. It’s always Reviewer #3.

Let me address Zen’s comment first. It’s ridiculous. Of COURSE the graph he presented is interpretable. It’s just that we have a few unknowns and some trust. A whole lot of trust. And if we’ve lost that, science doesn’t work. It just doesn’t. So it’s ridiculous to talk about the case where we can’t trust that the authors aren’t flagrantly disregarding norms and lying to us with fake data. There’s just no point. Oh, and don’t forget that Zen construed this in the context of a slide presentation. There just isn’t time for minutiae and for proving beyond any doubt that the presenter/authors aren’t trying to mislead with fakery.

Scientific communication assumes some reasonable common ground, particularly within a subfield. This is okay. When there is cross talk between fields with really, really different practices, okay, maybe a little extra effort is needed.

But this is a graph using a box-and-whiskers plot. This is familiar to the audience and indeed Zen does not seem to take issue with it. He is complaining about the exact descriptive-statistic conventions used in this particular box-and-whiskers plot. He is claiming that if these are not specified, the data are “uninterpretable”. NONSENSE!

These plots feature an indicator of the central tendency of a distribution of observations, and an indicator of the variability in that distribution. Actually, most descriptive illustrations in science tackle this task. So... it’s familiar. This particular type of chart gives two indications of the variability: a big one and a small one. This is baseline knowledge about the chart type and, again, is not the subject of Zen’s apparent ire. The line is the central tendency. The box outlines the small indicator and the whiskers outline the big indicator. From this we move into interpretation that is based on expectations. Which are totally valid to deploy within a subfield.

So if I saw this chart, I’d assume the central tendency was most likely a median or a mean. Most likely the median, particularly if the little dot indicates the mean. The box then outlines the interquartile range, i.e., the 25%ile and 75%ile values. If the central tendency is the mean, then it is most likely that the box outlines plus or minus one standard error of the mean or one standard deviation. Then we come to the whiskers. I’d assume they depict either the 95% confidence interval or the range of values.
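To make that concrete, here is a minimal sketch (Python/matplotlib, my own choice of tools, with made-up numbers; nothing from Zen’s example) that draws the same distribution under a few of the plausible whisker conventions. The qualitative story the chart tells barely moves from panel to panel.

```python
# A sketch only: the same fabricated data, three different whisker conventions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)  # hypothetical observations

fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)

# Default matplotlib convention: median line, box = interquartile range,
# whiskers extend to the last point within 1.5 * IQR; marker = mean.
axes[0].boxplot(data, showmeans=True)
axes[0].set_title("whiskers: 1.5 * IQR")

# Whiskers at the 5th and 95th percentiles.
axes[1].boxplot(data, whis=(5, 95), showmeans=True)
axes[1].set_title("whiskers: 5th-95th %ile")

# Whiskers spanning the full range of the observations.
axes[2].boxplot(data, whis=(0, 100), showmeans=True)
axes[2].set_title("whiskers: range")

plt.tight_layout()
plt.show()
```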

I do NOT need to know which of these minor variants is involved to “interpret” the data. Because scientific interpretation functions along a spectrum of confidence in the interpretation. And if differences between distributions (aha, another ready assumption about this chart) cannot be approximated from the presentation then, well, it’s okay to delve deeper. To turn to the inferential statistics. As for whether the small indicator is SD or SEM? Meh, we can get a pretty fair idea (the SEM is just the SD divided by the square root of the sample size, after all). If it isn’t the SD or SEM around a mean, or the 25%ile/75%ile around a median, but something else like 3 SEM or 35/65? Well, someone is doing some weird stuff trying to mislead the audience, or is from an entirely disparate field. The latter should be clear.

Now, of COURSE, different fields might have different practices and expectations. Maybe it is common to use 5 standard deviations as one of the indicators of variability. Maybe it is common to depict the mode as the indicator of central tendency. But again, the audience and the presenter are presumably operating in approximately the same space and any minor variations in what is being depicted do not render the chart completely uninterpretable!

This is not really any different when a manuscript is being reviewed and the reviewers cry “Uninterpretable!”. Any scientific paper can only say, in essence, “Under these conditions, this is what happened”. And as long as it is clear what was done and what the nature of the data is, the report can be interpreted. We may have more or fewer caveats. We may have a greater or smaller space of uncertainty. But we can most certainly interpret.

It sometimes gets even worse and more hilarious. There is this common situation where we present data in which the error bars are smaller than the (reasonably sized) symbols for some (but not all) of the groups. And we may have cases where the not-different (by inferential stats *and* by any rational eyeball consideration of the data at hand) samples cannot be readily distinguished from each other (think: overlapping longitudinal or dose curves).

“You need to use color or something else so that we can see the overlapping details or else it is all uninterpretable!” – Reviewer #3.

My position is that if the eye cannot distinguish any differences, this is the best depiction of the data. What *is* an error is presenting data in a way that gives some sort of artificial credence to a difference that is not actually there, based on the stats, the effect size and a rational understanding of the data being collected.

Status

July 30, 2015

This moment

January 19, 2011

We have some great stuff in the hopper.

Immediate data flow is rocking, I get PI crack updates every few days that are fun and fascinating.

Middle term, the projects themselves are on track and doing what they are supposed to be doing, i.e., turning up unexpected leads for more studies. Moar! I say!

Long term, we have almost a programmatic effort going on for at least two things I’ve been working on for a couple, three years.

The last is my *job*, of course. That’s what the PI is supposed to be doing. Getting the several year plan harmonized into a program of investigation. Lining up the people, of course. And the funding. And wrangling the local institution into doing what you need it to do.

The data-crack is undeniable. It is a powerful and immediate reinforcer. So are new hypotheses and unexpected results that need to be figured out, investigated and nailed down with more data.

But the *programmatic* successes?

Sublime.

I like my job quite a lot today.

This is amazing. Strike that, AMAZING!

A paper published in PLoS ONE by Martin and colleagues examines the fate of R01 applications reviewed in 61 of the 172 standing study sections convened by the Center for Scientific Review of the NIH in a single round (the January 2009 Council round: submitted Jun-Jul 2008 and reviewed in Oct-Nov 2008).

It is going to take me a bit to go through all the data but let’s start with Figure 1. This plots the preliminary scores (the average of the ~3 assigned reviewers) against the final priority score voted by the entire panel.

Figure 1. Average Preliminary Score versus SRG Final Priority Score. Preliminary Scores represent the average of the independent R01 priority scores given by the three assigned reviewers; the final priority score is the average of all the scores given by the voting members of the panel. Each data point represents the outcome for one R01 application. The difference between preliminary and final priority scores represents the change between the two values. Applications with differences displayed on the left declined after discussion; those on the right improved. doi:10.1371/journal.pone.0013526.g001
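If you want to poke at this kind of relationship with your own numbers, here is a minimal sketch (Python/matplotlib, entirely fabricated scores; not the Martin et al. data and not necessarily their score scale) of how a preliminary-versus-final scatter in the spirit of Figure 1 gets put together.

```python
# Fabricated illustration of a preliminary vs. final priority score scatter.
# These are NOT the Martin et al. data; score units are arbitrary.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_apps = 300

# Preliminary score: average of ~3 assigned reviewers' scores.
preliminary = rng.uniform(10, 90, size=n_apps)
# Final priority score: full-panel average, drifting somewhat after discussion.
final = preliminary + rng.normal(0, 5, size=n_apps)

plt.scatter(preliminary, final, s=8, alpha=0.5)
plt.plot([10, 90], [10, 90], color="gray", lw=1)  # no-change diagonal
plt.xlabel("Average preliminary score (assigned reviewers)")
plt.ylabel("Final priority score (full panel)")
plt.title("Hypothetical data for illustration only")
plt.show()
```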

The first and most obvious feature is the tendency for discussion to make the best scores (lowest in the NIH scoring system) more extreme. I would suggest that this results from two factors. First, reviewers are reluctant (in my experience) to assign the best possible score prior to discussion. I don’t understand this personally, but I guess I can grasp the psychology. People have the idea that perfection exists out there in some application and they want to reserve some room so that they can avoid having ever awarded a perfect score to a lesser application. Silly, but whatever. Once discussion starts and everyone is nodding along approvingly it is easier to drift to a more perfect score.

Second, there is a bit of the old “Fund that puppy NOW!” going on. Particularly, I would estimate, for applications that were near misses on a prior version and have come back in review. There can be a tendency to want to over-emphasize to Program staff that the study section found the application to be in the must-fund category.

__
Martin MR, Kopstein A, Janice JM (2010) An Analysis of Preliminary and Post-Discussion Priority Scores for Grant Applications Peer Reviewed by the Center for Scientific Review at the NIH. PLoS ONE 5(11): e13526. doi:10.1371/journal.pone.0013526

I have a trainee running a study in which she is examining the effects of methamphetamine on Bunny Hopping using the established open-field-to-hedgerow assay. The primary dependent variable is escape latency, from stimulus onset to crossing the plane of the hedge.

She is examining the effects of a locomotor stimulant dose of methamphetamine, derived from her pilot dose-response study, versus vehicle in groups of Bunnies that have been trained for six weeks in our BunnyConditioning Model and in age-matched sedentary Bunnies. (The conditioning training consists of various sprint, long run, horizontal hop and vertical leap modules.)

So we have four groups of Bunnies as follows:
1. Conditioned, Vehicle
2. Conditioned, Meth
3. Sedentary, Vehicle
4. Sedentary, Meth

The trainee is actually a collaborating trainee and so these data involve the analytic input of multiple PIs in addition to the trainee’s opinion. We are having a slight disagreement over the proper analysis technique, so I thought I would turn to the brilliant DM readers.
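For what it’s worth, the obvious first-pass candidate for a 2x2 design like this is a two-way between-groups ANOVA with the Conditioning x Drug interaction. Here is a minimal sketch (Python with pandas/statsmodels, fabricated latencies, a hypothetical group size of 12) of what that looks like; whether it is the *right* analysis is exactly the disagreement in question.

```python
# A sketch only: two-way (Conditioning x Drug) ANOVA on escape latency.
# All numbers are fabricated; group size and variances are placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)
n_per_group = 12  # hypothetical number of Bunnies per group

rows = []
for conditioning in ("Conditioned", "Sedentary"):
    for drug in ("Vehicle", "Meth"):
        # fabricated escape latencies (s), just so the sketch runs
        for latency in rng.normal(loc=8.0, scale=2.0, size=n_per_group):
            rows.append(
                {"conditioning": conditioning, "drug": drug, "latency": latency}
            )

df = pd.DataFrame(rows)

# Fit the factorial model and report Type II sums of squares.
model = ols("latency ~ C(conditioning) * C(drug)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

(If the latencies turn out skewed, or some Bunnies never reach the hedge and the latency is censored, a rank-based or time-to-event approach would be the other obvious candidate.)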

Data w000tang!!!

August 24, 2010

You know those weeks where every day the data keep getting more and more exciting? Yeah, I’m having one of those, times three… w00t!!!!

As disposable as a Kimwipe

August 13, 2010

Odyssey observed recently that the most disposable resource in the laboratory ought to be… the hypothesis.

Well, I’m getting some pretty cool results from one of my projects right now. It required the application of a couple of technologies in combination, so it took us a while to get it running. I probably came up with the hypothesis three years ago, maybe two.

And now, I’m applying the approach we’ve developed to a slightly different question than originally intended, but close enough for BWAAHAHAHA! purposes. The question is fascinating and more novel anyway so we have a three-fer instead of a two-fer (or something like that).

Trouble is, these fascinating results are calling into question the original hypothesis that I’ve been working toward testing. I have grant proposals written on this stuff!

But you know what? Being possibly wrong on my original hypothesis is no big deal; we’ll just follow on from what our current data are telling us. It will still end up someplace that is interesting.

That’s the beauty of not being obsessed with your theories and hypotheses. In a lot of ways you are much freer this way. You may not waste as much time driving your pet hypothesis straight through the dust and into the bedrock before you realize you were wrong.