Data curation in the laboratory
April 26, 2019
A semi-thread from frustrated bioinformaticians emerged on twitter recently. In it they take shots at their (presumably) collaborators who do not take their requests for carefully curated and formatted data to heart.
Naturally this led me to taunt the data leech OpenScienceEleventy waccaloons for a little bit. The context is probably a little different (i.e., it seems to reference established collaborations between data-generating and data-analyzing folks) but the idea taps on one of my problems with the OpenScience folks. They inevitably don’t just mean they want access to the data that went into your paper but ALL of your data related to it. Down to the least little recorded unit (someone in the fighty thread said he wanted raw electrophysiological recordings to test out his own scoring algorithm or some such). And of course they always mean that it should be nicely formatted in their favorite way, curated for easy understanding by computer (preferably) and, in all ways, the burden should be on the data-generating side to facilitate easy computational analysis. This is one of the main parts that I object to in their cult/movement: data curation in this way comes with a not-insubstantial cost expended to the benefit of some internet random. I also object on the basis of the ownership issues, bad actors (think: anti-science extremists of various stripes including right wing “think tanks” and left wing animal rights terrorists), academic credit, opportunity loss, among other factors.
However, the thought of the day is about data curation and how it affects the laboratory business and my mentoring of science trainees. I will declare that consistent data collation, curation, archiving and notation is a good thing for me and for my lab. It helps the science advance. However, these things come at a cost. And above all else when we consider these things, we have to remember that not every data point collected enters a scientific manuscript or is of much value five or ten years down the line. Which means that we are not just talking about the efficient expenditure of effort on the most eventually useful data, we’re talking about everything. Does every single study get the full data analysis, graphical depiction and writeup? Not in my lab. Data are used at need. Data are curated to the extent that it makes sense and sometimes that is less than complete.
Data are collected in slightly different ways over time. Maybe we changed the collection software. Maybe our experiments are similar, but have a bit of a tweak to them. Maybe the analyses that we didn’t think up until later might be profitably applied to earlier datasets but…..the upside isn’t huge compared to other tasks. Does this mean we have to go back and re-do the prior analyses with the current approach? If we want to, this sometimes requires that third and fourth techniques (programs, analysis strategies, etc) be created and applied. This comes with additional effort costs. So why would we expend those efforts for something? If there was interest or need on the part of some member of the laboratory, sure. If a collaborator “needs” that analysis, well, this is going to be case by case on the basis of what it gains us, the collaboration or maybe the funded projects. Because it all costs. Time, which is money, and the opportunity cost of those staff members (and me) not doing other tasks.
Staff members. Ah, yes, the trainees. I am totally supportive of academic trainees who want to analyze data and come up with new ways to work with our various stock-in-trade data sets and archive of files. This, btw, is what I did at one of my postdoctoral stops. I was working with a model where we were somewhat captive to the rudimentary data analyses provided by the vendor’s software. The data files were essentially undocumented, save for the configuration data, dates and subject identifiers. I was interested in parsing the data in some new ways so I spent a lot of time making it possible to do so. For the current files I was collecting and for the archive of data collected prior to my arrival and for the data being collected by my fellow trainees. In short, I faced the kind of database that OpenData people claim is all they are asking for. Oh, just give us whatever you have, it’s better to have anything even if not annotated, they will claim. (Seriously). Well, I did the work. I was able to figure out the data structure in the un-annotated files. This was only possible because I knew how the programs were working, how the variables could be set for different things, what the animals were doing in a general sense in terms of possible responses and patterns, how the vendor’s superficial analysis was working (for validation), what errors or truncated files might exist, etc. I wrote some code to create the slightly-more-sophisticated analyses that I happened to dream up at the time. I then started on the task of porting my analysis to the rest of the lab. So that everyone from tech to postdoc was doing initial analysis using my programs, not the vendor ones. And then working that into the spreadsheet and graphing part of the data curation. And THEN, I started working my way back through the historical database from the laboratory.
It was a lot of work. A lot. Luckily my PI at the time was okay with it and seemed to think I was being productive. Some of the new stuff that I was doing with our data stream ended up being included by default in most of our publications thereafter. Some of it ended up in its own publication, albeit some 12 years after I had completed the initial data mining. (This latter paper has barely ever been cited but I still think the result is super cool.) The data mining of files from experiments that were run before I entered the laboratory required a second bit of work, as you might readily imagine. I had to parse back through the lab books to find out which subject numbers belonged together as cohorts or experiments. I had to separate training data from baseline / maintenance studies, from experimental manipulations of acute or longitudinal variety. And examine these new data extractions in the context of the actual experiment. None of this was annotated in the files themselves. There wasn’t really a way to even do it beyond 8 character file names. But even if it had been slightly better curated, I’m just not seeing how it would be useful without the lab books and probably some access to the research team’s memory.
Snapping forward to me as a PI, we have a somewhat similar situation in my lab. We have a behavioral assay or two run by proprietary commercial software that generate data files that could, in theory, be mined by anyone that was interested* in some aspect of the behavior that struck their fancy. It would still take a lot of work and at least some access to the superordinate knowledge about the studies a given subject/date stamped file related to. I am happy for trainees in my lab to play with the data files, present and past. I’m happy for them to even replace analysis and reporting strategies that I have developed with their own, so long as they can translate this to other people in the lab. I.e., I am distinctly unkeen on the analysis of data being locked up in the proprietary code or software on a single trainee’s laptop. If they want to do that, fine, but we are going to belt-and-suspenders it. There is much value in keeping a set of data analysis structures more or less consistent over time. Sometimes the most rudimentary output from a single data file (say, how many pellets that rat earned) is all that we need to know, but we need to know that value has been used consistently across years of my work.
I have at least two interests when it comes to data curation in my lab. I need some consistency and I need to be able to understand as the PI what I am looking at. I need to be able to go back to some half-remembered experiment and quickly whip up a preliminary data or slide figure. This leans towards more orthodoxy of analysis. Towards orthodoxy of data structures and formats. Towards orthodoxy in the graphs, for pete’s sake. My attempts to manage this into reality have had mixed results, I will note. At the level of an individual staffer, satisfying some data curation goal of the PI (or anyone else, really) can seem like make-work. And it is definitely work to the ends of someone else, I just happen to be the PI and am more equal than anyone else. But it is work. And this means that short cuts are taken. Often. And then it is down to the effort of someone to bring things back up to curation standard. Sure it may seem to be “just as easy” for the person to do it the way I want it, but whaddayaknow, they don’t always see it that way. Or are rushed. Or mean to get to that at the end of the study but then forget. Tomorrow. When it is really needed.
I get this. It is a simple human reality.
In my lab, I am the boss. I get to tell staff members what to do and if they won’t do it, eventually, I can fire them. Their personal efforts (and mine for that matter) are supposed to be directed towards the lab good, first, and the institutional good second. The NIH good is in there somewhere but we all know that since a grant is not a contract, this is a very undefined concept.
There is very little that suggests that the effort of my laboratory staff has to be devoted to the good of some other person who wants access to our data in a way that is useful to them. In fact, I am pretty sure in the extreme case that if I paid a tech or trainee from my grant to work substantial amounts of time on a data analysis/curation project demanded of us by a private for-profit company solely for their own ends, this would violate the rules. There would probably be a technical violation if we did the same for a grant project funded to another researcher if the work had nothing whatever to do with the funded aims in my own lab that were paying the staff member’s salary.
Data curation for others’ ends costs. It costs time and that means that it costs money. It is not trivial. Even setting up your data stream within lab so that it could possibly be easier to share with external data miners costs. And the costs apply to all of the data collected, not just the data that eventually, one day, are requested of you and end up in a paper.
__
*as it happens we just fielded a request but this person asked us to collaborate, rightfully so.
Fighting with the New Biosketch format
May 19, 2015
I have been flailing around, off and on for a few months, trying to write my Biosketch into the new format [Word doc Instructions and Sample].
I am not someone who likes to prance around bragging about “discoveries” and unique contributions and how my lab’s work is so awesomely unique because, let’s face it, I don’t do that kind of work. I am much more of a work-a-day type of scientist who likes to demonstrate stuff that has never been shown before. I like to answer what are seemingly obvious questions for which there should be lots of literature but then it turns out that there is not. I like to work on what interests me about the world and I am mostly uninterested in what some gang of screechy monkey GlamourHumpers think is the latest and greatest.
Ahem.
This is getting in the way of my ability to:
Briefly describe up to five of your most significant contributions to science. For each contribution, indicate the historical background that frames the scientific problem; the central finding(s); the influence of the finding(s) on the progress of science or the application of those finding(s) to health or technology; and your specific role in the described work.
Now interestingly, it was someone who works in a way most unlike the way I do that showed me the light. Actually, he gave me the courage to think about ignoring this supposed charge in the sample / instruction document. This person recommended just writing a brief sentence or two about the area of work without trying to contextualize the importance or significance of the “contribution”. I believe I actually saw one of the five permitted subheadings on his version that was more or less “And here’s some other stuff we work on that wasn’t easily categorized with the rest of it.”
I am at least starting from this minimalist standpoint. I don’t know if I will have the courage to actually submit it like this, but I’m leaning towards doing so.
I have been hearing from quite a number of you that you are struggling with creating this new version of the NIH Biosketch. So I thought I’d open it up to comment and observation. Anyone have any brilliant solutions / approaches to recommend?
UPDATE:
One of the things that has been bothering me most about this is that it takes the focus off of your work that is specific to the particular application in question. In the most recent version of the Biosketch, you selected 15 pubs that were most directly relevant to the topic at hand. These may not be your “most significant contributions” but they are the ones that are most significant for the newly proposed studies.
If one is now to list “your most significant contributions”, well, presumably some of these may not have much to do with the current application. And if you take the five sections seriously, it is hard to parse the subset of your work that is relevant to one focal R01 sized project into multiple headings and still show how those particular aspects are a significant contribution.
I still think it is ridiculous that they didn’t simply make this an optional way to do the Biosketch so as to accommodate those people that needed to talk about non-published scholarly works.
Peer Review: Advocates and Detractors Redux
August 20, 2014
A comment on a recent post from Grumble is a bit of key advice for those seeking funding from the NIH.
It’s probably impossible to eliminate all Stock Critique bait from an application. But you need to come close, because if you don’t, even a reviewer who likes everything else about your application is going to say to herself, “there’s no way I can defend this in front of the committee because the other reviewers are going to bring up all these annoying flaws.” So she won’t even bother trying. She’ll hold her fire and go all out to promote/defend the one application that hits on most cylinders and proposes something she’s really excited about.
This is something that I present as an “advocates and detractors” heuristic to improving your grant writing, surely, but it applies to paper writing/revising and general career management as well. I first posted comments on Peer Review: Friends and Enemies in 2007 and reposted in 2009.
The heuristic is this. In situations of scientific evaluation, whether this be manuscript peer-review, grant application review, job application or the tenure decision, one is going to have a set of advocates in favor of one’s case and detractors who are against. The usual caveats apply to such a strict polarization. Sometimes you will have no advocates, in which case you are sunk anyway so that case isn’t worth discussing. The same reviewer can simultaneously express pro and con views but as we’ll discuss this is just a special case.
The next bit in my original phrasing is what Grumble is getting at in the referenced comment.
Give your advocates what they need to go to bat for you.
This is the biggie. In all things you have to give the advocate something to work with. It does not have to be overwhelming evidence, just something. Let’s face it, how many times are you really in position in science to overwhelm objections with the stupendous power of your argument and data to the point where the most confirmed critic cries “Uncle”. Right. Never happens.
The point here is that you need not put together a perfect grant, nor need you “wait” until you have X, Y or Z bit of Preliminary Data lined up. You just have to come up with something that your advocates can work with. As Grumble was pointing out, if you give your advocate a grant filled with StockCritique bait then this advocate realizes it is a sunk cause and abandons it. Why fight with both hands and legs trussed up like a Thanksgiving turkey?
Let’s take some stock critiques as examples.
“Productivity”. The goal here is not to somehow rush 8 first author papers into press. Not at all. Just give them one or two more papers, that’s enough. Sometimes reiterating the difficulty of the model or the longitudinal nature of the study might be enough.
“Independence of untried PI with NonTenureTrackSoundin’ title”. Yes, you are still in the BigPIs lab, nothing to be done about that. But emphasize your role in supervising whole projects, running aspects of the program, etc. It doesn’t have to be meticulously documented, just state it and show some sort of evidence. Like your string of first and second authorships on the papers from that part of the program.
“Not hypothesis driven”. Sure, well sometimes we propose methodological experiments, sometimes the outcome is truly a matter of empirical description and sometimes the results will be useful no matter how it comes out so why bother with some bogus bet on a hypothesis? Because if you state one, this stock critique is de-fanged, it is much easier to argue the merits of a given hypothesis than it is the merits of the lack of a hypothesis.
Instead of railing against the dark of StockCriticism, light a tiny candle. I know. As a struggling newb it is really hard to trust the more-senior colleagues who insist that their experiences on various study sections have shown that reviewers often do go to bat for untried investigators. But….they do. Trust me.
There’s a closely related reason to brush up your application to avoid as many obvious pitfalls as possible. Because it takes ammunition away from your detractors, which makes the advocate’s job easier.
Deny your detractors grist for their mill.
Should be simple, but isn’t. Particularly when the critique is basically a reviewer trying to tell you to conduct the science the way s/he would if they were the PI. (An all too common and inappropriate approach in my view.) If someone wants you to cut something minor out, for no apparent reason (like say the marginal cost of doing that particular experiment is low), just do it. Add that extra control condition. Respond to all of their critiques with something, even if it is not exactly what the reviewer is suggesting; again your ultimate audience is the advocate, not the detractor. Don’t ignore anything major. This way, they can’t say you “didn’t respond to critique”. They may not like the quality of the response you provide, but arguing about this is tougher in the face of your advocating reviewer.
This may actually be closest to the core of what Grumble was commenting on.
I made some other comments about the fact that a detractor can be converted to an advocate in the original post. The broader point is that an entire study section can be gradually converted. No joke that with enough applications from you, you can often turn the tide. Either because you have argued enough of them (different reviewers might be assigned over time to your many applications) into seeing science your way or because they just think you should be funded for something already. It happens. There is a “getting to know you” factor that comes into play. Guess what? The more credible apps you send to a study section, the more they get to know you.
Ok, there is a final bit for those of you who aren’t even faculty yet. Yes, you. Things you do as a graduate student or as a postdoc will come in handy, or hurt you, when it comes time to apply for grants as faculty. This is why I say everyone needs to start thinking about the grant process early. This is why I say you need to start talking with NIH Program staff as a grad student or postdoc.
Plan ahead
Although the examples I use are from the grant review process, the application to paper review and job hunts are obvious with a little thought. This brings me to the use of this heuristic in advance to shape your choices.
Postdocs, for example, often feel they don’t have to think about grant writing because they aren’t allowed to at present, may never get that job and if they do they can deal with it later. This is an error. The advocate/detractor heuristic suggests that postdocs make choices to expend some effort in a broad range of areas. It suggests that it is a bad idea to gamble on the BIG PAPER approach if this means that you are not going to publish anything else. An advocate on a job search committee can work much more easily with a dearth of Science papers than s/he can with a dearth of any pubs whatsoever!
The heuristic suggests that going to the effort of teaching just one or two courses can pay off; you never know if you’ll be seeking a primarily-teaching job after all. Nor when “some evidence of teaching ability” will be the difference between you and the next applicant for a job. Take on that series of time-depleting undergraduate interns in the lab so that you can later describe your supervisory roles in the laboratory.
This latter bit falls under the general category of managing your CV and what it will look like for future purposes.
Despite what we would like to be the case, despite what should be the case, despite what is still the case in some cozy corners of a biomedical science career….let us face some facts.
- The essential currency for determining your worth and status as a scientist is your list of published, peer reviewed contributions to the scientific literature.
- The argument over your qualities between advocates and detractors in your job search, promotions, grant review, etc is going to boil down to pseudo quantification of your CV at some point.
- Quantification means analyzing your first author / senior author /contributing author pub numbers. Determining the impact factor of the journals in which you publish. Examining the consistency of your output and looking for (bad) trends. Viewing the citation numbers for your papers.
- You can argue to some extent for extenuating circumstances, the difficulty of the model, the bad PI, etc but it comes down to this: Nobody Cares.
My suggestion is, if you expect to have a career you had better have a good idea of what the standards are. So do the research. Do compare your CV with those of other scientists. What are the minimum criteria for getting a job / grant / promotion / tenure in your area? What are you going to do about it? What can you do about it?
This echoes something Odyssey said on the Twitts today:
and
are true for your subfield stage as well as your University stage of performance.
Something you'd learn in business school?
March 19, 2014
I notice that you go to RePORTER for information to help solve mysteries and gather data and strategic thinking respecting grant strategies and for a feel for national trends. Do you ever bring this up in conversation with POs? For example, “I noticed in RePORTER that only 5 R01s have been funded from this PAR, none from my IC of interest (which had a part in creating the FOA) …. what’s up with that?” Or is that too …. something … seems like something you’d learn in business school …. not grad school (where we learn to pipette and run gels).
The things that I talk about on this blog are things that I learned, sometimes the hard way, as a faculty level scientist who was expected to land major research grants to fund his laboratory. A few things I picked up as a postdoc, but my education really accelerated after my career world said “Okay, show us what you can do, junior.”
Much of what I relate to you I learned in bits and pieces over a very long period of time. Just this very month, btw, I learned yet a new wrinkle on NIH behavior when it comes to grants. I am always learning new stuff.
Obviously, I think it is imperative for my continued career existence that I keep my head up about where the lab’s funding comes from. I blog because I think you Readers should do so as well. Most of this stuff isn’t rocket science, just information. Information that you will over time come to value, information that you will find to be incorrect for your situation and information that may never be of use.
It is my belief that the more academic folks who plan NIH extramurally funded careers know about the NIH system, the better for them. And even grad school isn’t too early to start to pick up the basics.
When it comes to Program Officers and, yes, Scientific Review Officers, my answer to erickttr is a simple “Heck yes!”. Even people who are part of the system don’t necessarily know everything about the system. Not even “their” piece of the system!
You may recall my various frustrations over the years with aspects of the NIH system that participants in the system seem to ignore. Rockey’s assertion about PhD job prospects. The amazing discovery that NI awards, prior to the invention of the ESI category, were going to highly established PIs who simply hadn’t been NIH funded yet. Program Officers who told people in soft-money jobs that “well, that’s not a very good job, you shouldn’t be taking those”. POs who tell investigators they just need to “write better grants”. SROs who were entirely unaware of the A2 traffic-pattern effect as it was developing (“What do you mean this study section rarely funds A0 applications?”).
The list goes on and on.
Areas of scientific study that are woefully underfunded by your favorite IC are no different. YOU, scientists, serve an educational purpose. You do this by virtue of the grant applications you submit. You do this by virtue of the reviews that you supply when asked to serve on study section. You do this in your annual Progress Reports.
And you do this by chatting up your friendly POs on the phone or at scientific meetings.
Part of your argument can be derived from RePORTER. Of course. Particularly when you want them to fund you to do X and there are hardly any grants funded on X at the moment. Maybe you can point to a study section which should be handling X but never seems to let any proposals out with a fundable score. Who knows, maybe you will eventually get a Program Announcement or Request for Applications funded.
Go RePORTing folks.
Addendum based on this (wisecrack?) comment from SidVic.
You are poorly served by these idiots, and it is shameful that the NIAAA portfolio doesn’t contain at least 4-5 projects addressing x and z. Hey you should really do your duty to humankind and pick up my grant…
Obviously you want to be polite. But the real point here is that you are playing the long game. When you front people with the deficits in their system, they are not going to immediately agree you are right and hand you a new grant award. Not the way the world works. You are trying to shape their own beliefs. This can take time. And you are also trying to give them (your advocates) the ammunition that they need to make their case with higher-ups. (When text that is suspiciously similar to your rantings shows up in the RFA, just quietly pat yourself on the back and consider it a job well done!)
Slowly extend middle finger
September 14, 2007
Unsurprisingly, you can make your life better in sports by learning new skills. It’s one of those fun little benefits that you can set out objectively to get better at something and within a short time have obvious results. Cool. How many aspects of your life offer this?