Data curation in the laboratory

April 26, 2019

A semi-thread from frustrated bioinformaticians emerged on Twitter recently. In it, they take shots at people (presumably their collaborators) who do not take their requests for carefully curated and formatted data to heart.

Naturally this led me to taunt the data-leech OpenScienceEleventy wackaloons for a little bit. The context is probably a little different (i.e., it seems to reference established collaborations between data-generating and data-analyzing folks) but the idea taps into one of my problems with the OpenScience folks. They inevitably don't just mean they want access to the data that went into your paper but ALL of your data related to it. Down to the least little recorded unit (someone in the fighty thread said he wanted raw electrophysiological recordings to test out his own scoring algorithm or some such). And of course they always mean that it should be nicely formatted in their favorite way, curated for easy understanding by computer (preferably) and, in all ways, that the burden should be on the data-generating side to facilitate easy computational analysis. This is one of the main parts that I object to in their cult/movement: data curation in this way comes with a not-insubstantial cost, expended for the benefit of some internet random. I also object on the basis of ownership issues, bad actors (think: anti-science extremists of various stripes, including right wing "think tanks" and left wing animal rights terrorists), academic credit, and opportunity loss, among other factors.

However, the thought of the day is about data curation and how it affects the laboratory business and my mentoring of science trainees. I will declare that consistent data collation, curation, archiving and notation are good things for me and for my lab. They help the science advance. However, these things come at a cost. And above all else, when we consider these things we have to remember that not every data point collected enters a scientific manuscript or is of much value five or ten years down the line. Which means that we are not just talking about the efficient expenditure of effort on the most eventually useful data; we're talking about everything. Does every single study get the full data analysis, graphical depiction and writeup? Not in my lab. Data are used at need. Data are curated to the extent that it makes sense, and sometimes that is less than complete.

Data are collected in slightly different ways over time. Maybe we changed the collection software. Maybe our experiments are similar, but have a bit of a tweak to them. Maybe the analyses that we didn't think up until later might be profitably applied to earlier datasets, but... the upside isn't huge compared to other tasks. Does this mean we have to go back and re-do the prior analyses with the current approach? If we want to, that sometimes requires that third and fourth techniques (programs, analysis strategies, etc.) be created and applied. This comes with additional effort costs. So why would we expend that effort? If there was interest or need on the part of some member of the laboratory, sure. If a collaborator "needs" that analysis, well, this is going to be case by case on the basis of what it gains us, the collaboration or maybe the funded projects. Because it all costs. Time, which is money, and the opportunity cost of those staff members (and me) not doing other tasks.
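
To make that cost concrete, here is a minimal sketch of what those "third and fourth techniques" look like in practice: one loader that dispatches on the file-format era so that old and new session files can feed the same downstream analysis. Everything here (the format marker, the field names) is invented for illustration and implies nothing about our actual software.

    from pathlib import Path

    def parse_v1(text: str) -> dict:
        # Legacy era (assumed): bare comma-separated response counts, no header.
        return {"responses": [int(x) for x in text.strip().split(",") if x]}

    def parse_v2(text: str) -> dict:
        # Tweaked era (assumed): a "FORMAT2" header line, then key=value lines,
        # with responses stored as semicolon-separated integers.
        fields = dict(
            line.split("=", 1) for line in text.splitlines()[1:] if "=" in line
        )
        raw = fields.get("responses", "")
        return {"responses": [int(x) for x in raw.split(";") if x]}

    def load_session(path: Path) -> dict:
        # One entry point for every era, so re-running a current analysis on an
        # old dataset does not require touching the analysis code itself.
        text = path.read_text()
        if text.startswith("FORMAT2"):
            return parse_v2(text)
        return parse_v1(text)

The point of the dispatch is that the cost of supporting an old format gets paid once, in the loader, rather than over and over in every analysis someone wants to re-run.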

Staff members. Ah, yes, the trainees. I am totally supportive of academic trainees who want to analyze data and come up with new ways to work with our various stock-in-trade data sets and archive of files. This, btw, is what I did at one of my postdoctoral stops. I was working with a model where we were somewhat captive to the rudimentary data analyses provided by the vendor’s software. The data files were essentially undocumented, save for the configuration data, dates and subject identifiers. I was interested in parsing the data in some new ways so I spent a lot of time making it possible to do so. For the current files I was collecting and for the archive of data collected prior to my arrival and for the data being collected by my fellow trainees. In short, I faced the kind of database that OpenData people claim is all they are asking for. Oh, just give us whatever you have, it’s better to have anything even if not annotated, they will claim. (Seriously). Well, I did the work. I was able to figure out the data structure in the un-annotated files. This was only possible because I knew how the programs were working, how the variables could be set for different things, what the animals were doing in a general sense in terms of possible responses and patterns, how the vendor’s superficial analysis was working (for validation), what errors or truncated files might exist, etc. I wrote some code to create the slightly-more-sophisticated analyses that I happened to dream up at the time. I then started on the task of porting my analysis to the rest of the lab. So that everyone from tech to postdoc was doing initial analysis using my programs, not the vendor ones. And then working that into the spreadsheet and graphing part of the data curation. And THEN, I started working my way back through the historical database from the laboratory.
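
For flavor, here is roughly what that kind of reverse-engineering reduces to once you have worked out the layout. The record structure below is entirely hypothetical (I am not describing the actual vendor format); the point is that none of these offsets are knowable from the file alone. They come from knowing what the acquisition program could possibly have written.

    import struct
    from pathlib import Path

    # Hypothetical fixed-width layout: an 8-character subject ID, the session
    # date as a uint32 (YYYYMMDD), a uint16 count of response timestamps, and
    # then that many float32 values. Little-endian throughout (assumed).
    HEADER = struct.Struct("<8sIH")

    def read_session(path: Path) -> dict:
        data = path.read_bytes()
        subject, date, n = HEADER.unpack_from(data, 0)
        times = struct.unpack_from(f"<{n}f", data, HEADER.size)
        return {
            "subject": subject.decode("ascii").rstrip("\x00"),
            "date": date,
            "response_times": list(times),
        }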

It was a lot of work. A lot. Luckily my PI at the time was okay with it and seemed to think I was being productive. Some of the new stuff that I was doing with our data stream ended up being included by default in most of our publications thereafter. Some of it ended up in its own publication, albeit some 12 years after I had completed the initial data mining. (This latter paper has barely ever been cited but I still think the result is super cool.) The data mining of files from experiments that were run before I entered the laboratory required a second bit of work, as you might readily imagine. I had to parse back through the lab books to find out which subject numbers belonged together as cohorts or experiments. I had to separate training data from baseline/maintenance studies, and from experimental manipulations of the acute or longitudinal variety. And then examine these new data extractions in the context of the actual experiment. None of this was annotated in the files themselves. There wasn't really a way to even do it beyond 8-character file names. But even if it had been slightly better curated, I'm just not seeing how it would be useful without the lab books and probably some access to the research team's memory.
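
If it helps to see why the lab books matter, here is a sketch of that rejoining step, assuming the lab-book pages have already been hand-transcribed into a table. The file name and column names are mine, not anything from the original lab.

    import csv
    from collections import defaultdict

    def index_by_experiment(labbook_csv: str) -> dict:
        # Group subject IDs by (cohort, phase) so the 8-character data files
        # can be pulled back into their experiments. "Phase" stands in for
        # training vs. baseline/maintenance vs. acute/longitudinal manipulation.
        groups = defaultdict(list)
        with open(labbook_csv, newline="") as fh:
            for row in csv.DictReader(fh):
                groups[(row["cohort"], row["phase"])].append(row["subject_id"])
        return dict(groups)

Without that hand-built table, the files themselves give you nothing to group on, which is the whole point.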

Snapping forward to me as a PI, we have a somewhat similar situation in my lab. We have a behavioral assay or two run by proprietary commercial software that generate data files that could, in theory, be mined by anyone who was interested* in some aspect of the behavior that struck their fancy. It would still take a lot of work and at least some access to the superordinate knowledge about which study a given subject/date-stamped file related to. I am happy for trainees in my lab to play with the data files, present and past. I'm happy for them to even replace analysis and reporting strategies that I have developed with their own, so long as they can translate this to other people in the lab. I.e., I am distinctly unkeen on the analysis of data being locked up in proprietary code or software on a single trainee's laptop. If they want to do that, fine, but we are going to belt-and-suspenders it. There is much value in keeping a set of data analysis structures more or less consistent over time. Sometimes the most rudimentary output from a single data file (say, how many pellets that rat earned) is all that we need to know, but we need to know that value has been computed consistently across years of my work.
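
The belt-and-suspenders point is easier to show than to describe: keep the rudimentary, agreed-upon measure in exactly one canonical function, and let any trainee's fancier analysis wrap it rather than re-derive it. A toy sketch, with the event labels invented:

    def pellets_earned(session: dict) -> int:
        # The one canonical definition, used unchanged across years. If a
        # trainee's new pipeline reports a different number for the same file,
        # that discrepancy is a bug to chase, not a matter of taste.
        return sum(1 for event in session.get("events", []) if event == "PELLET")

    def trainee_summary(session: dict) -> dict:
        # A replacement analysis can add whatever it likes, so long as the
        # canonical value still comes from the shared function above.
        return {
            "pellets": pellets_earned(session),
            "n_events": len(session.get("events", [])),
        }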

I have at least two interests when it comes to data curation in my lab. I need some consistency and I need to be able to understand, as the PI, what I am looking at. I need to be able to go back to some half-remembered experiment and quickly whip up a preliminary data figure or slide. This leans towards more orthodoxy of analysis. Towards orthodoxy of data structures and formats. Towards orthodoxy in the graphs, for pete's sake. My attempts to manage this into reality have had mixed results, I will note. At the level of an individual staffer, satisfying some data curation goal of the PI (or anyone else, really) can seem like make-work. And it is definitely work to the ends of someone else; I just happen to be the PI and am more equal than anyone else. But it is work. And this means that shortcuts are taken. Often. And then it is down to the effort of someone to bring things back up to curation standard. Sure, it may seem "just as easy" for the person to do it the way I want it, but whaddayaknow, they don't always see it that way. Or are rushed. Or mean to get to that at the end of the study but then forget. Tomorrow. When it is really needed.
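
One cheap way to buy that orthodoxy is to route every preliminary figure through a single house-style helper, so a half-remembered experiment can be re-plotted identically years later. A sketch, with the styling choices purely illustrative:

    import matplotlib.pyplot as plt

    def lab_figure(xlabel: str, ylabel: str, title: str = ""):
        # Every quick-look graph uses the same frame, so figures from years
        # apart drop into the same slide deck without reformatting.
        fig, ax = plt.subplots(figsize=(4, 3))
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        if title:
            ax.set_title(title)
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        fig.tight_layout()
        return fig, ax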

I get this. It is a simple human reality.

In my lab, I am the boss. I get to tell staff members what to do and if they won’t do it, eventually, I can fire them. Their personal efforts (and mine for that matter) are supposed to be directed towards the lab good, first, and the institutional good second. The NIH good is in there somewhere but we all know that since a grant is not a contract, this is a very undefined concept.

There is very little that suggests that the effort of my laboratory staff has to be devoted to the good of some other person who wants access to our data in a way that is useful to them. In fact, I am pretty sure in the extreme case that if I paid a tech or trainee from my grant to work substantial amounts of time on a data analysis/curation project demanded of us by a private for-profit company solely for their own ends, this would violate the rules. There would probably be a technical violation if we did the same for a grant project funded to another researcher, if the work had nothing whatsoever to do with the funded aims in my own lab that were paying the staff member's salary.

Data curation for others' ends costs. It costs time, and that means that it costs money. It is not trivial. Even setting up your data stream within the lab so that it could possibly be easier to share with external data miners costs. And the costs apply to all of the data collected, not just the subset that is eventually, one day, requested of you and ends up in a paper.

__

*As it happens, we just fielded a request, but this person asked us to collaborate, rightfully so.
