PLoS is letting the inmates run the asylum and this will kill them

February 25, 2014

The latest round of waccaloonery is the new PLoS policy on Data Access.

I’m also dismayed by two other things of which I’ve heard credible accounts in recent months. First, the head office has started to question authors over their animal use assurance statements, declining to take the statement of local IACUC oversight as valid because of the research methods and outcomes. On the face of it, it isn’t terrible to be robustly concerned about animal use. However, in the case I am familiar with, they got it embarrassingly wrong. Wrong because any slight familiarity with the published literature would show that the “concern” was misplaced. Wrong because if they are going to try to sidestep the local IACUC, AAALAC and OLAW (and their worldwide equivalents) processes, then they are headed down a serious rabbithole of expensive investigation and verification. At the moment this cannot help but be biased, and accusations are going to rain down on the non-English-speaking and non-Western-country investigators, I can assure you.

The second incident has to do with accusations of self-plagiarism based on the sorts of default Methods statements or Introduction and/or Discussion points that get repeated. Look, there are only so many ways to say “and thus we prove a new facet of how the PhysioWhimple nucleus controls Bunny Hopping”. Only so many ways to say “The reason BunnyHopping is important is because…”. Only so many ways to say “We used optogenetic techniques to activate the gertzin neurons in the PhysioWhimple nucleus by…”. This one is particularly salient because it works against the current buzz about replication and reproducibility in science. Right? What is a “replication” if not plagiarism? And not just of the way the Methods are described, the reason for doing the study and the interpretation. No, in this case it is plagiarism of the important part. The science. This is why concepts of what is “plagiarism” in science cannot be aligned with concepts of plagiarism in a piece of humanities text.

These two issues highlight, once again, why it is TERRIBLE for us scientists to let the humanities-trained and humanities-blinkered wordsmiths who run journals dictate how publication is supposed to work.

Data depository obsession gets us a little closer to home because the psychotics are the Open Access Eleventy waccaloons who, presumably, started out as nice, normal, reasonable scientists.

Unfortunately PLoS has decided to listen to the wild-eyed fanatics and to play in their fantasy realm of paranoid ravings.

This is a shame and will further isolate PLoS. It will short-circuit the gradual progress they have made in persuading regular, non-waccaloon science folks of the PLoS ONE mission. It will seriously cut down submissions…which is probably a good thing, since PLoS ONE continues to suffer from growing pains.

But I think it a horrible loss that their current theological orthodoxy is going to blunt the central good of PLoS ONE, i.e., the assertion that predicting “impact” and “importance” before a manuscript is published is a fool’s errand and inconsistent with the best advance of science.

The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost in personnel time, to address the legitimate sins of the few. The scope of the problem hasn’t even been shown to be significant, yet we are ALL supposed to devote a lot more of our precious personnel time to data curation. Need I mention that research funds are tight and that personnel time is the most significant cost?

This brings us to the second problem: this Data Access policy requires much additional data curation, which will take time. We all handle data in the way that has proved most effective for us in our operations. Other labs have, no doubt, done the same. Our solutions are not the same as those of people doing very nearly the same work. Why? Because the PI thinks differently. The postdocs and techs have different skill sets. Maybe we are interested in a sub-analysis of a data set that nobody else worries about. Maybe the proprietary software we use differs, and the smoothest way to manipulate data is different. We use different statistical and graphing programs. Software versions change. Some people’s datasets are so large as to challenge the capability of regular-old desktop computer and storage hardware. Etc, etc, etc, ad nauseam.

Third problem: this diversity in data handling results, inevitably, in attempts at data orthodoxy. So we burn a lot of time and effort fighting over that. Who wins? Do we force other labs to look at the damn cumulative records for drug self-administration sessions because some old-school behaviorists still exist in our field? Do we insist on individual subjects’ presentations for everything? How do we time-bin a behavioral session? Are the standards for dropping subjects the same in every possible experiment? (Answer: no.) Who annotates the files so that any idiot humanities-major on the editorial staff of PLoS can understand that it is complete?

Fourth problem: I grasp that actual fraud and misleading presentation of data happen. But I also recognize, as the waccaloons do not, that there is a LOT of legitimate difference of opinion on data handling, even within a very old and well-established methodological tradition. I also see a lot of will on the part of science denialists to pretend that science is something it cannot be in their nitpicking of the data. There will be efforts to say that the way lab X deals with, e.g., their fear conditioning trials is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself into the role of dictating how experiments are to be conducted and interpreted! That’s fine for post-publication review, but to use it as a gatekeeper before publication? Really, PLoS ONE? Do you see how this is exactly like preventing publication because two of your three reviewers argue that it is not impactful enough?

This is the reality. Pushes for Data Access will inevitably, in real practice, result in constraints on the very diversity of science that makes it so productive. It will burn a lot of time and effort that could be more profitably applied to conducting and publishing more studies. It addresses a problem that is not clearly established as significant.

137 Responses to “PLoS is letting the inmates run the asylum and this will kill them”

  1. Odyssey Says:

    Read the second-to-last paragraph of their policy update. My interpretation is that it kind of negates the reasons for having the policy in the first place.

  2. rxnm Says:

    Actually, I think it is not cheatfucks they are after. If it were, this obviously does nothing. Dumb cheatfucks cut and paste their Western bands; smart cheatfucks cheat by leaving out everything that doesn’t produce their desired result. This policy won’t help catch either kind.

    No, this is driven by the Mountain Dew chuggers sitting at the workstations in the windowless offices. See, they have these fantastic models that, once they make 163 baseless assumptions, hop just like real bunnies. Now they need all of your data to feed their models, and finally we won’t need you meat puppet pipet monkeys to, you know, really understand (in the physics sense!!!!) bunny hopping. Because the model hops like a bunny, even with data from actual hopping bunnies in it. Explained!

    Because these people are hyper-aware of a few types of data that they really really want and only dimly aware of the existence of all other kinds of science, the policy is tailored in a bizarre way that makes no sense for the majority of P1 authors.

  3. Bill Says:

    The first problem with this new policy is that it suggests that everyone should radically change the way they do science

    If it is so much work to provide access to that data that it represents a “radical” change in the way you work, you’re not actually doing science anyway. You’re storing it and working with it, aren’t you? If someone disagrees with the way you’re handling the data, how is that a bad thing? (And no, I don’t mean a poetry major turned editor — though I fail to see why a poetry major couldn’t raise valid questions — a peer, working in your field and reading your paper — either as a reviewer, or after it’s published.)

    They could replace the entire policy with one sentence: “PLOS does not accept references to ‘data not shown’” and it would be saying the same thing.

    Let’s turn it around: in this Retraction Watch age, why should any journal accept any result supported by “data not shown”?

    This Data Access policy requires much additional data curation which will take time

    No it doesn’t. Provide the data in the format in which you use it; on rare occasions, the editor or reviewers will actually take a look at it, in which case it won’t (or it shouldn’t, see first point above) take much back-and-forth to sort the kinks out.

    Who annotates the files so that any idiot humanities-major on the editorial staff of PLoS can understand that it is complete?

    What gives you the idea that “idiot humanities-majors” will be looking at it in the first place?

    I also see a lot of will on the part of science denialists yadda yadda blah blah

    Not this tired old chestnut again. Every time anyone suggests letting someone look at the underlying data you cry wolf about the Terrible Awful No-good Bad Things that anti-scientists will do if they get their hands on the data. You talk about “paranoid ravings” but your own fears on this issue are grotesquely overblown.

  4. caydenberg Says:

    You’re paid by the public to do science, so the outcomes of your research should be public. This includes your awesome ideas and your eleven-page Discussion, but it also includes plasmids, cell lines, reagents, code and data. I have spent way too much of my time at the bench trying to reinvent what other people have done, because influential papers increasingly seem to be castles with no foundation.

  5. drugmonkey Says:

    How about I send you the raw files from one of my behavioral experiments and you work forward from there to put my data in a format anyone can understand? And we carefully track how much time it takes for you to come up with something meaningful? If you do.

  6. drugmonkey Says:

    Point being, if I send you anything that is transformed, re-annotated, analyzed or processed, it is the same as “data not shown” to the same extent you are criticizing the published figures or tables.

  7. drugmonkey Says:

    Idiot humanities majors are the ones causing the first two problems, Bill. I have little reason to think anything other than that PLoS ONE decision making is being driven by nonscientist thinking.

    Sorry but anyone who countenances this “self-plagiarism” bullshit is not a scientist. Full fucking stop. I don’t care what their training or credentials or history is, they have fundamentally missed the point of science.


  8. My issues with the PLOS data policy, beyond the ones you’ve outlined above:

    1. A certain vagueness as to what is “the data”. Just the set I used for the analysis? The logs from the simulation used to make the set used for the analysis? Something else? As you said, it’s then a matter of how the providing study team processes their data, and we’re back to square one, except now we all *think* we don’t have a problem because we have a “data access policy”.

    Any policy for which I can construct one argument that complying with data access is trivial and useless, and another that it becomes nearly a full-time job, indicates a problem with how the policy is written.

    2. “Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.”

    So I’m going to convince my department to create some sort of data access panel just so I can submit a paper to PLoS? This…doesn’t so much feel like something that’s going to happen, and it absolutely puts people without the political power to make their department do something at a disadvantage.

    3. All of their language treats data access as a foregone conclusion – ask and ye shall receive – with no mechanism for handling vexatious requests (Andy Schlafly) or groups that might not wish their data to be available to any and all comers (any minority group ever).

    That’s going to get tangled up in human subjects protections all over the place.

  9. bashir Says:

    It’s difficult to make consistent policies that will work well across all of science. That’s something I think these “change how science is done!” folks often miss. Science is very heterogeneous. PLOS in my experience clearly has some blind spots (look at PLOS ONE’s editor list). Perhaps this is due to what rxnm mentioned. Researchers who work with humans (yes, they exist) are going to have a lot of issues balancing privacy, HIPAA, etc. That seems like a pretty big swath to me, not simply “a few individual exceptions”.


  10. “No, this is driven by the Mountain Dew chuggers sitting at the workstations in the windowless offices. See, they have these fantastic models that, once they make 163 baseless assumptions, hop just like real bunnies. Now they need all of your data to feed their models and finally we won’t need you meat puppet pipet monkeys to, you know, really understand (in the physics sense!!!!) bunny hopping. Because the model hops like a bunny, even with data from actual hopping bunnies in it. Explained!

    Because these people are hyper-aware of a few types of data that they really really want and only dimly aware of the existence of all other kinds of science, the policy is tailored in a bizarre way that makes no sense for the majority of P1 authors.”

    I am one of these “Mountain Dew chuggers sitting at a workstation in a windowless office”. For the most part, we can harvest what we need from your actual papers, and since most of our requests for data end up with more citations from you, there’s actually remarkably little resistance. I think I’ve had a request refused once, and that was because their own pet team of Mountain Dew chuggers was already on it.

    I think you’re barking up the wrong tree. This has much more the feeling of someone with a narrow view of what ‘data’ constitutes, who wants to be able to take your data, click ‘Run’ in their favorite stats package, and walk away feeling good that because they got the same number, all is right in the world and we are safe from the evils of scientific fraud.

  11. AcademicLurker Says:

    Sorry but anyone who countenances this “self-plagiarism” bullshit is not a scientist. Full fucking stop.

    Agreed. Obviously double publishing the same paper (or nearly) is wrong. But if you work in the same field for years, the introduction sections of your papers will likely get a bit repetitive, and devising new ways of saying “protein X plays an important role in the regulation of process Y” is not a particularly worthwhile use of your time.

  12. Bill Says:

    Sorry but anyone who countenances this “self-plagiarism” bullshit is not a scientist. Full fucking stop.

    Oh, I agree 100%. I meant to say that above — I absolutely agree with you about the IACUC madness and the “self-plagiarism” inanity.

    But those are entirely separate problems from the “data not shown” issue.

  13. bj Says:

    I’ll take your data file and try to make sense of it. I think that would be a fun experiment. Willing to offer you a data file, too, but I think you don’t have the time to spend on it. I think it’s quite possible that you would be right that the information would be useless without significant curation on your part, but I think it’s a worthwhile experiment to do, for a published result, to see whether someone else can make sense of the raw data that underlies it, without any additional information.

    “self-plagiarism” is ridiculous in science, and does smack of training in a discipline in which how you say something is your novel creative input (rather than what you say and what you did and what you found). But, do we have to use the word “idiot” to describe that mismatch between science and other disciplines?

  14. Bill Says:

    I think you are being a bit disingenuous when you say How about I send you the raw files from one of my behavioral experiments. If I’m really into your flavour of bunnyhopping, I probably know how to work up those files — and it shouldn’t take much effort on your part to tell me (or include annotations or examples or whatever to show me) how you worked them up. You’re being tasked with providing it in a form that “one skilled in the art”, as our patent lawyer friends would put it, can use — not in a form that Joe the Plumber can understand at a glance.

    I see nothing in the PLOS policy that says they will minutely examine each and every data file; how could they? As with most science most of the time, they’re going to take it on trust that if you say BunnyHopper.fck contains the raw data for your paper, then that’s what’s in there. You can deposit garbage files if you want to take the risk, just like you can photoshop your westerns now if you want to take the risk.

    IF it so happens that a reviewer or editor has a question the data might answer, then they can dig through it. If they have the inclination, they might give it a sniff just to see if it smells about right. And a year from now, when I discover Bunnyjumping, I can go back to your paper and use the data to show that what you were actually seeing was jumping, not hopping.

  15. drugmonkey Says:

    If I’m really into your flavour of bunnyhopping, I probably know how to work up those files

    That is where you would be wrong, though. And I’m actually relatively good about this sort of thing…I don’t use a lot of home-brewed software for data collection anymore. And I don’t let the postdocs get too far away from my level of comprehension. But I’d bet there’s more than one lab where the PI literally could not work their way forward from the raw data collection files to a simple thing like “lever presses”.

    As with most science most of the time, they’re going to take it on trust that if you say BunnyHopper.fck contains the raw data for your paper, then that’s what’s in there.

    then what is the point? Why not just take it on trust that I will cough up the raw data on request?

    You can deposit garbage files if you want to take the risk

    Whut? Why would I want to do that? I don’t want to deposit any files at all. Why is this so hard to grasp?

    And a year from now, when I discover Bunnyjumping, I can go back to your paper and use the data to show that what you were actually seeing was jumping, not hopping.

    Yep, or you could shoot me an email and I’ll send it to you. Like people do for me when I have a serious question about something.

  16. drugmonkey Says:

    the introduction sections of your papers will likely get a bit repetitive, and devising new ways of saying “protein X plays an important role in the regulation of process Y” is not a particularly worthwhile use of your time.

    The problem is, we are beset by people in scientific publishing who do not seem to understand this. While the unique contribution in the humanities may be the precise combination of words and phrases used, this can be entirely irrelevant to the unique contribution of a scientific paper. The idea that one is somehow cheating by copying oneself, even if one is restating the same hypothesis and/or conclusions over and over again, is absurd. If someone publishes the exact same data figures with a re-phrasing of the rest of the manuscript, then this IS an issue of note, even if it passes some automated word-comparison filter.

    But those are entirely separate problems from the “data not shown” issue.

    No, with this post I wish to make the point that PLoS is going off the deep end. It wasn’t only about my distaste for the data-depositing issues.


  17. Methinks you doth protest too much drugmonkeyblog. Yes, the open requirement for data could be interpreted as a massive imposition and lead to mountains of essentially uninterpretable primary data having to be kept on the remote chance someone wishes to sift through it. If you’ve ever had to do forensic analysis of primary data you’ll know it’s a thankless and time-consuming task. However, the PLoS policy has an important clarification:

    “…the Data Policy states the “minimal dataset” consists “of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety.” This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analyses presented in the paper.”

    See: http://www.plos.org/update-on-plos-data-policy/

    Is this not reasonable? Good practice requires we maintain data to back up published findings. It doesn’t mean everything has to be kept, only that from which the published findings were drawn. It’s hard enough to get published reagents from some labs, and as we’ve seen numerous times, essential details are often missing from published methods. There absolutely is a cost to storing this information and making it available. There always has been. But it should be a reasonable cost. There’s no point storing petabytes of raw data. They aren’t suggesting NextGen image files be kept, just the sequence data.

    There are also several initiatives that address the problem of data storage. The protein mass spectrometry community generates a lot of data and they are working to make it available: http://massive.ucsd.edu/ProteoSAFe/static/massive.jsp

    Many labs store data, but in a haphazard manner. Perhaps we should be taking this more seriously? I do agree that the type of data needs to be matched to the level of utility. But I’ve seen plenty of retractions due to inability to provide primary data upon request, some of which are likely due to poor organization rather than intent to deceive.

  18. drugmonkey Says:

    This has much more the feeling of someone with a narrow view of what ‘data’ constitutes, who wants to be able to take your data, click ‘Run’ in their favorite stats package, and walk away feeling good that because they got the same number, all is right in the world and we are safe from the evils of scientific fraud.

    Oh nonsense. This has the smell of people who think they are going to be able to root out (or at least prevent) some specific type of fraud.

    I predict that what will be much, much more frequent than finding actual fraud is finding corners and alleys over which to kvetch over legitimate decision making about how to properly analyze and present data. The vast majority of the results will be “See! The data *don’t* actually support the authors’ claims!111!!!!!”.

  19. drugmonkey Says:

    Methinks you doth protest too much drugmonkeyblog

    Are you accusing me of faking data Jim?


  20. Never! But I do think you (and others) are taking an extreme position unnecessarily. If the PLoS policy is taken literally, then it is impractical and will result in the implosion of the journal. Is that likely to be the outcome? No. They consulted and I kinda assume they are reasonable people trying to find a way to increase reproducibility. If someone cannot repeat something, it hurts science. They may be incompetent but that would be revealed by others being successful. Seems to me that it would be far more constructive to work with PLoS to improve understanding/practicality than to simply dismiss the effort to increase transparency/reproducibility.

  21. drugmonkey Says:

    I see very little way around this other than to require a lot of time and effort that we do not expend at present. In the current climate of grant funding, well, this is not going to go over well. It further isolates PLoS as the province of the New World Order Waccaloons who wish to bother with such orthodoxy.

    I think this works against the goal that I entirely support, i.e. the PLoS ONE mission to divorce manuscript acceptance from any sort of evaluation of “importance”.

  22. Joe Says:

    Is PLoS making enough money that they can get away with this kind of thing? I’ve got plenty of choices for places to publish. Why would I choose to publish somewhere that makes me jump through so many hoops?


  23. Joe, want to make a bet that our funders will make this a requirement in the next couple of years (along with zero consideration of the cost)? That’s why dismissing it is not the way forward. It is well intentioned but needs to be practical.

  24. rxnm Says:

    “If the PLoS policy is taken literally, then it is impractical and will result in the implosion of the journal.”

    How should we take it then? And why on earth should I bother with PLOS, given the ambiguity and internal inconsistency of this policy?

    It demands that your data allow someone to “reanalyze it” but also “spreadsheets of measurements” are ok? In the context of behavioral coding, image analysis, etc, this makes no sense at all. If you want a list of values that I averaged and calculated the standard error for, fine, but who fuckin cares? That’s not “reanalysis,” that’s “checking the math.” Clearly their ambition here (and their curation and DOI assignment policies) is something more.
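rxnm’s distinction between “reanalysis” and “checking the math” can be made concrete with a toy calculation. If all a lab deposits is the spreadsheet of per-subject values behind a figure, the only thing anyone can do with it is recompute the summary statistics. A minimal sketch, with entirely invented numbers and a hypothetical `mean_and_sem` helper:

```python
# A "spreadsheet of measurements" as deposited data: all it supports is
# recomputing the published summary statistics -- "checking the math" --
# not reanalyzing the underlying behavioral sessions.
import math

def mean_and_sem(values):
    """Recompute the mean and standard error of the mean from raw values."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    # Sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd / math.sqrt(n)

# Hypothetical per-subject hop distances (cm) as they might appear in a figure.
hops = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2]
m, sem = mean_and_sem(hops)
print(f"mean = {m:.2f}, SEM = {sem:.2f}")
```

Getting the same mean and SEM back out of such a file says nothing about how sessions were coded, which trials were dropped, or how the measurements were derived in the first place.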

    I think the people most upset are those, like DM, who in principle agree with PLOS’s mission (I am on the P1 AE board). This policy is a shit show in a minefield. It’s so vague, how would you ever even know if you were in compliance with it? The editors sure as shit aren’t going to check… do we ask peer reviewers to?

  25. ferniglab Says:

    Data without metadata are meaningless. The funders require, at least in the UK, that data be maintained for at least 5 years. This means accessible and usable by the community.

    You try using publicly available microarray experiments to do a meta-analysis to see if a hypothesis is worth pursuing and you often (but not always, to give credit to the few doing their work properly) hit a wall – the metadata are poor and the microarray data belong in the bin.

    Same for mass spec data. There is a growing movement to re-analyse spectra for post-translational modifications. Where are those data? Often hidden in labs, on servers about to die. Universities are happy to collect the grant income and the kudos of papers, but are often less willing to invest some of the overhead into proper data curation.

    Some communities have long-established standards, e.g., crystallography and gene sequencing. Others have more complex data, and standards for storage, etc., are only beginning to be established (e.g., OMERO for imaging). So this is a good move by PLoS and will stimulate the community to come up with agreed standards, which, as ever, will evolve over time.


  26. @rxnm There is clear intent in the policy – that’s how you should take it – best intent. My lab spends a heck of a lot of time, money and energy shipping mice around the world (sometimes more than once to the same group because they lost the strain the first time). It’s the price of science. Clearly, some types of dataset are very difficult to provide in a meaningful way. Do you think maybe that PLOS will recognize that? There’s a smell test for reasonable accommodation which we all know how to administer but elaborating on every exception or detail would create an iTunes-like terms of service. Yes, there are few details and some significant concerns, but these will be worked out. As they said in the initial stages (of consultation), they have an implicit policy of access to data. This makes it explicit. The policy perhaps should say “every reasonable effort” (to avoid people dodging it simply for laziness).

    Put another way, how do you think it looks when a scientist says s/he’s not going to make the data upon which the results in the paper were based available to other scientists? Defensive? Dodgy? Holding back?

  27. Zxh Says:

    Other journals are available.

  28. Bill Says:

    But I’d bet there’s more than one lab where the PI literally could not work their way forward from the raw data collection files to a simple thing like “lever presses”.

    And this is a good thing? Also, if the PI can’t parse the data she probably didn’t write the paper either, so it’ll be the postdoc doing all the work as usual — in this case, data deposition work.

    then what is the point? Why not just take it on trust that I will cough up the raw data on request?

    Because once you’ve coughed it up into a repository, it’s there for anyone and (LOCKSS willing) forever. There’s no “oh sorry, my email was down” then “I was on holiday” then “oh sorry was that the wrong disc” and so on. I assume YOU are genuinely happy to cough up data and details but a great many labs aren’t. Delaying provision and handing over unreadable crap and outright refusing to share data are a far cry from falsifying that data in the first place; many labs who wouldn’t dream of the latter have an unwritten but clear policy of the former. PLOS is simply saying that they won’t stand for that crap any longer.

    I keep hearing a lot of special pleading for behavioural shite, to which I say: (a) every field is a special snowflake with special snowflake data; and (b) what Jim said (2:30 pm above). Follow the spirit of the policy and there’s no need to nail down every last letter.

    I won’t claim that every lab who refuses to deposit data is making shitte uppe, but I will say outright that I personally treat their work as suspect. You’ve railed against the cheatfucks; you’ve played #OverlyHonestMethods; you know what goes on. If Bob Chickenshit won’t share the data underlying his latest groundbreaking claims — why the fuck not?

  29. bj Says:

    Gazzaniga tried to do this with a journal 10+ years ago, requiring that brain imaging data be made available in order to be published in the journal. The idea imploded in about a microsecond, and I think they made the requirement voluntary, and about 2 people deposited their data.

    A main objection then was that the data sets were deep, and that folks hoped to get multiple papers out of them, and that small labs would be disadvantaged since they didn’t have the manpower to completely analyze their data before their first paper was published. I didn’t think that was a very ethical reason, but it was a pretty practical one, since no one gets very much credit for collecting the data, and the currency is the publications.

    I don’t actually think that data would be, or is, very useful now, because there are too many quirks in it that require too much documentation, and people would be prone to making bad analyses by assuming that the complexity of behavioral data can be completely characterized and stored in the data set. But I still think that central repositories would be a good thing. I just wouldn’t believe reanalysis of that data without further exploration.


  30. It’s the principle. PLOS isn’t going to self-destruct by enforcing something unenforceable. The definition of a minimal dataset is operational, and we’ll learn what is acceptable. It shouldn’t be a case of dumping data – it should be a case of providing sufficient data and annotation to assist replication.

    Liked by 1 person


  31. “Oh nonsense. This has the smell of people who think they are going to be able to root out (or at least prevent) some specific type of fraud.

    I predict that what will be much, much more frequent than finding actual fraud is finding corners and alleys over which to kvetch over legitimate decision making about how to properly analyze and present data. The vast majority of the results will be “See! The data *don’t* actually support the authors’ claims!111!!!!!”.”

    Whether you think it’s people looking for quick and dirty replication, or hunting for a particular breed of “fraud”, I think you’d agree that either is more likely than a sinister cabal of caffeine-addled theoreticians after your precious data and/or body fluids?

    That was what the comment I was replying to suggested.

    Like

  32. Dr Becca Says:

    Even before all of this business, my interest in publishing in PloSONE was minimal. I think that’s now dipped into negative values.

    Like

  33. dr_mho Says:

    so…i run a physiology lab… we collect multiple trials of data for single neurons. some trials are noted in the lab book to delete (e.g., idiot walked in and turned the lights on, skip this trial). then, these are averaged to create a mean response for that cell… then specific parameters (e.g., peak amplitude) are derived from the average traces and stats are done across the population for these values and published…
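    just to make concrete what i mean by “derived values” — a toy sketch of the trials → average → peak pipeline above (invented names and made-up numbers, nothing like our actual acquisition code):

```python
# hypothetical sketch: per-cell trials -> drop flagged trials ->
# pointwise mean trace -> peak amplitude (the value that enters the stats)

def mean_trace(trials, flagged):
    """average the kept trials pointwise; 'flagged' holds indices of
    trials noted in the lab book to delete (e.g. lights turned on)."""
    kept = [t for i, t in enumerate(trials) if i not in flagged]
    return [sum(samples) / len(kept) for samples in zip(*kept)]

def peak_amplitude(trace):
    """derived parameter taken from the averaged trace."""
    return max(trace)

trials = [
    [0.0, 1.0, 3.0, 1.0],  # trial 0
    [0.0, 2.0, 5.0, 1.0],  # trial 1
    [9.0, 9.0, 9.0, 9.0],  # trial 2 -- idiot turned the lights on
]
avg = mean_trace(trials, flagged={2})  # [0.0, 1.5, 4.0, 1.0]
peak = peak_amplitude(avg)             # 4.0
```

    so even this toy version has four candidate “datasets” one could be told to archive: the raw trials, the kept trials, the averaged trace, and the peak values…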

    so…what would i be expected to publicly archive, the original traces in proprietary acquisition software format (which are only useful with detailed copies of our lab notebooks)?…only the trials that go into the averages?…the averaged traces?… the amplitude values that went into the stats?

    this can also become gigabytes of data, so who is going to pay for the online storage? Dryad only allows 10GB for their low-end cost… why should I be obligated to pay more?

    taxpayers “pay” us to produce findings, not individual data points.

    Liked by 2 people

  34. drugmonkey Says:

    All of the traces! And how can you prove someone walked in and it had any effect huh? Huh? WHAT ARE YOU HIDING??!!!??!!

    Like

  35. dr_mho Says:

    lmao, exactly…

    Like

  36. drugmonkey Says:

    Seriously, you have to wonder what sort of science these people do…

    Like


  37. I think one point drugmonkey is making is valid above all others: not even people in our own fields will be able to just take raw data and make sense of it, because some fields don’t have standardized tests. Drugmonkey didn’t get into details, but I think many scientists live in a different science world than him or me (and our worlds are also quite different).

    As a neuroethologist who works on the neuronal basis of natural behaviors, all my experiments are non-standard, and the setup, data acquisition and data analysis are customized from scratch.

    I published a study in PLoSone based on the trajectory and orientation of some birds’ heads in flight in 3D. Each trajectory was reconstructed from 2D high speed videos. Since I did everything ‘manually’ I have the raw 2D movies on file.

    Am I supposed to share these videos? You think my fellow scientists will be able to just easily use those?

    We had to track the heads of birds in all 200 high speed videos – manually. Then we had different people track the same video to figure out which frequencies in the trajectories were artifacts caused by the tracking method itself, and smoothed those out. Only then did we transform the 2D tracking information into 3D data, and to do that I needed transformation matrices to account for optical distortion due to perspective for each movie.

    Am I supposed to share only these trajectory data? Off-the-shelf high speed 3D motion trackers give you only the final trajectories and you have no chance to recreate the data to this point from the ‘raw data’. These are still ‘raw data’!

    So, now I have the ‘read out’ but I haven’t even started analyzing it. Trajectories were transformed to motion vectors, the angle between two points on the head was calculated to get the head orientation, and based on that, head turn velocities were analyzed until we found whether and how the data were quantifiable. THEN we were able to quantify and compare different groups.
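    To illustrate how derived these measures are, here is a toy sketch (hypothetical names and made-up coordinates, far simpler than the real pipeline) of just the step from tracked head points to head turn velocities:

```python
import math

def head_yaw(beak, nape):
    """horizontal head direction (radians) from two tracked 3D points,
    projected onto the x-y plane."""
    dx, dy = beak[0] - nape[0], beak[1] - nape[1]
    return math.atan2(dy, dx)

def turn_velocities(beaks, napes, fps):
    """frame-to-frame head turn velocity in radians per second."""
    yaws = [head_yaw(b, n) for b, n in zip(beaks, napes)]
    return [(b - a) * fps for a, b in zip(yaws, yaws[1:])]

# two consecutive frames of a 500 fps high speed video (made-up numbers)
beaks = [(1.0, 0.0, 0.5), (1.0, 0.1, 0.5)]
napes = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.5)]
v = turn_velocities(beaks, napes, fps=500)  # one velocity, ~49.8 rad/s
```

    And even this leaves out everything that comes before it: the manual tracking, the artifact smoothing, and the perspective correction.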

    Am I supposed to share these final trajectory data? These are the ones I used to get to my conclusion… but they are, as one says, ‘highly derived’ measures. On the other hand, studies using standardized methods just jump in at this point – AFTER the quantification – lever presses, choices made and other binary stuff.

    The point is: I can give you my raw data. We collected that in three days. No problem. But will anybody be able to make sense of it? No.

    It took a year to get from there to the result and the raw data are still pretty raw after a lot of manipulation. What is the point where we say ‘okay, that’s the raw data you need to provide’?

    I would have to spend an IMMENSE amount of time to make my raw stuff comprehensible even to colleagues closest to my work.

    Like


  38. @dr_mho: And that’s the second part of my work. At least we all know what a spike is.

    Like


  39. The obvious answer, I think, about what the data policy obligates you to do is to do something you think is reasonable; try and imagine what kinds of analysis might be feasible and interesting to others and make a reasonable effort to facilitate that interest. Provide data in a way that achieves what you think is a balance between what is easy for you to offer and what is accessible to others.

    I’ve had a request for data that went into a figure denied before and I’ve had other requests ignored. Publishing at least some of the data with the paper is a good insurance policy against PIs who won’t share after all or don’t check their email or get hit by buses or don’t know what happened to the DVD-R with their postdoc’s files or didn’t back up their hard drives well enough.

    Drugmonkey expresses confidence that “psycho waccaloons” will demand that all published bunny-hopping experiments be conducted according to some Protocol 27B/6 to facilitate reanalysis. I agree, I suppose, that the success of the policy will depend on how the editorial office and the community respond. I have at least some faith that researchers will use their judgement, compromises will be struck, and that eventually consensus will start to emerge about what enforcement of the policy should mean for each field. I may be badly misguided! I guess we’ll find out.

    I’m not sure I see that cost is a strong argument, except for truly frightening data sets; there are public repositories like figshare that offer free unlimited public data storage, additional storage on Dryad is inexpensive, and everyone offers fee waivers.

    Liked by 1 person

  40. Ian Dworkin Says:

    I am really not sure why this is even an issue. I posted about it, instead of trying to put it in the comments here.
    http://genesgonewild.blogspot.com/2014/02/why-would-any-scientists-fuss-over.html

    Like

  41. bsci Says:

    As Bashir hints at, this policy also shows no awareness of human subjects protections. You cannot share data unless people give consent, and it’s not always easy to get every internal review board to allow data sharing for every study. Even then, a minimal data set, by the definition of PLoS, might include data that are too specific to be sharable. Also, even when human data sharing is allowed, there might be data use restrictions on with whom and in which contexts it can be shared. Any such restrictions seem to be forbidden by the PLoS policy.

    Also, BJ is referring to Gazzaniga’s efforts to require fMRI data sharing for papers accepted in the Journal of Cognitive Neuroscience. While there was some unwillingness to share, the bigger issue was simple practicality. It’s one thing to share data, but putting one’s data into a universally readable and understandable template is a massive amount of work. There is a good review on the history of the fMRIDC and how the next generation has been doing better: Van Horn, J.D., Gazzaniga, M.S., 2012. Why share data? Lessons learned from the fMRIDC. NeuroImage.
    While the mechanics of fMRI datasharing have gotten better, that article shows how much work & mistakes went into getting the field to that point. This is an important effort, but a broad, blanket policy from a journal for all types of data is still quixotic.

    Liked by 1 person


  42. @bsci: It does say ‘as far as legally possible’, no?

    Like


  43. […] data relevant to every paper must be accessible in a stable repository, with a DOI and everything. Some discussion of this is going on over at Drugmonkey, and this is a comment that got out of hand, so I posted it here […]

    Like


  44. Let me just say at the outset I completely agree with you about the animal use and self-plagiarism issues. I’m not sure where this is coming from, but I’ll do my best to stamp it out. I will, however, say that our editors are all PhD or MD scientists, so this is not coming from “the humanities”. This doesn’t make these actions any less ill-advised, of course.

    Now, for the data policy. This is a useful discussion. Obviously, we want to get this right, so all of your comments are helpful. But I want to try to explain where this is coming from.

    First off. This is not some new thing. PLOS – and virtually every other journal and funding agency – have long had data policies that say, more or less, “you have to make your data available at the time of publication”. But these policies are very poorly spelled out and, as a consequence, essentially toothless.

    Over the past few years PLOS has had several instances where authors have – either actively or passively – refused to make available data that most reasonable people would consider to be essential to interpreting the results of their experiments. In reviewing these cases, we realized that our data policy was poorly spelled out, and some of our editors began to try to change this.

    I don’t completely agree with everything that’s said in the policy, but I think it’s being unfairly criticized here.

    1) This is not a “blanket” policy that demands specific actions on the part of authors. The heart of the policy is to have a data access section in the paper to explain to readers and reviewers what data is being made available and how it can be accessed. I don’t think anyone can complain about trying to clarify what is often a very murky area in most papers.

    2) The question of what constitutes acceptable data sharing is more complicated, and this is something where PLOS will have to be careful not to be too heavy-handed. It is very much in keeping with our overall mission to encourage widespread and efficient sharing of data, and not just the text of papers. But, obviously, we have to work with authors and others to make sure this doesn’t become too burdensome or intrusive. This is obviously something that requires judgment on the part of editors who know their fields, and it will probably take some time and lots of discussion to arrive at the right balance.

    3) Some of the statements here about the policy not considering things like human subjects are simply false. The statement explicitly deals with human subjects in a way that is in line with the policies laid out by the NIH and others.

    I hope you all can agree with PLOS’s goals here, which are to reiterate the importance of making data, and not just text, freely available, and to clarify for both authors and readers what this means. I know there are disagreements here – e.g. between me and drugmonkey – about what it means to share the data from an experiment. I take an expansive view and he has a more narrow one. And there are clearly field-specific differences that need to be dealt with in a non heavy-handed manner. But I want to emphasize again that this policy is not trying to offer a blanket way to treat all experiments – rather it is primarily aiming for clarity. And I hope we can all work together to get this right as we move forward.

    Liked by 2 people

  45. rxnm Says:

    Mike, I think what is unclear is “authors must make all data publicly available, without restriction, immediately upon publication of the article.”

    To me, and I think most in my field, “data” means something specific: physiological and behavioral recordings, 4D datasets, etc. These are huge files, and as noted, the idea of making them “stand alone” in some kind of DOI-able repository, with sufficient metadata and annotation, is crazy.

    Elsewhere, the policy says “spreadsheets of original measurements,” which makes it sound like we DON’T have to make “all data publicly available” but just some mid-level analysis… because to anyone who thinks or cares about this, once it’s numbers in a spreadsheet, you’ve already done the important part of the analysis.

    So the policy seems internally conflicted. What is the point? What would someone even do with your already-analyzed neural recordings or behavior videos or fluorescence intensity readings?

    For this to be remotely workable, this policy needs to lay out clearly EXACTLY what is expected for different types of experimental data. It CANNOT say “all data” when it means “quantitation of your data that isn’t explicitly in figures.”

    Which brings me to: who is going to oversee this or enforce this? AEs? Reviewers? PLOS staff? My guess is “no one,” so you better set up some kind of call center to deal with the deluge of crazy emails from stats scolds and nitpickers.

    My full thoughts are here. http://rxnm.wordpress.com/2014/02/25/fan-fiction/

    Like


  46. As a community, Evolutionary Biology moved to such a data sharing approach a few years ago, with almost all of the journals in the field having a largely similar data sharing policy (usually using DRYAD). I think most have found the burden of getting data into the repositories low, and the benefits potentially very high.

    I really do not understand why you think this will be a burden for your behavioural data? What kind of data is it?

    Like

  47. drdrA Says:

    “Over the past few years PLOS has had several instances where authors have – either actively or passively – refused to make available data that most reasonable people would consider to be essential to interpreting the results of their experiments.”

    What happens in these cases- do you make them retract the paper if they don’t cough up the original data?

    My understanding at the society journal for which I am an editor is that authors must make original data available if the editor asks for it. In my (admittedly) limited experience, the editor asks for the data if someone raises a red flag about the paper (i.e. reviewer or editor notes something off about the figures, seen the exact same figure published in another paper, etc). Not that she couldn’t ask for it for other reasons- but I’ve not seen that happen.

    I get that the intent of the PLoS policy is different (or is it?) – more expansive than rooting out fraud…and I’m generally on board with sharing our data (after publication) with anyone who asks for it. But I also know it is never as easy as that makes it sound.

    Like

  48. jojo Says:

    Can someone please explain how PLoS’s new policy differs from, say, the similar NIH policy that data funded by NIH grants must be made publicly available?

    As a genomics person it seemed like nothing new… We already do all of that as a matter of course.

    Like

  49. bsci Says:

    Michael, the human subjects clauses in the data use policy make no sense to me. As far as I can tell, the options are to make deidentified human data completely accessible (very often legally problematic) or to have some complex set of intermediary people designated to ensure data access. To quote, “Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.” There is no awareness of fairly common data use agreements which allow access for specified purposes only and don’t require some complex structure of intermediaries. For example, the PLoS policy, as written, wouldn’t let researchers agree to something like http://ndar.nih.gov/ndarpublicweb/Documents/NDAR%20Data%20Access%20Request%20DUC%20FINAL.pdf before obtaining access directly from the experimenter. This is, obviously, correctable, but shows a lack of awareness of the issues.

    Like

  50. rxnm Says:

    Jojo, as far as I know the NIH has no such policy for “data.” It has it for papers (the results of analyzing data) and for some very limited kinds of data (e.g. sequences).

    Like


  51. I’m not a big fan of accusations of “self-plagiarism” except in extreme cases of trying to publish the same paper twice — as DM says, there are real reasons why you might want to use the same Methods section over again. However, I just don’t get the worry over Data Access. I thought that is how scientists behave *now*. In genomics, it’s expected that all sequences discussed are deposited in GenBank (or similar). Otherwise what’s the point of the paper? It’s actually frightening that people in other fields apparently don’t get the need for data to be available.

    Liked by 1 person

  52. Gina D. Says:

    In genomics, we’ve been depositing our data for the last decade.

    I believe this habit began when people started looking at early microarrays. These were often printed within labs with pins that broke. When people started looking at the arrays, they found that a lot of the published data came from microarrays that had smudges on them that were corrupting the data. This led to the MIAME standards forcing public deposit of raw data, the development of methods for quality assessment, and the culture of making raw data freely available, which has carried through to sequencing.

    Depositing data takes about a day, for an experiment that may have run for years, and it pretty much guarantees you’ll garner citations from people who recycle their data.

    The benefits of this approach are so obvious to people in our field that it actually never occurred to me that this would be controversial in other fields, especially drug discovery.

    And, frankly, I’m looking a little askance at my prescription bottle.

    Liked by 3 people

  53. AcademicLurker Says:

    In genomics, it’s expected that all sequences discussed are deposited in GenBank

    Sequences have a standardized and agreed-upon format, and have for many years. Likewise, repositories such as GenBank have been around for a while. Also, sequencing (and structure solving) have the property that many other investigators are going to use those sequences and structures directly in their own work, in a way that no one is going to use, say, some raw patch clamp trace from PhysioProff’s lab directly in their own work.

    Even in your case, are you really making “all the data” available? What if people wanted all of the raw reads from your sequencing machines? The sequence you deposit in GenBank is as much a result as it is data.

    Like


  54. @rxnm
    What would people do with your already analyzed data? Seriously? Whenever I see a genome paper of an organism I’m interested in, the *first* thing I do is download the data and see if certain genes/pathways that I’m also interested in are there. Yes, the analysis of the authors might be interesting to a degree, but they might not be focused on the same topics I’m interested in. The raw data is far more interesting. Data can be reused infinitely. I don’t see why that would be any different in behavior research, other than to facilitate the Marc Hausers of this world.

    Liked by 1 person


  55. @AcademicLurker
    Actually, these days with the Short Read Archive, the raw reads are *exactly* what is expected to be deposited.

    Like

  56. drugmonkey Says:

    I thought that is how scientists behave *now*. In genomics, it’s expected that all sequences discussed are deposited in GenBank (or similar). Otherwise what’s the point of the paper? It’s actually frightening that people in other fields apparently don’t get the need for data to be available.

    It boggles my mind how (apparently) ignorant many scientists are of the diversity of science methods, data, etc.

    Like

  57. AcademicLurker Says:

    Actually, these days with the Short Read Archive, the raw reads are *exactly* what is expected to be deposited.

    Fair enough, but the SRA is maintained by NCBI, presumably at a fairly non-negligible cost. Is PLOS going to set up and maintain such an archive? At least the SRA deals with only one type of data (sequencing), whereas the kind of database PLOS seems to be thinking of would have to host a huge array of different types.

    In any case, even with only sequencing data to deal with, here’s a thread full of people describing their frustrations with the SRA:

    http://phylogenomics.blogspot.com/2011/02/though-i-generally-love-ncbi.html

    Like


  58. For those who complain that standard data formats don’t exist in their field, well, they didn’t in genomics once upon a time either. They weren’t handed to us on stone tablets from on high — we created them. Instead of complaining, create a standard and use it. In the beginning there will be multiple formats, and it will be a pain, but eventually one or two will win out and everyone will benefit.

    Liked by 1 person

  59. rxnm Says:

    “I don’t see why that would be any different in behavior research other than facilitate the Marc Hauser’s of this world.”

    Great… let us know when you’ve come up with a cost-effective and standardized way of hosting, curating, and annotating ~1TB/paper of behavior videos of monkeys, rats, mice, fish, flies, and worms. Or ephys traces from all known neuronal types, with metadata that would allow experimental replication.

    Sequences exist *in nature* as semi-stable biological entities. Thus it is useful to continually and iteratively refer to them and compare them. The raw physiological events that happened in 10 or 20 hippocampal neurons in a particular mouse a year ago are not like this. If you want to test the conclusions of the paper, you get some more mice. If you want to compare their experimental manipulation to yours, you have to do the controls in your lab while you are collecting your data. Published recordings done on different equipment by different people in a different mouse colony are NEVER an adequate control or useful comparison in physiology.

    DNA sequence is an incredibly simple form of information. Behavior and physiology measurements are not.

    I am happy to share, I am not happy to spend the time and money to figure out how to host TBs of data in compliance with this policy.

    Like

  60. jojo Says:

    @rxnm

    It was my impression that data underlying published results funded by NIH have to be made freely available in a timely fashion, and the plan for sharing this data has to be outlined when the grant is submitted. I’ve also written data sharing plans for NSF grants. Here’s some text from NIH’s data sharing policy.

    “In NIH’s view, all data should be considered for data sharing. Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data. To facilitate data sharing, investigators submitting a research application requesting $500,000 or more of direct costs in any single year to NIH on or after October 1, 2003 are expected to include a plan for sharing final research data for research purposes, or state why data sharing is not possible.”

    It just doesn’t sound that different from what PLoS is proposing.

    @AcademicLurker

    Yes, it’s expected that *at the very least* you make available the raw reads (either through your own website or SRA). It’s standard practice to also make available the major assemblies created with the reads (in whatever form is appropriate – genotypes/de novo assemblies/reference-based assemblies), but you can usually get away with just the raw reads.

    @drugmonkey

    Well, a lot of my colleagues in organismal biology put their field and behavior data into Dryad. See also Ian’s comment above.

    Like


  61. This post is wrong on a number of fronts.

    It very likely won’t kill PLoS to have a data sharing policy. We (Molecular Ecology) have operated a draconian data sharing policy for the last three years, and we’re still getting the same number and quality of submissions as we did before we introduced the policy. The papers with data are more useful for the community too, because the results can be verified by readers and the data used to test new ideas. If you don’t want to share, that’s fine, but don’t expect to be able to publish in a journal with high standards.

    “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time, to address the legitimate sins of the few.” This is nonsense. Scientists should have the data they used for their paper already sitting in a well organised folder anyway, and it’s hardly a big deal to move it to a folder on a public database. Sure, scientists that have data scattered all over the place will suffer a bit, but this is a good thing: data management is a core skill for scientists and the PLoS policy will reinforce this.

    “This Data Access policy requires much additional data curation which will take time.” This is partly true – making sure that authors have archived all of their data can be time consuming. There are a few workflow changes that can save a lot of time: a) the data statement has to be included with the original submission of the paper; b) the reviewers are asked to comment on whether it lists all the data; and c) the authors are asked to add any missing datasets. This ensures that people with expertise are making judgments on what gets archived.

    “This diversity in data handling results, inevitably, in attempts for data orthodoxy” We don’t see this problem at all. Authors just archive the raw data in the format they’ve saved it in, and hopefully include the input files that they used in their analyses. The only orthodoxy is in favour of sharing all the data – if in doubt, provide the raw data and the downstream dataset that went into the analyses.

    “I grasp that actual fraud and misleading presentation of data happens. But I also recognize, as the waccaloons do not, that there is a LOT of legitimate difference of opinion on data handling, even within a very old and well established methodological tradition.” OK, but I don’t get why this is an argument against data sharing. You share the data in the format you used it in, and I’ll share it in mine.

    “I grasp that actual fraud and misleading presentation of data happens.” What you probably haven’t seen is the ridiculously high proportion of papers (~25%) where the data that gets shared doesn’t match the data in the article. This sloppiness needs to be dragged into the light and science made to clean up its act.

    Lastly, if you’re confident that you’ve done a good job of data collection and analysis, then why be afraid of sharing the data? Refusal to share just suggests that you’re not actually sure about the results in your paper.

    Liked by 3 people


  62. […] covers very different kinds of things, and it isn’t clear which must be accessible and which not. DrugMonkey thinks PLoS is overdoing it. A bit, yes, and it seems to me it doesn’t solve the most serious problem: […]

    Like


  63. The Hauser case is interesting because in that case the “raw data” can refer to two things: (1) the videos of the macaques looking at things (hundreds of hours of high resolution videotape, needs to be high res to see the gaze direction) and (2) excel spreadsheets annotating where those macaques looked. The accusation against Hauser was that the data transcription process was corrupted, so in detecting fraud, the excel spreadsheets would be useless. The raw videos would be needed, although that would take dozens of man-hours and some formal training to analyze.

    The distinction between (1) and (2) is exceedingly important because the two different data formats have entirely different uses for the community. The problem is that PLoS appears not to have thought much about the fact that “raw data” can mean very different things and can encompass (1), (2), or both.

    Like

  64. rxnm Says:

    @jojo I think “available on request” would meet the NIH policy, particularly for large/cumbersome data sets with no standard formats or repository. It would not fit the PLOS policy.

    Liked by 1 person

  65. jojo Says:

    @rxnm

    Ok, but in that case doesn’t that mean that if supported by the NIH, you would still have to store your raw data locally and indefinitely so that it could be made available upon request? That seems at least as cumbersome as hosting or uploading the data somewhere.

    Like


  66. re: “Well, a lot of my colleagues in organismal biology put their field and behavior data into Dryad.”

    “A lot” might be a large value in absolute terms, if you have a huge number of colleagues. But most people don’t. Just go through most published papers that have public data. It’s still a rarity, outside the very few journals which require it (like Ecol Monog.)

    Like


  67. Terry,

    I would hardly call this “very few journals”
    http://datadryad.org/pages/integratedJournals

    Liked by 1 person

  68. Dave Says:

    Can’t publish a genomics paper without a GEO submission number these days. It’s a formal requirement for most good journals in this area. It works for NGS data sets because the raw data format and analytical pipeline are fairly standardized. I can’t see this model working well for other types of data, however, and others have provided ample examples of these above.

    It is a fairly significant task to prepare large NGS data sets for a GEO submission, but one positive is that if your data is high quality, other labs will download it and use it, and this helps beef up citations and, of course, one’s reputation. There have also been several “meta-analyses” of NGS data set quality on the GEO site that have been very useful to the field.

    Look, these policies are designed with the Western blot fraudster in mind. They want to catch image fraud, primarily, and want to avoid being a headline on Retraction Watch.

    Like

  69. drugmonkey Says:

    you would still have to store your raw data locally and indefinitely so that it could be made available upon request?

    What? No statute of limitations?

    Like

  70. drugmonkey Says:

    Sure, scientists that have data scattered all over the place will suffer a bit, but this is a good thing: data management is a core skill for scientists

    Your assertion does not make any of this so. And if scientists are managing their data in a way that is efficient and sufficient for normal, current publication standards, then they are expressing the core skill. Your pique that they are not doing it the same way that you manage data is of minimal import.

    Liked by 1 person


  71. @drugmonkey It would be great if all scientists were “managing their data in a way that is efficient and sufficient for normal, current publication standards”. Like I said, in my experience ~25% of the datasets you either find in online repositories or obtain from the author don’t match the data in the paper. That’s not meeting current publication standards.

    Like

  72. anon Says:

    LOL. This data-sharing stuff would have worked okay in my previous lab. A single experiment in my current lab can generate a 100GB file. Who exactly will host that? The metadata is smaller but is all processed with custom in-house software, and is in its own file format. While the software is freely available, it’s not easy to learn to use. Good luck hosting the raw data or trying to do anything meaningful with the metadata. What a waste of time.


  73. tomjpollard Says:

    @drugmonkey

    “Maybe we are interested in sub-analysis of a data set that nobody else worries about.”

    And what if you share the data? Maybe someone else might do a sub-analysis that your lab wasn’t worried about.


  74. bsci Says:

    It’s a bit amusing to hear all the genome-type people here saying automated data sharing is easy yet refusing to listen to the people explaining why other types of data aren’t as easy to share. Sequence data also benefits from the fact that most people are going to request the same thing. This isn’t true in other areas.

    For example, given the NIH rules on data sharing by request, every so often I have someone ask for some raw data from one of my papers. I ask specifically what they’re looking for and send it to them with comments they can understand. It is much easier, more useful, and quicker to give people the specific data they need on request than to publicly bank data in a way that tries to account for everything anyone might ever ask for.

    There are huge efforts to build larger-scale data sharing systems in my field and I’m very supportive of those efforts, but they are still in adolescence and will take a good few more years before they are universally practical and useful.


  75. jipkin Says:

    “It is much easier, more useful, and quicker to give people the specific data they need by request than to publicly bank data in a way that tries to account for everything anyone might ever ask for.”

    This only works if data is maintained properly in-house. What happens when students/postdocs leave and the PI doesn’t know exactly what data to send, or where it is, or how it was processed? Or if the hard drive failed?

    There must be some happy medium between relying on highly variable, everyone-does-their-own-thing data policies and some form of standardization with regard to long-term storage and sharing. (Of course, there would certainly be individual solutions for each type of data.)

    As datasets get larger and larger, I wouldn’t be surprised to see the things people are flipping the table over here (OMG 1 TB of physiology data!!1!) become non-issues as everyone starts backing stuff up in the cloud anyway.


  76. bsci Says:

    @jipkin, I absolutely agree that it’s important to have good management of data in-house (including archival backups in case of drive failure) and every lab falls short on this to some extent. For better or worse, the person who best understands a data set is the person who collected/analyzed it. Good data management prevents too much knowledge loss when a person leaves, but there will always be some loss.
    As for the size of data, I’ve generally found that the amount of data I collect has increased quite consistently with the amount of storage that’s practical. If I’m complaining about 1TB today, I’ll probably be complaining about 5TB in a decade, and the only difference between local vs cloud is who owns the petabyte drives and keeps them secure.



  77. @bsci “Sequence data also benefits from the fact that most people are going to request the same thing. This isn’t true in other areas”

    It wasn’t true in sequence either before people made it available. Who knows what people could do if they had the data available? I’m currently reading Hallam Stevens’s history of bioinformatics (“Life Out of Sequence”) and was struck by how much resistance people like Margaret Dayhoff got when she created the first sequence database (the Atlas of Protein Sequence and Structure) in 1965. People just didn’t get it. They thought it was a stupid case of unoriginal stamp collecting — why should I care about sequences from other groups? Of course Dayhoff had an idea from the start — she was interested in protein families and evolution — but it took a long time for others to get it. And sadly she eventually lost funding, and the modern GenBank actually descends from another project at Los Alamos.

    Liked by 1 person


  78. Ian: Your link didn’t say how many of those journals required data archiving, just the number affiliated with Dryad.



  79. Moreover, I realize that the statement wasn’t about the number of journals that support archiving, but what proportion of colleagues archive their data with Dryad. Those would be interesting data. I would imagine that most papers published nowadays do not have the data archived in Dryad.


  80. drugmonkey Says:

    tjp-
    The point was about someone not ever thinking to archive, or maybe even collect, the data in a way that facilitated another lab’s interests. And now they would be forced to collect and/or analyze by some external orthodoxy or be accused of hiding something.


  81. tomjpollard Says:

    @drugmonkey

    Ah, I see. Nobody is forcing you to collect metadata that isn’t relevant to your research, but if software and data manipulation methods have the potential to affect results, then this information absolutely should be captured.

    From your article and comments it seems to me that your paranoia of being “accused of hiding something” or having methodology being questioned is blinkering you from the positives of data sharing.

    It’s hardly new for a journal to stipulate that data needs to be made available to reviewers on request, so taking this one step further shouldn’t be such a radical change. It’ll be a learning experience for everyone, but in the long run I really believe it’s a good thing.

    Kneejerk responses to data sharing initiatives have caused a lot of damage in the UK, particularly in healthcare. My PhD is based on open clinical data, so I’m enormously grateful for the people who have shared.

    Tom



  82. Terry

    I think the first column (integrated journals) serves as a useful proxy for journals with an archiving policy. No doubt an imperfect correlation, though. I am not sure if there is a list available detailing which journals have what data archiving policy. I took a very quick look but did not find one. Maybe someone who works at Dryad or DataONE might know?
    While I have read the explicit data archiving policies for journals I publish in or read, I certainly have not read them all.

    I also have no way of answering your second question. Each journal has different policies, and probably verifies data submission to different degrees. If the question is “what proportion of papers published in journal X have appropriate data packages?”, I bet that can be answered.

    I am not sure how useful this will be (I just saw it this morning, and have not had a chance to really read it).

    http://figshare.com/articles/A_review_of_journal_policies_for_sharing_research_data_across_disciplines/945108

    I am also going to respond to your other point on rxnm’s blog. Then I have to get back to working on a talk for monday!



  83. Terry,

    Sorry, I meant Erin’s blog!


  84. bsci Says:

    @Jonathan Badger, I understand that the current state of sequence sharing doesn’t represent earlier years in the field, but that type of data sharing matured over 30-40 years. There were wrong turns and wasted efforts. There were also successful analyses of shared datasets that convinced people what was possible and what types of standardization were needed to benefit the most people.

    My own area of research, fMRI, is only about 15 years into figuring out data sharing. There is a lot of enthusiasm (and there are skeptics) and some notable success stories. The biggest successes are in resting-state fMRI, where you don’t need to share much information about what was happening during the scan (no stimulus presentation, behavioral responses, etc.). This type of sharing will keep getting better, but even for the people who support it, it’s nowhere near the point where researchers can default to sharing data. This is just one of many research areas in which PLoS is trying to mandate open data sharing immediately.

    Liked by 2 people

  85. Bill Says:

    From your article and comments it seems to me that your paranoia of being “accused of hiding something” or having methodology being questioned is blinkering you from the positives of data sharing.

    This is what keeps coming to my mind. Further: if your data and methodology are so complex that you are worried about the ease of bad-faith nitpicking, or you think it will be a lot of work to share, then you probably should do that work precisely because the data might not mean what you think they mean. Maybe you got something wrong in all that filming and tracking and smoothing and transforming and vectoring; maybe sharing it will help you see that, or allow someone else to see it.


  86. drugmonkey Says:

    Do you really not grasp that every time one has a new trainee, one goes through some measure of this process? And yes, the data mean what I think they mean. Yes, a lot of stuff is not obvious to newcomers to the field who have come to me to learn something about it. Not all of our approaches are obvious for a trainee coming from a peer lab that just plain does some things differently. So we go through it. Sometimes I adopt a new way of doing things based on the trainees’ input.

    And that’s the good faith nitpicking with someone smart, has at least related skills/training and who is highly motivated to understand.

    This very thread demonstrates how little motivation or capability y’all OPEN EVERYTHING waccaloon acolytes have to understand any way of doing science that is outside of your narrow little experiences. Imagine how much fun the topic denialists will be. Ever seen my drug threads? familiar with ARA nutters and their attempts to prove animal research “flawed”? familiar with climate science? Republican attacks on sex-worker and HIV and drug and fruitfly research that they hate? Anti-vaccaloons?

    Do you have ANY imagination at all?

    maybe sharing it will help you see that, or allow someone else to see it.

    Yep. Which is why I am in full support of providing the data upon credible request. This is, so far as I am aware, the current operating standard for most journals. It has worked fine for me when I have wanted to see someone’s data. The only time I’ve refused was when someone wanted to do some Mt Dew style modeler wankery over a VERY large set of essentially unannotated files (due entirely to the limitations of a set of old and primitive software packages) and it was going to take something serious in terms of my time to satisfy. Like several weeks of dedicated work. And I didn’t refuse so much as say “when you want to write a grant that provides the salary support to do this work, let me know”. Naturally I didn’t hear back from the leech, which allowed me to draw some conclusions about how serious this person was.


  87. Bill Says:

    little motivation or capability […] to understand any way of doing science that is outside of your narrow little experiences

    You keep saying this; people keep providing you with counter-examples; and you continually return to talking about YOUR teeny little bunny patch. It’s not your interlocutors who fail the imagination test here.

    yes, the data mean what I think they mean

    It is inconceivable to you that you could be wrong? Really?


  88. drugmonkey Says:

    you seem to miss the burden of proof and directionality of the argument here Bill. there is no “counter-example”. There is my example of why this is a stupid and unnecessary burden. Any pet counter example of yours doesn’t make this not so. nor does your flapping on about how I “should” collect data in a particular way change a damn thing.

    I totally understand the upside of the PLOSFAIL policy and indeed I think that it’s great. If you want to deposit your data in an open format, you go right the heck ahead. I’ll pat you on the back. Great job. Your benefit tells us exactly nothing about the burden and benefit for other people. Only they can do that. You are just deaf to it, is all. Theology is like that sometimes.

    It is inconceivable to you that you could be wrong? Really?

    Nope. I just usually am not when it comes to the mechanics of how we should approach our data. I’ve been at this for a while, you know. But we test that every time we’re going through datasets in the lab and some trainee has a brilliant new insight or whatever. Also when they or I am crafting a paper and completing analyses that we think make sense for some reason or other.

    It’s called science. It works.



  89. It’s called science. It works.

    This is what I am continually befuddled by. Where is the evidence that everything is horribly “broken” in the way science is done? Yes, there are issues with the financial resources devoted to science and how they are allocated, but I just don’t get where the idea comes from that all kinds of shit in science is “broken” and needs to be “fixed”. Best that I can tell, the scientific enterprise is continuing to do a pretty goddamn motherfucken good job of producing a massive stream of fascinating (and relevant to human goals) discoveries on a daily basis.

    Liked by 2 people

  90. tomjpollard Says:

    @ComraddePhysioProffe
    @drugmonkey

    “It’s called science. It works.”

    Bit more complicated than that though, isn’t it?

    Some people would say there are issues to address:
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/


  91. tomjpollard Says:

    PS. @drugmonkey

    Isn’t there a level of abstraction between the raw data and the paper that would be useful for your own records, as well as for others?

    It would be interesting to see a breakdown of the discipline-specific problems if you have time to blog them, and the discussion would be helpful for everyone.


  92. drugmonkey Says:

    Useful? Presumably if there is something that a lab thinks makes sense to expend its valuable effort on, they are already doing it. You seem to be saying labs should do more data curating to satisfy the desires or demands of some


  93. drugmonkey Says:

    other parties. And then somehow such labs are supposed to miraculously find that it is all to their good. Despite the fact that they are telling you at the outset it is a waste of time.

    Whither comes this arrogance that you know what is good for every other person’s lab operations?


  94. drugmonkey Says:

    “A breakdown”? Yeah, let me get right on that… Oh wait, I did. I have mentioned several examples from my perspective and many others have as well. Here in this thread, on the Twitts and on other blogs.

    But what? You are too lazy to curate the examples into some homogeneous list that makes sense *to you*? And apparently unwilling to let the natural process of reading comments evolve your thinking unless you get a “breakdown” created by someone else’s effort?

    I think I am seeing a pattern here.



  95. […] journal PLOS ONE is now requiring public access to raw datasets for all of the papers they publish. Not everyone is pleased. There are some really interesting points in this post, and some contentious comments that you […]



  96. […] policy that appears to require sharing “raw” data. This has made some people very angry, while others are all in favor. But regardless, that data archiving mandate won’t be a […]


  97. tomjpollard Says:

    It’s very noisy data, but I’ll try to pull out the details.


  98. drugmonkey Says:

    Noisy data?

    Please tell me you recognize the obvious parallels here?


  99. BioDataSci Says:

    Perhaps the concept of a “data publication” (e.g., in the journal Scientific Data), which looks likely to become more prominent in the near future, will serve as an incentivized mechanism for researchers to put in the extra effort to standardize and publish their data sets in parallel with their papers. But this depends on how much university administrators and funding bodies will value such publications.


  100. tomjpollard Says:

    I get the parallels and I’m not deaf to your views, even if you think I am. I’m also genuinely interested in understanding how data sharing differs across disciplines. There are certainly unsolved issues and I appreciate that data management can be time consuming.

    Where we differ, I think, is that (a) I believe there is greater value than you do in sharing data and (b) I know the PLOS team understand the environment and recognise there are challenges to overcome.

    So, for example, if I approach the PLOS editors with my new dataset, explaining that it can’t be completely opened up because it contains too many profanities for public consumption, as well as words like ‘PhysioWhimple’ that they won’t understand, then I believe that they’ll help to find a solution.

    That’s why I think #PLOSfail people are making too much of a fuss. Let’s try it and see what happens. If it doesn’t work then I’ll happily(ish) sit back and watch people rant.


  101. Bill Says:

    Eh, fukkit. All you Special Snowflakes, just don’t publish in PLOS. If your data needs to be manually picked over and massaged into shape to such an extent that you don’t dare share it, I doubt it actually means much of anything anyway.

    As many have pointed out, PLOS ONE in particular is a victim of its own success; perhaps it would be good for all concerned if a couple prolific Special Snowflake fields just went elsewhere en masse.


  102. drugmonkey Says:

    Let’s try it and see what happens.

    You are not listening to the observation that this takes a lot of time and effort. Again the fundamental problem here is that you wish to impose your whims on all of science (that wants to submit to PLoS, of course). Just for funsies. And to test your pet assertion/hypothesis.

    I know the PLOS team understand the environment and recognise there are challenges to overcome.

    That is not my understanding at all. I see no evidence so far that they grasp the points being raised in opposition to this. They are in rebut and deny mode, similar to your stance.

    then I believe that they’ll help to find a solution.

    To: “Dudes, that is a pain in the ass for no possible gain. I will provide my data on request just like I promise for every other damn journal”.

    Really? What is going to be their solution to this?


  103. drugmonkey Says:

    just don’t publish in PLOS

    Exactly. And thus dies the momentum PLoS ONE was building to normalize their highly meritorious central mission of divorcing publication from the a priori prediction of importance and impact.

    Some of us call that cutting off the nose to spite the face. Alternately, “stepping on one’s own junk”.

    Liked by 1 person

  104. tomjpollard Says:

    “Dudes, I will provide my data on request just like I promise for every other damn journal” … What is going to be their solution to this?

    I imagine they’ll point you to the following section of the policy, which maybe got lost in the noise:

    *********
    http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/

    Acceptable data-sharing methods:

    Data made available to all interested researchers upon request.

    Data Availability Statement must specify “Data available on request” and identify the group to which requests should be submitted (e.g., a named data access committee or named ethics committee). The reasons for restrictions on public data deposition must also be specified. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.
    *********

    Tom


  105. Bill Says:

    And thus dies the momentum PLoS ONE was building

    Only if, as you continue to assert, some large fraction of scientists are Special Snowflakes. I predict you won’t be missed.

    You also continue to state that data curation for deposit takes a lot of resources. I don’t get this. You are storing and analysing the data — why are you not already capturing that workflow for your own internal QA/QC, or even troubleshooting? How do you know that there wasn’t some hiccup in all that film viewing and manual tagging and algorithmic smoothing and vector transforming and so on?


  106. drugmonkey Says:

    identify the group to which requests should be submitted (e.g., a named data access committee or named ethics committee). The reasons for restrictions on public data deposition must also be specified. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

    That is precisely the part that draws my greatest ire. This is a burdensome change. Said committees must be convened. Then they must “ensure” data access. How are they to do that, save by demanding the authors do data deposition with them? It’s the same thing, not an out-clause.


  107. drugmonkey Says:

    I don’t get this.

    Because you are not listening. Rather, because you refuse to get this.

    Who the fuck are you to tell me or anyone what is “burdensome” in my research?

    Especially after you have received many examples and not just from me in this discussion. And yet you continue to say “nu-uh, I know better and I know you are just being lazy”.

    This turns you from being a merely blinded and deluded true belieber acolyte into an actively denialist asshole.


  108. Bill Says:

    Asshole?

    You keep accusing me of the very thing that you are doing. If you’d been even trying to hear me, you’d have noticed that I keep ASKING FUCKING QUESTIONS (that you don’t answer). I do that because I’m not trying to be the boss of you, I’m trying to understand your point of view.

    Well, I was until it got personal, then I lost interest. It’s not as though the rat-diddling that you call science is worth archiving anyway.


  109. DJMH Says:

    It’s the science equivalent of an unfunded mandate. I would have to pay for archiving in any reliable form, spend time pulling together the raw data, which in fact is annoying because I would have to sift through which recordings or images were used in the final analysis, and also figure out if the directive included things like archiving the exact code I used for analysis. If all science wants to get together and agree this is worthwhile, then all journals will demand it and I will comply. Until then it’s a reason to avoid PLoS.

    Liked by 1 person

  110. tomjpollard Says:

    “Until then it’s a reason to avoid PLoS. ”

    Plenty of other journals have robust data policies, so you’ll need to watch out for those too.

    e.g. from Nature:

    http://www.nature.com/authors/policies/availability.html
    ****
    Availability of data and materials

    An inherent principle of publication is that others should be able to replicate and build upon the authors’ published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript, including details of how readers can obtain materials and information. If materials are to be distributed by a for-profit company, this must be stated in the paper.

    Supporting data must be made available to editors and peer-reviewers at the time of submission for the purposes of evaluating the manuscript. Peer-reviewers may be asked to comment on the terms of access to materials, methods and/or data sets; Nature journals reserve the right to refuse publication in cases where authors do not provide adequate assurances that they can comply with the journal’s requirements for sharing materials.
    ****

    Tom


  111. drugmonkey Says:

    That is still on-request. It is not a universal mandate for third-party guarantee. The latter is what I am objecting to.


  112. jipkin Says:

    DM, I wonder what your success rate is in convincing your opponents to abandon their positions and see the merit in yours (or even in swaying the fence-sitters watching from the sidelines). You do have quite the style hahaha


  113. drugmonkey Says:

    Why do you think convincing my opponents is the goal here?


  114. jipkin Says:

    I don’t think you’re trying to convince bill et al. of anything (or ethan in a twitter argument) or you probably wouldn’t call them assholes so much haha. I hope you’re not trying to convince fence-sitters either, because your approach appears nonproductive and offputting, at least to this fence-sitter.

    So I reckon you’re not consciously trying to convince anyone, eh? I dunno. I gather you enjoy vigorously arguing about things you care about, and I suspect it wouldn’t feel right to you to argue in any other way.


  115. Chris Says:

    ‘There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. ‘

    What utter reactionary bollocks. And complete fiction — find me one legit person that ever said that.

    If you could put as much effort into thinking about sharing as you did into constructing this moany teenage diatribe then we’ll be fine.



  116. […] up and take notice because I started seeing posts and tweets strongly criticising the policy. The first to appear was an angry and (in my opinion) over-the-top post by @DrugMonkeyblog entitled “PLoS is […]



  117. […] of sudden, I’m less excited about submitting to this journal. I’m not the only one to feel this way, you […]



  118. […] researchers that want to analyze someone else’s data “leeches.” Drugmoney posted several objections to the new […]


  119. drugmonkey Says:

    I am first and foremost about the fullest possible airing of positions. My style does just fine at that.


  120. anonymous postdoc Says:

    Re: Chris, March 3, 2:13am

    I can find you such a lab, that would make such a claim, about a number of behavioral tasks. I can find you many such labs. The closest to me that would make these specific claims about fear conditioning is about 35 feet away. There’s another one across the street. There’s another one a quarter mile from here. That is at one institution.

    Behavioral paradigms aren’t standardized in the way that nucleotides or amino acids are: there’s no one up-or-down “answer” to the assay. People tend to recognize that one kind of measure tends to be reported for a given behavioral task, but only the inexperienced or the stupid would claim that this measure reflects the totality of information the animal is communicating during this task. Is freezing during the first minute or the last minute of the 5 minute test more meaningful? Depends on the species being tested, the shock paradigm, the sex of the animal, whether the animals perceive the chamber as “different enough” from a previous context…etc. Maybe the animals aren’t freezing as much, but are running around like crazy trying to escape the chamber (depends on the experimenter sometimes), and their fear is instead communicated in the number of fecal boli they are crapping out. Or maybe the animal is asleep. Perhaps an endocrine measure or heartrate should be considered, to get an autonomic readout of the animal’s psychological state. And that’s just fucking fear conditioning, where everyone mostly just measures freezing and calls it a day.

    Not only would it be monstrously difficult to report all these variables, I don’t even know if I have thought of them all. The sum total of information you may be able to learn from an animal in a given behavioral task depends at least as much on the creativity and observational skill of the scientist as it does on using some standard measure. This is not something that would be useful to report, because many of these variables are currently subjective or anecdotal. The sheer number of independent and dependent variables that could be in play is enormous.

    In conclusion, Chris, and everyone else saying PLOS is being reasonable, you are either inexperienced or stupid, and either way you can go fuck yourselves.


  121. tomjpollard Says:

    @anonymous postdoc

    Whatever observations you make, whether they are hard or soft, will be noted down, right? That’s your data.


  122. chrisftaylor Says:

    Okay that’s all lovely but irrelevant to my point. Who of those drafting guidelines has claimed that the way you research will be controlled? I’ve read a lot of reporting guidelines and none* do anything but say ‘if you did this, tell us about that’. * There’s one exception (protein function evaluation) where they argued for community-standard assays, but that just led to a fight.

    Furthermore, (1) if it’s too hard, you won’t have to do it as long as that is the consensus of your peers (common sense x2) and (2) ‘all’ (which as you say is impossible to define) or nothing is a ridiculous position to adopt.

    Also, if you could give any evidence of any kind of inexorable march of metadata specifications (the only ones enforced are for clinical trial reporting and the like, and MIAME for microarrays — that’s it, with iirc one niche exception). Given that the PLoS guidelines are largely about whether to report, and where to shove it, NOT about metadata (you do see the distinction?), then I suspect the bulk of your post is moot anyway.

    Also I know I said bollocks — my bad for posting from the hip — but can we tone this down a bit?


  123. jipkin Says:

    Well at the risk of sounding inexperienced and/or stupid and inviting a mandatory sentence of self-fucking, here’s an argument for one level of open-data:

    Forget for a second about sharing the “raw” data and consider only the data points that go into a figure (so freezing time, not what went into calculating it). Let’s say I don’t like the way they made their figure. Say I think bar charts are stupid or something (I do) and I want to see their data as a box and whiskers to see how well it matches mine. Or say they normalized their data in a new way and I want to see what happens when you do it other ways. The kind of data I’m thinking about is literally the highest level you get to in processing before you do your normalizations and averages and make the figure. Kilobyte file sizes, and minimal curation needs.

    What’s wrong with that? Is it just “checking the math”? I suppose… but so? Sharing this kind of data isn’t a huge burden so who cares if people want to check the math. Maybe sometimes math needs to be checked. (Reinhardt-Rogoff, e.g.). Especially if you disagree with the way they normalized something, or don’t understand exactly how the way they calculate an index works. (This happens quite a bit when I read outside my little niche, and I can only imagine it helping my understanding of the paper if I could play with their data briefly myself. And if I find the paper easier to understand I’m more likely to talk about it / share it and increase its impact…)

    Well what if people go through the data and publish commentaries saying you did everything wrong and this is really the way it should be done? Okay… ? That’s just supposed to be how science works. If you did do something wrong, you’re corrected. If they’re wrong, you correct them. That’s a normal thing.

    Why can’t we just have interested people email for it? I’d rather upload the file once than have to respond to fifty emails for the data myself. Plus, you’ll reach more people by eliminating the extra step.

    To summarize:

    Benefits of high-level open-data sharing:

    -Allows readers from outside the field to develop their own understanding of how things like normalization and index generation work in this particular case, thus broadening the impact of the work.
    -Allows readers to spot simple errors in math, and correct them.
    -Allows readers who don’t like how the data are presented to make their own figure and compare it to their own data, increasing their understanding and broadening the impact of the work.
    -Allows for some methodological critique and back-and-forth, which is already part of the scientific enterprise.

    Costs:
    -You have to upload the .xlsx or .csv you already made when preparing the data.

  124. drugmonkey Says:

    Who of those drafting guidelines has claimed that the way you research will be controlled?

    This is my assertion about how this is going to go down. Inevitably, given the goals.

    you won’t have to do it as long as that is the consensus of your peers

    There is nothing in the #PLoSFail policy that supports this. There is no “consensus of peers”, there is a policy mandate for all of PLoS.

    ‘all’ (which as you say is impossible to define) or nothing is a ridiculous position to adopt

    At the moment, this is the only fair read of their policy. Until and unless they see fit to define their terms a bit better.

    Given that the PLoS guidelines are largely about whether to report, and where to shove it, NOT about metadata (you do see the distinction?)

    Did you even read the policy?

    PLOS defines the “minimal dataset” to consist of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety.

    Who cares what it is “largely” about, as if we could know that. It is clearly about this definition of the necessary data and this is what I am discussing as the problem. This is what triggered me to think about how it could and would be applied to my work. The question is not their intent but how the language they use will most likely be applied.

    but can we tone this down a bit?

    Tone what down? and why?

  125. drugmonkey Says:

    Costs:
    -You have to upload the .xlsx or .csv you already made when preparing the data.

    that is far too limited to be consistent with the “with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety” clause of the policy.

  126. jipkin Says:

    I know – I’m not talking about PLoS, just (a form) of open data in general.

  127. Dave Says:

    Is this a comment record, DM?

  128. rxnm Says:

    jipkin, I have no problem with what you describe… though I don’t necessarily buy that it is all that useful. I do think there could be higher standards for data presentation, and that authors should be expected to share on request.

    There are bad things about the social side of science where you establish relationships for sharing and cooperating, because that is where a lot of exclusion and snobbery can take place. But there is a lot of good about it, too, in building community and opportunities for interaction and discussion. I don’t think it’s ok for me to give a flat “no” to a data request, but I think it’s ok for me to know who has my primary data and why they wanted it. That is a cultural norm in my discipline, and I don’t see any great harm and I see some benefits. I know these norms vary among disciplines. Not sure why that’s a problem.

    Either way, the intent and scope of the PLOS policy is clearly broader than what you describe (though the policy is far from clear in intent or scope and getting murkier).

  129. jipkin Says:

    I believe Dave is referring to the number of comments.

    @rxnm, fair enough. “Open data lite” is at minimum a low cost low benefit option, and in some cases, a low cost medium benefit, and in rare cases (where the author made a mistake and the reviewers missed it) a low cost high benefit option. Or that’s how I imagine it. Practice, of course, could differ.

    I think if I were in PLoS’s shoes, I would consult with people in the field to determine on a case-by-case basis what constitutes a reasonably useful amount of data to share. This would include a discussion of what people in the field find useful, as well as what should be included so that people outside the field can make sense of the data.

    There are certainly costs to sharing all the data, down to the raw files, in time and money. It should be noted though that there are benefits. For instance, DM has made the point that no one, given his raw data files, could make sense of them (go forward from them to produce the same result) without him taking a lot of time to generate some hand-holdy instructions for processing it. That does seem a pain to do – but isn’t there some value in a reader from outside the field being able to understand exactly how you get from the raw to the polished?

    I don’t know if that value necessarily exceeds the time and money cost it would take to put in (hence I’m on the fence when it comes to deep data-sharing). But I do think it is at least a worthwhile goal to present our science in a way that the highest number of readers possible can understand and appreciate deeply, and the current “here’s the figure, now try to guess from the one paragraph in the methods section how we got here” is often just not enough.


  130. […] the biggest practical problem with PLOS’ policy is that it puts an additional time and effort burden on already time-short, over-burdened researchers. I think I say this in nearly every post I write […]

  131. halcyon Says:

    I think this is great. I’ll upload ALL of my source data to PLoS. Then I can stop paying money to dropbox.

  132. TeaHag Says:

    I know that I’m late to this party but I was nearly driven out of my tiny mind by PLoS One a couple of years ago. I had a dataset derived from microarray analysis using a custom chip. Within this set I had identified differentially expressed genes and these were subsequently validated via qPCR. Bacteria: never hard to get statistical significance. The manuscript went to PLoS One for review without the upload to an online public database, a step that was only required for publication. Outcome of the review was generally positive with only minimal changes required.

    BUT-

    A problem arose when I tried to submit my data to GEO: they needed the gene identifiers on the custom array at upload. However, lawyers at my collaborator’s institution were in the process of patenting elements of the custom chip (not directly associated with my study)! I had closed the circle: they wouldn’t release the identifiers, and GEO couldn’t accept the data without them. My list of identifiers was specific to my species of interest but covered only ~1/6 of the genomes represented on the chip. In the end, I had to describe all of the microarray information as data not shown, with some tawdry, uninformative plot as supplemental data. The core data directly relevant to my research stayed in, thanks to the independent validation. Useful for me, less useful for anyone who might have had their own favorite loci to investigate.

    In my world, getting genome-wide datasets for niche organisms is both challenging and quite expensive, and we aren’t well funded to begin with. I’m in sympathy with both the journal’s views and those of my research community. However, based on the delay, the fuss, the endless emails seeking authorizations that weren’t ever coming… I’m not submitting to PLoS One again. Writing and being reviewed are stress enough…


  133. […] outrage and/or support for the policy. The nays have posts like DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the […]


  134. […] I’m on the fence regarding the calls for open data* (most recently by PLoSOne), this point by DrugMonkey is something that’s always bothered me (boldface […]


  135. […] and Wirong about the large amount of work and dedication it needed to create them coincided with a cascade of blog posts on data sharing, triggered by a change of PLOS’s data sharing policy (required […]


  136. […] PLOS tightening their data availability policies (making them closer to ours) there has been some debate and push-back recently on the feasibility of enforcing this, so do you have anything to say on this issue? You […]

