Doing Research Without Doing Research

Posted on April 3, 2015 by Marko Dimjašević

One thing that has baffled me ever since I started my PhD is the way many people do research, and that the rest of the research community seems to be fine with it. In particular, I’m referring to computer science (although this observation easily applies to other domains), where one can publish a research paper without publishing all the information about one’s experiments, namely the research artifacts.

Most of the time it is easy to release all the information, be it by posting data, an experiment program’s source code and instructions online, by providing virtual machines for download, etc. Yet doing research while doing research is optional: one can, for example, choose not to release the source code of the program used to empirically confirm one’s theoretical results. This removes the science from such research work, because the reader of the paper can only trust the paper’s authors. My hunch is that science shouldn’t be about taking someone’s word. Yes, experimental setups are usually described to some extent in papers, but due to page limits it is not possible to include everything. Moreover, sometimes the program is the result of 10 or more years of software engineering, and expecting the reader to reconstruct it to confirm the results in the paper is ridiculous. Anyone who has worked with software even for a little while is aware that things can go wrong in so many ways in software development, and this is especially the case with research software. Because of that, the only honest thing to do is to publish all such research artifacts. Therefore, not publishing the artifacts should serve as an alarm to the research community that the paper’s authors have something to hide.

I am far from being the first to notice this absurdity in publishing research. For example, Dijkstra, in his EWD 443 (published in 1974), writes about Mathematics Inc., a fictional company holding over 75% of the world’s theorem market, of which he was serving as chairman of the board. The company proved the Riemann Hypothesis, but the proof was a trade secret. Just as it is absurd to claim to have proved something without showing the proof, it is absurd to claim to have done some research without showing the steps that confirm it; in computer science, that means not releasing software source code or data.

That being said, I just got a paper accepted to the International Symposium on Software Testing and Analysis 2015, and NASA, where I was an intern while working on the paper, wouldn’t release the software we wrote and analyzed for the paper. Furthermore, it is not always the source code that is not released; sometimes it is the experiment data. For example, I recently submitted another paper with a few collaborators that describes our work on detecting malicious applications for Android. We analyzed thousands of applications from Google Play and from a malicious application data set we obtained from other researchers. The applications are the input data to our experiments, yet we cannot release them. The reason is all-rights-reserved copyright law, which is in stark conflict with science. In particular, it is perfectly legal to download free-of-charge applications from Google Play, but by law we cannot distribute them along with our paper under submission because we are not the copyright holders of the applications. Note that there is no easy way for anyone to obtain the same thousands of applications from Google Play. Similarly, the malicious application data set is not publicly available. The researchers who claim to have the sole right to distribute the data set are themselves violating copyright law, because they are distributing other people’s applications without their permission (note that most of the time the authors/copyright holders of malicious applications are not known, hence one can’t know whom to ask for permission to distribute their work).

Computer science is not the only research area suffering from these absurdities. Other areas are negatively impacted as well, especially those where commercial interests are high. For instance, the pharmaceutical industry conducts a lot of drug research, yet it withholds much of the related information because releasing it would hurt the industry’s profit margins. Why are such actions even allowed in research? Past generations of researchers allowed it, and the current generations keep allowing it by keeping their mouths shut.

Until researchers stand up to copyright law, patent law, other similarly restrictive laws, their causes, and the notion that science is optional while doing research, we will keep on watching the farce.

Image: “Expectations”, courtesy of Don Graham