Experience with Artifact Evaluation

Posted on May 23, 2015 by Marko Dimjašević

For the CAV 2015 conference I was on the Artifact Evaluation Committee, which evaluated digital artifacts of accepted papers whose authors decided to take part in this optional artifact evaluation step. Based on my experience in evaluating a few artifacts, reading reviews of most of the submitted artifacts, and my interest in reproducible research, I provide some thoughts on how to do artifact evaluation in the future and how to make research artifacts useful; I try to order them by importance, with the most important points coming first.

My motivation for this post is to point out several absurdities in the computer science community, to suggest how to deal with them, to foster a discussion around these absurdities and related issues, and, furthermore, to help make the research done in the field useful to everyone, including the very researchers conducting it.

Digital artifact submission has to be mandatory, not optional

As things stand now, authors of an accepted paper can, at some computer science conferences, optionally submit digital artifacts related to the paper. CAV 2015 is the first edition of the conference to have artifact evaluation; many other conferences still don’t have it at all. Given the complexity of the software systems and other kinds of artifacts behind the research presented in submitted papers, the omnipresence of cheap computers and digital communication, and the ease of sharing research with little more than Ctrl+C and Ctrl+V, it is ghastly that submitting a paper’s artifacts is discretionary.

As the Artifact Evaluation for Software Conferences website, which informed CAV’s edition of artifact evaluation, points out:

Not examining artifacts enables everything from mere sloppiness to, in extreme cases, dishonesty. More subtly, it also imposes a subtle penalty on people who take the trouble to vigorously implement and test their ideas.

Making artifact submission voluntary leaves a lot of room for not carrying out research vigilantly. As soon as submitting comprehensive artifacts is mandatory, the incentive for sloppiness and dishonesty is far lower. Yes, making artifact submission compulsory might mean fewer paper submissions, but that doesn’t mean we should lower the bar and let all kinds of bad practices take place. On the contrary, we should aim for honest and insightful papers backed by digital artifact evidence.

The same website states the following:

Industrial researchers (but not only them) might not be willing to share their artifacts due to various proprietary considerations.

What these industrial researchers (but not only them) are saying is that they want recognition of their work from the public (a trusted artifact evaluation committee), namely people outside their organization, yet they do not want to give that public the artifacts which are crucial for deciding whether the work deserves the recognition (granted in the form of a paper publication). A paper and its artifact go hand in hand, one supporting the other. Without the artifact available for scrutiny, one often cannot say for sure whether the statements made in the paper are true. Hopefully we won’t next see industrial researchers (but not only them) asking for recognition while demanding that the paper itself be withheld from the public “due to various proprietary considerations”. If researchers do not want the public to benefit from their work, then they shouldn’t be submitting it for publication to a research venue outside their organization.

The artifacts ought to be free

If the public cannot inspect, modify, build upon, and share the artifacts, they are not useful to the public. Given the pitiful state of the law today, researchers have to be pro-active and attach legal licenses to their artifacts; otherwise the artifacts stay proprietary by default (unless the researchers’ work falls under a special law, as in the US for works by federal government employees). If the artifact is software, there is a long practice of licensing it under a free software license. If the artifact consists of data, the data should be licensed under a free culture license.
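To make this concrete, here is a minimal sketch of how a software artifact might state its license explicitly at the top of a source file, in addition to shipping a top-level LICENSE or COPYING file. The project name, copyright holder, and the choice of the GPL are hypothetical; any free software license would serve the same purpose.

    # SPDX-License-Identifier: GPL-3.0-or-later
    # Copyright (C) 2015  Example Tool Authors
    #
    # example-tool is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    """Entry point of the hypothetical example-tool, with its license stated
    explicitly so that nobody has to guess whether the artifact may be
    inspected, modified, built upon, and shared."""


    def main():
        print("example-tool 0.1, licensed under GPL-3.0-or-later")


    if __name__ == "__main__":
        main()

A one-line statement like this, repeated consistently across the sources and the data files, removes any ambiguity about what the public is allowed to do with the artifact.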

Furthermore, it goes against the spirit of science to build on top of proprietary software, which happens often in the computer science community and, unfortunately, in many other communities as well. Again, if there is a dependency such as a proprietary operating system, a proprietary numerical computing environment, or any other proprietary software, others cannot freely use the dependency, study it, modify it, improve it, build on it, and share new research based on it.

At least one artifact submitted to CAV 2015 was bizarre in not even providing an executable of the tool. Instead, the authors provided a link to a website and expected reviewers to analyze the tool’s results over the web. The setup blew up in the authors’ faces when network problems prevented the reviewers from analyzing the tool at all. It is no surprise that things like this happen when researchers go to such lengths to hide their artifacts behind web servers rather than make them useful to others.

Paper results need to be easily repeatable by reviewers

This year CAV’s Artifact Evaluation Committee chairs did a good job of providing a common system for everyone to base their artifacts on: a VirtualBox virtual machine (VM) with Ubuntu GNU/Linux installed, which paper authors were to customize and configure so that their tool could be executed in it. The chairs also asked everyone to place their artifacts in a designated location in the VM, and most of the authors did so; for those who didn’t, reviewers had to search for the artifacts, some even finding missing parts in the Trash folder. The VM had a single user log-in, making it straightforward to start analyzing multiple artifacts once the VM boots. Almost all authors who submitted artifacts did well, and it was reasonably easy to run their tools and inspect the data.

The first thing the paper authors have to do is install and configure all of their tool’s dependencies in the VM, and then do the same for the tool itself. As a matter of fact, this was optional: it sufficed to submit the tool as an archive and ask reviewers to configure the VM according to the accompanying documentation. Nevertheless, this puts a burden on the reviewers, leaving less time for actually scrutinizing the tool and the whole artifact. Some authors failed to properly document the dependencies of their tools, leaving the reviewers in the dark, guessing what to install in order to get the tools running.

The next key component is well-written documentation. The documentation instructs the reviewer on how to execute the tool and where to find the data; it should clearly explain how to reproduce the paper’s results. If there is randomness in the tool and/or the data, the paper authors should fix the random seed in order to enable repeatability, and the documentation should emphasize this. Furthermore, it should be clear how the obtained/generated/observed data present in the artifact maps to the figures and tables in the paper. Using consistent naming in both the paper and the artifact helps a lot. If the data is extensive, the artifact should include the paper’s figures and tables and provide a relatively easy way of comparing them to the output of the tool, ideally giving an analog of the diff command.
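To illustrate, here is a minimal Python sketch of the two conveniences just mentioned: a fixed random seed and a diff-like comparison of rerun results against the numbers reported in the paper. The tool, the CSV file name and layout, and the tolerance are hypothetical placeholders, not anything taken from a particular CAV artifact.

    import csv
    import random

    SEED = 20150523  # hypothetical seed the paper's experiments were run with


    def run_experiment():
        """Stand-in for the actual tool; fixing the seed makes reruns repeatable."""
        random.seed(SEED)
        return {"benchmark-%d" % i: random.uniform(0, 10) for i in range(5)}


    def load_paper_table(path):
        """Load a paper table stored as 'name,value' CSV rows."""
        with open(path) as f:
            return {row[0]: float(row[1]) for row in csv.reader(f)}


    def diff_results(rerun, reported, tolerance=0.05):
        """Print the rows where the rerun deviates from the paper by more than 5%."""
        for name, value in sorted(reported.items()):
            measured = rerun.get(name)
            if measured is None:
                print("%s: missing from the rerun" % name)
            elif abs(measured - value) > tolerance * abs(value):
                print("%s: paper reports %.2f, rerun gives %.2f" % (name, value, measured))


    if __name__ == "__main__":
        diff_results(run_experiment(), load_paper_table("table2_paper.csv"))

Even something this small, shipped with the artifact and mentioned in the documentation, saves the reviewer from eyeballing numbers against a PDF.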

At least one artifact submitted to CAV 2015 provided only a subset of the data reported in the paper, stating that the rest of the data was not available due to confidentiality agreements. Such actions leave reviewers helpless, as they cannot repeat the paper’s results. To expect the artifact (and even the paper) to be accepted in such a case is like expecting a paper to be accepted on the basis of its abstract alone.

Artifacts should be reusable

If an artifact is able to repeat paper results, that is already a great improvement over the current state of affairs. It means the paper makes consistent claims and is a likely candidate for getting published.

However, paper publication shouldn’t be the goal. The goal should be to make the research available to others. Having the paper available to the public, e.g. by putting it online, is paramount. Furthermore, in order for all the engineering effort and diligent work put into making the paper’s results repeatable to be useful beyond paper publishing, the artifact should be reusable. This includes several aspects.

One aspect is to document well how to use the artifact on inputs other than those used for the paper. The artifact should work on such inputs; otherwise it serves only the purpose of getting the paper accepted.

If the artifact is to be used outside the VM it was put in for artifact evaluation, it should be well documented in terms of how to obtain, install, and configure it. The baseline is plain English instructions, but installation and configuration scripts automating the task would be even better. If this task isn’t automated, it likely means the authors themselves will not be able to install the artifact in a clean environment if needed, especially 6 months or a year down the road. What would be nice to see in subsequent editions of computer science conferences, perhaps alongside a VM setup like the one for CAV 2015, is an option of using an automated virtualization environment such as Vagrant. This would enable repeatable, reusable, and reproducible research by merely pushing a button. Also, from such an environment comprising the artifact it would be clear how to install the artifact on a non-virtualized machine.
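As a rough sketch of what such automation could look like, here is a small Python installation script of the kind that could be run by hand on a fresh Ubuntu machine or invoked from a Vagrant provisioning step. The package names, repository URL, and build commands are hypothetical placeholders for a concrete tool; a shell script would serve equally well.

    #!/usr/bin/env python
    """Install and configure a hypothetical artifact on a fresh Ubuntu machine.
    Can be run by hand or called from a Vagrant provisioning step."""

    import subprocess

    APT_PACKAGES = ["build-essential", "git"]          # hypothetical dependencies
    REPO = "https://example.org/example-tool.git"      # hypothetical repository
    CHECKOUT = "/home/ubuntu/example-tool"             # hypothetical install location


    def run(cmd):
        """Echo a command and run it, stopping at the first failure."""
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)


    def main():
        run(["sudo", "apt-get", "update"])
        run(["sudo", "apt-get", "install", "-y"] + APT_PACKAGES)
        run(["git", "clone", REPO, CHECKOUT])
        run(["make", "-C", CHECKOUT])                  # build the tool in place
        run([CHECKOUT + "/example-tool", "--version"]) # smoke test the build


    if __name__ == "__main__":
        main()

A script like this doubles as executable documentation: it states the dependencies precisely and proves that the installation steps actually work on a clean system.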

From my personal experience and that of my student peers, I’ve realized that students go to an internship, do research there in their ivory tower, create artifacts, and maybe get a paper out of it. However, the very artifact (tool) they created often gets archived and is not used at all by the organization where they did the internship, let alone by the public. It is clear that additional engineering effort is needed to integrate such artifacts into the organization’s tool-flow, and creating an automated virtualization environment as described above would go a long way toward reducing that effort.

Well written source code facilitates further research and the review process

If the artifact includes a tool, it is helpful if its source code is well written and documented. This encourages other researchers to study the tool and build on it, opening space for further work, possibly establishing a new collaboration.

Furthermore, the reviewers might want to look at the implementation of the main algorithm or some other key component. If the documentation points to the respective parts of the source code, that makes it much easier to understand the code, the algorithm, and the overall idea presented in the paper.

Growing computational resource requirements ask for new evaluation setups

Several artifacts submitted to CAV 2015 required tens of gigabytes of memory and dozens of processor cores to repeat the experiments. Obviously, this is more than what today’s laptops and desktops come with, so evaluating such resource-demanding artifacts on one’s personal computer isn’t possible. Furthermore, even for artifacts that can be evaluated on a personal computer, running them alongside everyday tasks interferes with the artifact’s performance, likely invalidating performance results. One clearly needs a dedicated machine for artifact evaluation.

I invite all artifact reviewers to use Emulab, a testbed infrastructure provided and developed at the University of Utah. Any researcher can use the infrastructure for legitimate purposes such as artifact evaluation for a conference. Emulab has powerful hardware, and hosting the demanding artifacts there during my reviewing was a pleasant experience. I described my CAV 2015 artifact evaluation setup on top of Emulab in an earlier post, which other reviewers might find useful.