Necessity of packaging research artifacts for reuse

Posted on September 8, 2015 by Marko Dimjašević

Image courtesy of SOCIALisBETTER

As it goes in research, one often builds on earlier work. In today’s world, this includes digital artifacts such as data and software. What I often find to be a big obstacle in computer science is the unsuitable form in which artifacts that are supposed to be used by others get distributed.

Other research fields might be in a better position when it comes to software artifacts, e.g. by having R modules that can be managed through R itself. Generally speaking, that is not the case in computer science. Computer scientists write software artifacts in arbitrary programming languages and usually make them available to others in their original form, i.e. as source code. Anyone who wants to build on that has to go through a lengthy and frequently painful process of figuring out the software dependencies and how to install and configure them. Too many times I’ve seen researchers give up on inspecting or trying to use an artifact because of the tedious installation process. Instead, the original authors could have provided an artifact package that is as easy to install as running this command:

sudo apt-get install <artifact-name>

Since this is often not the case, the person who wants to extend the work or simply build upon it wastes a lot of time. The original authors or their collaborators probably go through the same time-wasting process themselves when they return to their research project after a while and try to figure out how to get everything up and running again. Someone else might be interested in reproducing the research, only to run into the same serious obstacle. Such a situation benefits neither the research community nor ordinary users of the work.

Therefore, I’d argue that any computer scientist who wishes to make their research artifacts (software) useful to others has to package them for easy distribution and installation. This raises the question: what should the means of distribution be, assuming the programming language of choice does not come with a management tool that makes it easy to package and distribute your work?

With the rise of virtualization technology such as virtual machines (e.g. QEMU, VirtualBox, KVM, often managed with a tool like Vagrant) and software containers (e.g. Linux Containers, Docker), researchers have been distributing their artifacts as virtual machines and containers. This lets any third party easily try out the artifact, which is great. However, it might not be easy to go from this temporary, easy-to-try-out virtual machine or container to integrating the artifact into a working research environment; e.g., establishing communication channels between the virtualized environment and the research host environment can be painful. And then there is the overhead of the virtualized environment, which can get in the way. Therefore, a permanent setup needs something that is native to the research host environment.

The next thing that comes to mind is binary packages for operating systems, because installing a software artifact with a command like the one above is a no-brainer. Then the next question is: which operating system should be picked, given that they have mutually incompatible package management systems? Maybe a few of them? Ideally, it should be all of them. However, the situation is not ideal, as there are proprietary operating systems. If a choice has to be made between multiple operating systems, then I’d definitely recommend going with free software operating systems, not just because they’re free software and hence respect your freedom, but also because they have well-functioning, long-established package management systems. This leaves out Windows and OS X, among others.

In free software operating system land, GNU/Linux is the most popular and best-supported operating system, so it makes sense to pick a GNU/Linux distribution over a BSD distribution. Then again, there are more than 1000 GNU/Linux distributions, so which one to pick? The most popular GNU/Linux distribution is Ubuntu. However, it comes with proprietary software and therefore infringes on your freedom. Instead, I suggest Debian, which is free and is very important in the information technology world. Furthermore, if you still prefer Ubuntu, getting a package into Debian means the package will typically make it to Ubuntu too, because Ubuntu, like many other GNU/Linux distributions, is directly or indirectly based on Debian and uses packages from Debian.

Image courtesy of GNOME icon artists

Debian has a well-developed packaging process that is thoroughly documented. In particular, if you are interested in creating a package of your research artifact for Debian, check out the Debian wiki and the extensive Debian New Maintainers’ Guide. The learning curve for creating your first Debian package is steep, but the outcome will be very useful to others, and if you happen to package another program (or an artifact in general) in the future, you will already know a lot.
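
To give a rough feel for what a package is made of, here is a minimal sketch that assembles a binary package by hand with dpkg-deb. It is not the full source-package workflow described in the New Maintainers’ Guide, only a quick illustration. The package name my-artifact, the version, the maintainer and the stand-in executable are all made up for the example.

```shell
#!/bin/sh
# Minimal sketch of a hand-assembled Debian binary package.
# All names below (my-artifact, version, maintainer) are hypothetical.
set -eu

PKG=my-artifact
VERSION=0.1.0
ROOT=$(mktemp -d)/"${PKG}_${VERSION}"

# The package tree mirrors the target file system, plus a DEBIAN/ directory
# that holds the package metadata.
mkdir -p "$ROOT/DEBIAN" "$ROOT/usr/bin"
chmod 755 "$ROOT/DEBIAN"

# A stand-in for the real artifact: a tiny executable script.
cat > "$ROOT/usr/bin/$PKG" <<'EOF'
#!/bin/sh
echo "hello from my-artifact"
EOF
chmod 755 "$ROOT/usr/bin/$PKG"

# The control file with the fields a binary package is expected to have.
cat > "$ROOT/DEBIAN/control" <<EOF
Package: $PKG
Version: $VERSION
Section: science
Priority: optional
Architecture: all
Maintainer: Jane Researcher <jane@example.org>
Description: Hypothetical research artifact
 A longer description of the artifact goes here.
EOF

# Build the .deb next to the package tree, but only where dpkg-deb exists.
if command -v dpkg-deb >/dev/null 2>&1; then
    dpkg-deb --build "$ROOT"
fi
```

On a Debian-based system this should leave a .deb file next to the package tree, which could then be installed with sudo dpkg -i followed by the path to that file. A real package would also declare its dependencies in a Depends field of the control file, which is exactly what spares users from hunting them down by hand.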

Does it sound like a lot of work? It does because it is a lot of work. On the other hand, if everyone sticks to whatever they’ve been doing so far (not packaging their own artifacts, yet being unhappy and grumpy that others don’t package theirs), then we’re going to keep doing the same work over and over again, spending more time both individually and cumulatively on figuring out how to build upon other researchers’ artifacts. Instead of not packaging your own few (software) artifacts and then spending weeks and weeks on installing and configuring others’ non-packaged artifacts, how about each of us spends time on packaging our own artifacts and then enjoys the benefits of a package manager?

Packaging ought to be considered an integral part of every computer scientist’s research, as important as learning how to write a paper and create figures for it.

sudo apt-get install wise-allocation-of-precious-time