Putting KLEE to Test

Posted on May 27, 2016 by Marko Dimjašević
Clover

The coding period of the Google Summer of Code 2016 program started this Monday, and I'm participating in the program with the Debian project by integrating KLEE into Debile. Worth noting is that this nicely aligns with what I've been working on in my PhD program, namely symbolic execution of Debian software with KLEE. I named the project Clover.

While I already had some work done on the project before GSoC's coding period started, here is what I've done this week. I'll try to post weekly updates throughout GSoC.

So far KLEE has been tested on Coreutils, i.e. the GNU core utilities. These comprise about 90 common command-line programs such as cp, mv, cat, ls, and rm. However, as far as I know, since that evaluation in 2008 no one has applied KLEE systematically on a bigger scale, e.g. no one has tested thousands of applications with it. The aim of this GSoC project is to make it push-button easy to do so, especially for every new source package uploaded to Debian. Keep in mind that KLEE can analyze C programs only.

To start with, this week I wanted to see how well KLEE fares in its current state against some Debian packages. So I hand-picked over 30 packages, compiled them to LLVM IR, and ran KLEE on them. Here I wasn’t trying to do anything fancy — I simply used KLEE as an LLVM interpreter, i.e. I provided no symbolic inputs to analyzed programs.
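For the record, the basic workflow for a typical autotools-based package looks roughly like the sketch below. The package binary name is illustrative; the `wllvm` compiler wrapper and `extract-bc` are WLLVM's tools for getting whole-program LLVM bitcode out of an ordinary build:

```shell
# Tell WLLVM which underlying LLVM compiler to wrap
export LLVM_COMPILER=clang

# Build the package with wllvm instead of gcc, so every object
# file carries a copy of its LLVM bitcode
CC=wllvm ./configure
make

# Pull the whole-program bitcode out of the linked binary
# (produces some-program.bc next to the binary)
extract-bc ./some-program

# Run KLEE on the bitcode; with no symbolic inputs provided,
# KLEE acts as a plain LLVM interpreter executing the single
# concrete path through the program
klee ./some-program.bc
```

The exact configure invocation of course varies from package to package; the point is that no source changes are needed, only a compiler swap.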

The output of KLEE executions on the console was interesting. I noticed that I could link some libraries into bitcode, but running KLEE on them resulted in segmentation faults. My hunch is that this happens because libraries have no main function, which KLEE needs as an entry point.
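If that hunch is right, one way around it would be a small driver that gives KLEE an entry point into the library. The sketch below is only illustrative: `lib_parse` is a made-up library function, while `klee_make_symbolic` is KLEE's actual API for marking memory as symbolic (it requires KLEE's headers, so this only compiles in a KLEE environment):

```c
#include <klee/klee.h>

/* Hypothetical library function we want KLEE to explore. */
int lib_parse(int input);

int main(void) {
    int x;
    /* Mark x as symbolic so KLEE explores all feasible paths
       through lib_parse, rather than one concrete run. */
    klee_make_symbolic(&x, sizeof(x), "x");
    return lib_parse(x);
}
```

Whether Clover should generate such drivers automatically is an open question for later in the project.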

Interestingly enough, while KLEE is meant for testing other applications, by running it on sizable applications I have put KLEE itself to the test. This triggered some previously unknown bugs in KLEE, e.g. a null pointer dereference and an unsupported jump instruction. I reported other issues and bugs as well, including more memory bugs. So far the folks actively involved in KLEE development have been very responsive, and some bugs are already being fixed. That's highly appreciated! There was also an example with the Flex tool where I wasn't sure whether the problem was in KLEE, so I sent an email to the KLEE-dev list. Because I rely on WLLVM in the process, I also reported several warnings it issued while processing the packages (which already got fixed in a fork).

In the meantime I'm getting familiar with KLEE's code base as well. Earlier I asked how to go about implementing the Firehose analysis format. I still need a better understanding of where this would fit in the code base.

Finally, I read a research paper on symbolic execution: Enhancing Symbolic Execution with Veritesting (direct link). It referenced a research tool that was discussed on the debian-devel mailing list and that found lots of bugs in applications in Debian GNU/Linux. In 2013 a researcher from Carnegie Mellon University emailed the debian-devel list to notify Debian developers that he and his fellow researchers had found 1200 bugs in Debian-packaged programs. They did that with a tool called Mayhem, which, like KLEE, is based on symbolic execution. Sadly, when faced with having to choose between science and business, the researchers chose the latter. In other words, instead of releasing the tool as free software, they made it proprietary and started a company around it. Stefano Zacchiroli of Debian fame asked them to release Mayhem as free software as "[…] doing so is also in the interest of better reproducibility and peer review of scientific experiments, as you certainly know".

Unfortunately, this obviously didn't resonate with the researchers, and Mayhem never became free software. Therefore, those 1200 reported bugs were a one-time thing to support their research paper, and that's it. This is a recipe for how not to advance science: it was their pockets that benefited here, not science and, consequently, the general public. You could argue the opposite is the case, but if you have ever implemented a non-trivial software system, and even more so used or built upon such a system for research, you'll know better.