Firehosing KLEE

Posted on June 11, 2016 by Marko Dimjašević


The third week of my Google Summer of Code 2016 project is behind me, and I have started changing KLEE’s source code in order to meet the goal of the project. The goal is to get KLEE plugged into Debian’s Debile, and for that to happen KLEE has to support Firehose, an interchange format for the results of (static) analysis tools.

Every analysis tool that has been plugged into Debile either supports the Firehose format or has a converter from the tool’s output to Firehose implemented in Debile. I had a discussion over email with my project mentor, my co-mentor, and two more people who wrote the Firehose format and contributed to it. Essentially, I had to choose where to implement Firehose support for KLEE: 1) in KLEE itself, 2) in Debile, or 3) in the Firehose format’s code base. Implementing it in KLEE made the most sense to me, and that is what was suggested in the discussion too. Therefore, I started coding.

KLEE is implemented in C++. While I knew C++ to some extent before, I’ve spent quite some time this week figuring out its peculiarities. Its class system, default and non-default constructors, base classes, pure virtual (abstract) functions, and constructor initialization lists caused me some headaches, but I figured them out with help from a lab mate.
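As a minimal illustration of the features that tripped me up, here is a toy sketch (not KLEE code, and assuming a C++11 compiler): an abstract base class with a pure virtual function, a non-default constructor, and a constructor initialization list.

```cpp
#include <iostream>
#include <string>

// An abstract base class: the pure virtual function below makes it
// impossible to instantiate Node directly.
class Node {
public:
  // A non-default constructor; the constructor initialization list sets
  // the const member before the constructor body runs.
  explicit Node(const std::string &name) : name(name) {}
  virtual ~Node() {}

  // Pure virtual (abstract) function that derived classes must implement.
  virtual std::string toXML() const = 0;

protected:
  const std::string name;
};

// A derived class implementing the pure virtual function.
class Message : public Node {
public:
  explicit Message(const std::string &text) : Node("message"), text(text) {}

  std::string toXML() const { return "<" + name + ">" + text + "</" + name + ">"; }

private:
  const std::string text;
};

int main() {
  Message m("hello");
  std::cout << m.toXML() << std::endl;
  return 0;
}
```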

KLEE is already a complex code base, and with the Firehose format I didn’t want to add too much to that complexity. Therefore, I decided to use persistent data structures in implementing Firehose. That meant adding const qualifiers all over the place, which illustrates how verbose C++ is.
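To give an idea of what I mean by persistent data structures in C++, here is a toy sketch; the class and method names are made up, not the actual Firehose library. Every member is const, and an “update” returns a new object instead of mutating the existing one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A persistent (immutable) value type: all members are const, and "adding"
// an element returns a fresh object instead of mutating this one.
class IssueList {
public:
  IssueList() {}
  explicit IssueList(const std::vector<std::string> &issues) : issues(issues) {}

  // Returns a new list with one more issue; the original stays unchanged.
  IssueList withIssue(const std::string &issue) const {
    std::vector<std::string> extended(issues);
    extended.push_back(issue);
    return IssueList(extended);
  }

  std::size_t size() const { return issues.size(); }

private:
  const std::vector<std::string> issues;
};

int main() {
  IssueList empty;
  IssueList one = empty.withIssue("out-of-bounds read");
  // The original list is untouched; only the new object grew.
  return (one.size() == 1 && empty.size() == 0) ? 0 : 1;
}
```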

What I have now is an almost finished library of Firehose data structures to be used in the rest of the KLEE code base. With this library one can create almost all of the XML tree nodes present in Firehose and print them out. The next step will be adding a new command-line flag to KLEE that enables reporting analysis results in the Firehose format alongside KLEE’s usual output format. This will also require adapting the code base in a few places, and I was pointed to a function that processes a test case. From there I managed to trace where my changes integrating Firehose will start.
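For a rough picture of what this flag is meant to produce, here is a hedged sketch of a single issue serialized to Firehose XML, as I understand the public schema. The element and attribute names follow the Firehose documentation as best I recall them, and the concrete values (file, line, message) are made up for illustration.

```cpp
#include <iostream>

// Prints an example of what one KLEE-reported issue could look like once
// serialized to Firehose XML. This is illustrative output, not the library.
int main() {
  std::cout <<
    "<analysis>\n"
    "  <metadata>\n"
    "    <generator name=\"klee\"/>\n"
    "  </metadata>\n"
    "  <results>\n"
    "    <issue>\n"
    "      <message>memory error: out of bound pointer</message>\n"
    "      <location>\n"
    "        <file given-path=\"src/main.c\"/>\n"
    "        <point line=\"42\" column=\"7\"/>\n"
    "      </location>\n"
    "    </issue>\n"
    "  </results>\n"
    "</analysis>\n";
  return 0;
}
```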

Knowing a newer programming language like Scala makes you appreciate its features when you dive into a language as old as C++. Yes, C++ has newer standards like C++11 and C++14, but you can improve a language only so much by patching it. Implementing the persistent data structures mentioned earlier convinced me that C++ is not a good fit for this kind of data structure, which is, in my view, paramount for managing code complexity.

Because C++ doesn’t have optional values the way Scala does, I had to resort to not-so-nice design choices in implementing the Firehose library in KLEE. In particular, I had to use static objects, which are essentially global variables, and then compare their memory addresses against the addresses of meaningful values from the domain. That’s very hairy. However, optional values will arrive in C++17.
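Here is a toy sketch of that sentinel pattern; the names are illustrative, not the actual code. A distinguished static object stands in for “no value”, and callers compare addresses against it, which is exactly the kind of check C++17’s std::optional makes unnecessary.

```cpp
#include <iostream>
#include <string>

// Illustrative value type with a single const member.
class Version {
public:
  explicit Version(const std::string &value) : value(value) {}
  const std::string value;
};

// The sentinel: a distinguished static object meaning "no version given".
static const Version NO_VERSION("");

std::string describe(const Version &version) {
  // Address comparison against the sentinel decides whether a real value
  // was supplied; std::optional (C++17) expresses this directly as an
  // empty optional instead.
  if (&version == &NO_VERSION)
    return "unknown version";
  return "version " + version.value;
}

int main() {
  Version v("1.3.0");
  std::cout << describe(v) << "\n";           // a real value
  std::cout << describe(NO_VERSION) << "\n";  // the "missing" case
  return 0;
}
```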

Finally, I read a related research paper from 2013 titled “Billions and Billions of Constraints: Whitebox Fuzz Testing in Production”. It comes from the proprietary side of the world (Microsoft in this case). The paper is an experience report on using a symbolic execution tool on about 300 software applications. Even though I’m still early in my project of using KLEE to analyze thousands of C programs, i.e., at scale, I have already stumbled upon many of the challenges mentioned in the paper. The constraints from the title are path constraints in symbolic execution, which are translated into SMT queries for solvers.

As expected from Microsoft, the reported research is not reproducible: they haven’t made the billions of constraints publicly available, nor pretty much any other data, nor the tools they developed and used. What’s more, there is a figure in the paper without a scale on the y axis, with a footnote explaining: “Unfortunately, we cannot reveal absolute numbers of crashes found.” Well, if you keep data proprietary for no scientifically justified reason, your paper shouldn’t be considered for publication in a peer-reviewed research journal or conference. It’s very sad that computer science has become a sloppy field in which all kinds of things get published. What I’m planning to do in this project is actually publish the solver constraints that KLEE generates, which will hopefully be beneficial for researchers in the satisfiability modulo theories field.