KLEE comes with a nice tutorial. The tutorial demonstrates how to use KLEE on a few examples. All but one of the examples operates on self-contained or so-called closed programs. These example programs do not use any external functionality, including essential things such as a library for printing out to the standard output. However, if a program is to communicate with other programs or to do more complex computation, it is very likely it will need to rely on external functionality. Enter libraries.
The goal of my Google Summer of Code 2016 project is to use KLEE to analyze Debian packages. As you might have guessed, the packages contain libraries. Hence I need to look at how KLEE can analyze libraries.
The library for C programs is the C standard library. The library provides an abstraction layer over the underlying hardware. Virtually every single Debian binary package has a runtime dependency on an implementation of the C library, and in particular on the GNU C library. Therefore, the GNU C library provides the external functionality mentioned above to C programs.
Given that the C standard library is so important and that almost every C program relies on it, KLEE comes with an implementation of the library called KLEE-uClibc, which is based on the uClibc library. Without KLEE having support for the C library, almost no C program could be symbolically executed. However, KLEE-uClibc is unfortunately rather old and likely outdated. Cristian Cadar pointed out in a discussion it might be worthwhile updating KLEE-uClibc. Therefore, I’ve set to bring the implementation to the most recent release of the uClibc library.
I wrote about my KLEE-uClibc efforts to the klee-dev mailing list earlier this week. In a nutshell, I know next to nothing about the C standard library, let alone about uClibc, but I gave upgrading KLEE-uClibc from version 0.9.29 to the upstream 0.9.30 version a try. As noted on the mailing list, I stumbled upon something that seems like a bug in my merging of the 0.9.30 version to KLEE’s 0.9.29 version of the library, but I don’t know how to go about it. Let’s see if I’ll get help and be able to figure out how to fix that.
In general, when a program that KLEE analyzes makes an external call to a library that hasn’t been compiled to LLVM IR and linked together with the program, it can only make that call concretely, which means loosing any involved symbolic information up to that call. This is sub-optimal, but it is better than not being able to make external calls at all. Analyzing a library with KLEE on its own is tricky, as KLEE needs a main() function that drives the execution of the library, and in general libraries come with no main() function. However, if I figure out a way to automatically analyze a library with KLEE, that will likely give higher code coverage of the library compared to running the library concretely. Higher code coverage is desirable, as it increases chances of hitting corner cases where bugs might be lurking.
On the Debile side of the GSoC project, I’ve been reading Debile’s source code to understand how other software analysis tools are supported in Debile. My understanding so far is that KLEE itself does not have to implement the Firehose format; instead, it is enough to parse KLEE’s output and generate Firehose data structures instead, which eventually get output to the Firehose XML format. That may be an easier road than implementing the Firehose format in KLEE, though not as clean. I’ll see what to do.
Finally, I have read a paper that just appeared in the proceedings of the International Conference on Software Engineering 2016. The paper is titled Synthesizing Framework Models for Symbolic Execution. It is on generating GUI library models that can be used in symbolic execution. The thing with symbolic execution (which includes KLEE) is that, as pointed out by paper authors, one often needs models of underlying libraries used by a program in order to analyze the program. The paper focuses on the Java Swing framework and the Android framework. Unfortunately, the approach is not applicable to libraries in Debian that I would like to symbolically execute with KLEE as it requires manual modeling of all other libraries a program relies on (in the case of the paper, one would need to model the Java standard library in the authors’ tool, which would be a bigger effort than manually modeling the Swing framework itself). In other words, this approach doesn’t buy us much. I’ve been thinking how to automatically analyze libraries and Debian library packages with KLEE as it is not straightforward. While I have brainstormed some ideas, I’ll need to think more about this.