Malware detection for Android

Posted on July 6, 2014 by Marko Dimjašević

Courtesy of the U.S. Army

Last year I took a class in machine learning. I had always been afraid of doing anything related to machine learning, let alone taking a graduate-level class on it. The fear was justified: I have little to no background in statistics and I hadn’t taken a class like that at the undergraduate level. On top of that, the main focus of my PhD is software verification, which has nothing to do with machine learning. Nevertheless, my advisor and I agreed it could be a good idea to take the class, so I registered for it for the Fall 2013 semester.

Fast forward a few months and I am writing a final project report for the class, which will become a research paper this summer. My advisor came up with the idea of what to work on, Simone and I figured out how to make it happen on Android, and there you go, we had a project. The project is on malware detection on Android, and it relies on monitoring the system calls that an app makes during its execution. The early results in the project report were promising, so we decided to grow the class project into a research project.

Since then we have performed a major overhaul of the tool behind the project (dubbed maline), but the overall idea has stayed the same: monitor dependencies between pairs of system calls during the execution of an app and use machine learning to discriminate between benign and malicious apps, i.e. between app behaviors. I will not go into the details of the approach, but if you are interested, go ahead and read what we wrote in the class project report. We also got Ivo on the team to help us understand statistics and how to carry out experiments. Ivo has taken it even further, showing all of us that we can significantly reduce the complexity of the problem and still get remarkable results.
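To give a rough feel for the idea, here is a minimal Python sketch of turning a system call trace into a feature vector. It is not maline's actual code: it assumes, purely for illustration, that a "dependency" is just one call immediately following another, and the names `pair_counts`, `feature_vector`, `vocabulary`, and `trace` are made up for this example. The report describes the real definition.

```python
from collections import Counter
from itertools import product


def pair_counts(trace):
    """Count each ordered pair of consecutive system calls in a trace.

    Assumption for illustration: adjacency stands in for "dependency".
    """
    return Counter(zip(trace, trace[1:]))


def feature_vector(trace, vocabulary):
    """Lay the pair counts out in a fixed order, one entry per possible pair."""
    counts = pair_counts(trace)
    return [counts[pair] for pair in product(vocabulary, repeat=2)]


# Hypothetical system call vocabulary and a toy trace of one app run.
vocabulary = ["open", "read", "write", "close"]
trace = ["open", "read", "read", "write", "close"]

vector = feature_vector(trace, vocabulary)
print(len(vector), "features,", sum(vector), "observed pairs")
```

Vectors like this one, computed for many benign and malicious apps, are the kind of input a machine learning classifier could then be trained on.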

Throughout the project I learned a lot about Android and the Android SDK, and we even assembled our own custom version of the Android SDK that suits our purposes. The reason for the custom version is that the Android SDK is a remarkably unstable, buggy, and poorly tested project, which leaves its users at the (lack of) mercy of the Android SDK developers. We pulled pieces from the official release and from its source tree and managed to build a version that we can use in the machine learning project.

Another nice aspect of the project has been the effort to make it reproducible. Because I pushed myself to document everything I do and to make it easy for others to reproduce, we now have an experiment environment that comes with a nice set of executable instructions for repeating even the tiniest pieces of the project. Everything, from a clean operating system installation through setting up and customizing its environment to running the experiments, has been written down and can be repeated by executing selected parts of the project tool. The effort will certainly pay off in other projects as well, because it has next to nothing to do with the specifics of machine learning or our main project idea.

We are planning to perform experiments in the upcoming weeks and write them up. At the moment I am not sure where we will submit the paper, but I am sure we will find a venue that suits it really well.

I am going to write about the reproducible research aspect of the project in an upcoming blog post, so stay tuned if you are interested in that.