Mining commit log messages to identify risky code

Software vulnerabilities and other risks continually emerge as new code is introduced or existing code is modified. The effects of these risks can be disastrous, not only for the companies and organizations that provide the software but also for those that use it. However, the risks of new or modified code could potentially be mitigated if it were possible to reliably scan all code upon commit. In this paper, we describe a novel approach that uses prior commit log messages as one source of training data for a system that automatically flags risky new commits. Our data-driven approach is designed to complement the hard-wired approaches that most static and dynamic code analysis tools use today. We demonstrate our approach in the context of two major existing projects: Apache HTTP Server (httpd) and Apache Tomcat, a web server and a servlet container, respectively, each used by hundreds of thousands of organizations.