PR-Miner: automatically extracting implicit programming rules and detecting violations in large software code

Programs usually follow many implicit programming rules, most of which are too tedious to be documented by programmers. When these rules are violated by programmers who are unaware of or forget about them, defects can be easily introduced. Therefore, it is highly desirable to have tools to automatically extract such rules and also to automatically detect violations. Previous work in this direction focuses on simple function-pair based programming rules and additionally requires programmers to provide rule templates.This paper proposes a general method called PR-Miner that uses a data mining technique called frequent itemset mining to efficiently extract implicit programming rules from large software code written in an industrial programming language such as C, requiring little effort from programmers and no prior knowledge of the software. Benefiting from frequent itemset mining, PR-Miner can extract programming rules in general forms (without being constrained by any fixed rule templates) that can contain multiple program elements of various types such as functions, variables and data types. In addition, we also propose an efficient algorithm to automatically detect violations to the extracted programming rules, which are strong indications of bugs.Our evaluation with large software code, including Linux, PostgreSQL Server and the Apache HTTP Server, with 84K--3M lines of code each, shows that PR-Miner can efficiently extract thousands of general programming rules and detect violations within 2 minutes. Moreover, PR-Miner has detected many violations to the extracted rules. Among the top 60 violations reported by PR-Miner, 16 have been confirmed as bugs in the latest version of Linux, 6 in PostgreSQL and 1 in Apache. Most of them violate complex programming rules that contain more than 2 elements and are thereby difficult for previous tools to detect. We reported these bugs and they are currently being fixed by developers.

[1]  Elaine J. Weyuker,et al.  A Tool for Mining Defect-Tracking Systems to Predict Fault-Prone Files , 2004, MSR.

[2]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[3]  Yang Meng Tan,et al.  LCLint: a tool for using specifications to check code , 1994, SIGSOFT '94.

[4]  Michael D. Ernst,et al.  Efficient incremental algorithms for dynamic detection of likely invariants , 2004, SIGSOFT '04/FSE-12.

[5]  Andy Chou,et al.  Bugs as Inconsistent Behavior: A General Approach to Inferring Errors in Systems Code. , 2001, SOSP 2001.

[6]  William G. Griswold,et al.  Dynamically discovering likely program invariants to support program evolution , 1999, Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002).

[7]  Michael D. Ernst,et al.  Automatic generation of program specifications , 2002, ISSTA '02.

[8]  Mana Taghdiri Inferring Specifications to Detect Errors in Code , 2004, ASE.

[9]  Michel Caplain,et al.  Finding Invariant assertions for proving programs , 1975, Reliable Software.

[10]  Dawson R. Engler,et al.  Bugs as deviant behavior: a general approach to inferring errors in systems code , 2001, SOSP.

[11]  Michael D. Ernst,et al.  Invariant inference for static checking: , 2002, SIGSOFT '02/FSE-10.

[12]  Michael D. Ernst,et al.  Invariant inference for static checking: an empirical evaluation , 2002, SOEN.

[13]  Zohar Manna,et al.  Automatic Generation of Invariants and Intermediate Assertions , 1997, Theor. Comput. Sci..

[14]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[15]  Yuanyuan Zhou,et al.  CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code , 2004, OSDI.

[16]  K. Rustan M. Leino,et al.  Houdini, an Annotation Assistant for ESC/Java , 2001, FME.

[17]  Omer F. Rana,et al.  Template Mining in Source-Code Digital Libraries , 2004, MSR.

[18]  Gösta Grahne,et al.  Efficiently Using Prefix-trees in Mining Frequent Itemsets , 2003, FIMI.

[19]  Andreas Zeller,et al.  Mining version histories to guide software changes , 2005, Proceedings. 26th International Conference on Software Engineering.

[20]  James R. Larus,et al.  Mining specifications , 2002, POPL '02.

[21]  David Notkin,et al.  Tool-assisted unit test selection based on operational violations , 2003, 18th IEEE International Conference on Automated Software Engineering, 2003. Proceedings..

[22]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[23]  Ulrich Güntzer,et al.  Algorithms for association rule mining — a general survey and comparison , 2000, SKDD.

[24]  James H. Morris,et al.  Subgoal induction , 1977, CACM.

[25]  Junfeng Yang,et al.  An empirical study of operating systems errors , 2001, SOSP.

[26]  Mark Lillibridge,et al.  Extended static checking for Java , 2002, PLDI '02.

[27]  Junfeng Yang,et al.  Correlation exploitation in error ranking , 2004, SIGSOFT '04/FSE-12.

[28]  Ben Wegbreit,et al.  The synthesis of loop predicates , 1974, CACM.

[29]  Hassen Saïdi,et al.  Powerful Techniques for the Automatic Generation of Invariants , 1996, CAV.

[30]  Ralph E. Johnson,et al.  Refactoring C with conditional compilation , 2003, 18th IEEE International Conference on Automated Software Engineering, 2003. Proceedings..

[31]  Annie T. T. Ying,et al.  Predicting source code changes by mining revision history , 2003 .

[32]  Serge Demeyer,et al.  Mining Version Control Systems for FACs (Frequently Applied Changes) , 2004, MSR.

[33]  Katsuro Inoue,et al.  Empirical Project Monitor: A Tool for Mining Multiple Project Data , 2004, MSR.