Enabling Open-World Specification Mining via Unsupervised Learning

Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file I/O). Several small patterns often combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. We call the problem of mining specifications in this environment, without the aid of rule templates, user-directed feedback, or predefined API surfaces, Open-World Specification Mining. In this paper, we present a framework for mining specifications and usage patterns in an Open-World setting. We design this framework to be miner-agnostic; it focuses instead on disentangling complex and noisy API interactions. To evaluate our framework, we introduce a benchmark of 71 clusters extracted from five open-source projects. Using this dataset, we show that interesting clusters can be recovered fully automatically by leveraging unsupervised learning in the form of word embeddings. Once clusters have been recovered, the challenge of Open-World Specification Mining is simplified, and any trace-based mining technique can be applied. In addition, we provide a comprehensive evaluation of three word-vector learners to showcase the value of sub-word information for embeddings learned in the software-engineering domain.
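The cluster-then-mine idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the traces are hypothetical toy data, plain co-occurrence count vectors stand in for learned word embeddings, and a simple similarity-threshold grouping stands in for whatever clustering the framework actually uses. The function names, the window size, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy traces: each list is a sequence of API calls along one path.
TRACES = [
    ["fopen", "fread", "fclose"],
    ["fopen", "fwrite", "fclose"],
    ["fopen", "fread", "fwrite", "fclose"],
    ["lock", "update", "unlock"],
    ["lock", "unlock"],
    ["lock", "update", "update", "unlock"],
]


def cooccurrence_vectors(traces, window=2):
    """Build a sparse context-count vector for every call name.

    Stand-in for a learned embedding: a call is represented by how often
    other calls occur within `window` positions of it.
    """
    vecs = defaultdict(lambda: defaultdict(float))
    for trace in traces:
        for i, word in enumerate(trace):
            lo, hi = max(0, i - window), min(len(trace), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[word][trace[j]] += 1.0
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vecs, threshold=0.3):
    """Group calls whose vectors are similar (single-link, via union-find)."""
    words = sorted(vecs)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vecs[a], vecs[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(groups.values())


def project(traces, members):
    """Restrict every trace to one cluster's calls, dropping empty results."""
    keep = set(members)
    projected = ([call for call in t if call in keep] for t in traces)
    return [p for p in projected if p]


clusters = cluster(cooccurrence_vectors(TRACES))
print(clusters)
```

Each recovered cluster can then be passed through `project` to obtain traces that mention only that cluster's calls, which is the form of input a trace-based miner would expect. On realistic, interleaved traces, count vectors and a fixed threshold would likely be too crude, which is where learned embeddings come in.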
