Paper Information - Enabling Open-World Specification Mining via Unsupervised Learning

Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, without the aid of rule templates, user-directed feedback, or predefined API surfaces, is a major challenge. We call this challenge Open-World Specification Mining. In this paper, we present a framework for mining specifications and usage patterns in an Open-World setting. We design this framework to be miner-agnostic and instead focus on disentangling complex and noisy API interactions. To evaluate our framework, we introduce a benchmark of 71 clusters extracted from five open-source projects. Using this dataset, we show that interesting clusters can be recovered, in a fully automatic way, by leveraging unsupervised learning in the form of word embeddings. Once clusters have been recovered, the challenge of Open-World Specification Mining is simplified and any trace-based mining technique can be applied. In addition, we provide a comprehensive evaluation of three word-vector learners to showcase the value of sub-word information for embeddings learned in the software-engineering domain.
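The pipeline the abstract describes (embed API calls, recover clusters of related calls, then hand per-cluster traces to any trace-based miner) can be illustrated with a minimal sketch. This is not the paper's framework: raw co-occurrence counts stand in for learned word embeddings, a similarity threshold with connected components stands in for a real clustering algorithm, and the traces, function names, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Toy call traces (hypothetical data). The last trace interleaves two APIs,
# which is the kind of noise that clustering is meant to disentangle.
TRACES = [
    ["fopen", "fread", "fclose"],
    ["fopen", "fwrite", "fclose"],
    ["lock", "update", "unlock"],
    ["lock", "read_state", "unlock"],
    ["fopen", "fread", "fclose"],
    ["lock", "update", "unlock"],
    ["fopen", "lock", "fread", "unlock", "fclose"],
]


def cooccurrence_vectors(traces):
    """Map each call to counts of the other calls seen in the same trace.

    Assumption: these count vectors are a crude stand-in for embeddings
    learned by a word-vector model.
    """
    vectors = defaultdict(Counter)
    for trace in traces:
        for a, b in combinations(trace, 2):
            if a != b:
                vectors[a][b] += 1
                vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Link calls whose similarity meets the threshold, then return the
    connected components (computed with union-find) as sorted name lists."""
    parent = {name: name for name in vectors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(sorted(vectors), 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in vectors:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())


def project(trace, group):
    """Keep only the calls of one cluster, preserving their order."""
    return [call for call in trace if call in group]


clusters = cluster(cooccurrence_vectors(TRACES))
for group in clusters:
    # Each projected trace could now be passed to a trace-based miner.
    print(group, project(TRACES[-1], set(group)))
```

Projecting the interleaved trace onto each recovered cluster yields one simpler trace per API, which is the sense in which clustering reduces the open-world problem to ordinary trace-based mining. In practice the count vectors would be replaced by learned embeddings, and the threshold linking by a proper clustering algorithm.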

Shuvendu K. Lahiri | Thomas W. Reps | Jordan Henkel | Ben Liblit
