Supporting Concept Extraction and Identifier Quality Improvement through Programmers' Lexicon Analysis

Identifiers play an important role in communicating the intentions associated with the program entities they represent. The information captured in identifiers helps programmers (re-)build a “mental model” of the software and facilitates understanding. (Re-)building this mental model and understanding large software systems, however, is difficult and expensive, and the effort involved depends heavily on the quality of the programmers’ lexicon used to construct the identifiers. This thesis addresses the problem of program understanding, focusing on (i) concept extraction and (ii) the quality of the lexicon used in identifiers. To address the first problem (concept extraction), we propose and evaluate two ontology extraction approaches that exploit the natural language information captured in identifiers and the structural information of the source code. We also propose a method to automatically train a natural language analyzer for identifiers; the trained analyzer is then used for concept extraction. The evaluation was conducted on a program understanding task, concept location, and the results show that the extracted concepts increase the effectiveness of concept location queries. Besides extracting concepts from the source code, we investigate information retrieval (IR) based techniques to filter domain concepts from implementation concepts. To address the second problem (quality of the lexicon used in identifiers), we define a publicly available catalog of lexicon bad smells (LBS) and develop a suite of tools to detect them automatically. LBS indicate potential lexicon construction problems that can be addressed through refactoring. We study empirically the impact of LBS on concept location and the contribution they can make to fault prediction. Results indicate that LBS refactoring has a significant positive impact on the IR-based concept location task and contributes to improving fault prediction when used in conjunction with structural metrics. In addition to detecting LBS in identifiers, we also try to fix them: we propose an approach that uses the concepts extracted from the source code to suggest names that can complete or replace an identifier. The evaluation of this approach shows that it provides useful suggestions, which can effectively help programmers write consistent names.
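
To make the notion of an identifier lexicon concrete, the following minimal sketch splits identifiers into terms and ranks them against a free-text query by term overlap. It is an illustration only, not the approach proposed in the thesis: the function names `split_identifier` and `rank_by_term_overlap` are hypothetical, and the overlap score is a crude stand-in for a real IR model.

```python
import re


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    terms = []
    # Underscores, digits and other non-letters act as separators.
    for chunk in re.split(r"[_\W\d]+", identifier):
        # Match either an acronym run (capitals not followed by a lowercase
        # letter) or an optionally capitalised lowercase word.
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [term.lower() for term in terms]


def rank_by_term_overlap(query, identifiers):
    """Rank identifiers by the fraction of query terms they contain.

    Assumption: plain term overlap is used here only for illustration.
    """
    query_terms = set(query.lower().split())
    scored = []
    for identifier in identifiers:
        terms = set(split_identifier(identifier))
        score = len(query_terms & terms) / len(query_terms) if query_terms else 0.0
        scored.append((score, identifier))
    # sorted() is stable, so identifiers with equal scores keep input order.
    return [name for _, name in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    names = ["parseXMLDocument", "getAccountBalance", "user_account"]
    print(split_identifier("parseXMLDocument"))
    print(rank_by_term_overlap("account balance", names))
```

In this toy setting, an identifier built from consistent, meaningful terms is matched by more query terms, which is the intuition behind relating lexicon quality to concept location effectiveness.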
