Improving Source Code Lexicon via Traceability and Information Retrieval

This paper presents an approach that helps developers keep source code identifiers and comments consistent with high-level artifacts. Specifically, the approach computes and displays the textual similarity between source code and related high-level artifacts. Our conjecture is that developers are induced to improve the source code lexicon, i.e., the terms used in identifiers and comments, if the software development environment provides information about the textual similarity between the source code under development and the related high-level artifacts. The approach also recommends candidate identifiers built from the high-level artifacts related to the source code under development. It has been implemented as an Eclipse plug-in called COde Comprehension Nurturant Using Traceability (COCONUT). The paper also reports on two controlled experiments performed with master's and bachelor's students. The goal of the experiments is to evaluate the quality of identifiers and comments (in terms of their consistency with high-level artifacts) in source code produced with and without COCONUT. The results confirm our conjecture that showing developers the similarity between code and high-level artifacts helps to improve the quality of the source code lexicon. This indicates the potential usefulness of COCONUT as a feature of software development environments.
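
To make the notion of textual similarity between code and a high-level artifact concrete, the following is a minimal sketch rather than the method implemented in COCONUT. It assumes a plain term-frequency cosine similarity with no stop-word removal or stemming; the function names `tokenize` and `cosine_similarity` and the sample requirement and code strings are hypothetical and introduced only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase terms, breaking camelCase and snake_case identifiers."""
    # Insert a space at lower-to-upper transitions: "borrowBook" -> "borrow Book".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Keep alphabetic runs only, so underscores, digits, and punctuation act as separators.
    return [term.lower() for term in re.findall(r"[A-Za-z]+", spaced)]


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    freq_a = Counter(tokenize(text_a))
    freq_b = Counter(tokenize(text_b))
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text shares no terms with anything.
    return dot / (norm_a * norm_b)


# Hypothetical requirement and two hypothetical implementations of it.
requirement = "The user borrows a book from the library"
consistent_code = "class BookLoan { void borrowBook(User user, Book book) }"
opaque_code = "class Mgr { void proc(Obj x, Obj y) }"

consistent_score = cosine_similarity(consistent_code, requirement)
opaque_score = cosine_similarity(opaque_code, requirement)
```

In this sketch the code that reuses the requirement's vocabulary scores higher than the code with opaque names, which is the kind of signal the approach surfaces to the developer.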
