Irish: A Hidden Markov Model to detect coded information islands in free text

Abstract

Developers' communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers' communication can be useful to support several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging because free text is unstructured and mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language. We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, and compare it with state-of-the-art approaches based on island parsing or on island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure for Irish between 74% and 99%; this is in line with existing approaches which, unlike Irish, require specific expertise to define regular expressions or grammars.
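To make the idea of labelling free text at token granularity with a Hidden Markov Model more concrete, here is a minimal sketch in Python. It is not the Irish model: the two states (`NL`, `CODE`), the three coarse token classes, the tokenizer, and every probability below are hypothetical values picked by hand for illustration, whereas a real model would estimate its parameters from training data. The sketch only shows how Viterbi decoding assigns one hidden state to each token.

```python
import math
import re

STATES = ("NL", "CODE")

# Hypothetical, hand-picked parameters (not learned, not taken from the paper).
START = {"NL": 0.8, "CODE": 0.2}
TRANS = {
    "NL": {"NL": 0.9, "CODE": 0.1},
    "CODE": {"NL": 0.1, "CODE": 0.9},
}
EMIT = {
    "NL": {"word": 0.85, "symbol": 0.05, "ident": 0.10},
    "CODE": {"word": 0.20, "symbol": 0.45, "ident": 0.35},
}


def tokenize(text):
    """Split text into identifier-like runs, digit runs, and single symbols."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[^\sA-Za-z_0-9]", text)


def symbol_class(token):
    """Map a token to one of three coarse observation classes (an assumption)."""
    if re.fullmatch(r"[a-z]+|[A-Z][a-z]*", token):
        return "word"    # plain lowercase or capitalised word
    if re.fullmatch(r"[A-Za-z_0-9]+", token):
        return "ident"   # camelCase, snake_case, digits, mixed alphanumerics
    return "symbol"      # punctuation and operators


def viterbi(observations):
    """Return the most likely state sequence, computed in log space."""
    if not observations:
        return []
    first = observations[0]
    score = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def label_tokens(text):
    """Pair every token of the text with its decoded state."""
    tokens = tokenize(text)
    return list(zip(tokens, viterbi([symbol_class(t) for t in tokens])))
```

Calling `label_tokens` on a message returns `(token, state)` pairs; maximal runs of consecutive `CODE` tokens would then correspond to the extracted islands. The transition probabilities favour staying in the current state, which is what lets isolated ambiguous tokens inherit the label of their neighbours.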
