Extracting Code Segments and Their Descriptions from Research Articles

The availability of large corpora of online software-related documents presents an opportunity to use machine learning to improve integrated development environments, beginning with the automatic collection of code examples and their associated descriptions. Digital libraries of conference and journal articles in computer science research and education can be a rich source of code examples that are used to motivate or explain particular concepts or issues. Because they serve as examples in an article, these code segments are accompanied by natural language descriptions of their functionality, properties, or other associated information. Identifying code segments in these documents is relatively straightforward, so this paper tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the main challenge: unlike in developer communications such as online forums, the descriptive text in an article is often not colocated with the code segment it describes.
