Interactive Near Duplicate Search in Software Documentation

Software features such as classes, methods, requirements, and tests often have similar functionality, which can lead to the emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder documentation maintenance, so detecting duplicates in software documentation is an important task. Solving it makes planned reuse possible, as well as the creation and use of templates for unifying and automatically generating documentation. In this paper, we present an approach to the interactive detection of near duplicates that involves the user in order to make the search meaningful. It includes a new formal definition of a near duplicate, a pattern-based search algorithm, and a proof of its completeness. We also report the results of experiments on a collection of documents from several industrial projects.
