Semi-Supervised Approach for Recovering Traceability Links in Complex Systems

Building a complex system requires the collaboration of different stakeholders, who work together to model the system while keeping in mind the requirements described in specification documents. This complexity produces a large volume of requirements and models, i.e., artefacts that are subject to frequent changes during the project lifetime. Since the artefacts are correlated with each other, each change has to be rigorously propagated. Identifying traceability links between the system's artefacts is therefore a critical step towards this goal. In the Information Retrieval (IR) domain, many approaches have already been proposed to cope with traceability issues. Their main drawback is that they introduce a large number of false positive links, which makes the validation of traceability links time-consuming and error-prone. In this paper, we propose an approach that identifies traceability links with a reduced share of false positive links, ranging from 20% to 30%, while raising the share of true links identified to as much as 70%. The approach consists of three main steps: 1) we measure syntactic and semantic similarities between pairs of artefacts by combining four major IR techniques; 2) using these similarity measures, we identify the most likely true and false links and build the so-called training data set; 3) this training data set and the four IR techniques are used as input to a predictive model that classifies links as true or false, ultimately leading to a reduced number of false positives. The output is given in the form of a confidence measure that helps the modeller validate the traceability links. We evaluated our approach on four well-known public case studies. Each one comes with a clear identification of the true traceability links, which allowed us to compare them with the outcome of our approach and validate its effectiveness.
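The three steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not name the four IR techniques, the thresholds, or the predictive model, so the similarity scores below are made-up values, the `low`/`high` cut-offs are hypothetical, and a small logistic regression stands in for the unspecified classifier.

```python
import math

def pseudo_label(score_rows, low=0.2, high=0.8):
    """Step 2: keep only pairs whose mean similarity is extreme.

    The low/high thresholds are illustrative assumptions.
    """
    labelled = []
    for feats in score_rows:
        mean = sum(feats) / len(feats)
        if mean >= high:
            labelled.append((feats, 1))  # most likely a true link
        elif mean <= low:
            labelled.append((feats, 0))  # most likely a false link
    return labelled

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, epochs=500, lr=0.5):
    """Step 3: fit a logistic regression by stochastic gradient descent.

    Logistic regression is a stand-in for the paper's predictive model.
    """
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for feats, label in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, feats)))
            err = p - label
            w = [wi - lr * err * xi for wi, xi in zip(w, feats)]
            b -= lr * err
    return w, b

def confidence(model, feats):
    """Confidence in [0, 1] that a candidate link is a true link."""
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, feats)))

# Step 1 (assumed already done): one row per artefact pair, holding four
# similarity scores, one per IR technique. These numbers are invented.
pairs = [
    [0.90, 0.85, 0.95, 0.80],
    [0.88, 0.90, 0.82, 0.91],
    [0.10, 0.05, 0.15, 0.10],
    [0.12, 0.20, 0.08, 0.10],
    [0.60, 0.70, 0.40, 0.65],  # ambiguous pair, left unlabelled
    [0.30, 0.45, 0.35, 0.25],  # ambiguous pair, left unlabelled
]

training = pseudo_label(pairs)
model = train_logistic(training)

# Rank every candidate link by confidence for the modeller to review.
ranked = sorted(pairs, key=lambda f: confidence(model, f), reverse=True)
for feats in ranked:
    print(feats, round(confidence(model, feats), 3))
```

Only the pairs with extreme mean similarity enter the training set; the ambiguous pairs are scored by the trained model, which is what makes the scheme semi-supervised in spirit.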
