Experiences with text mining large collections of unstructured systems development artifacts at JPL

Repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are often so large and poorly structured that they have outgrown our capacity to process their contents manually and extract useful information. Sophisticated text mining methods and tools seem to offer a quick, low-effort way to automate our limited manual efforts. Our experience exploring such methods in three main areas at JPL (historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies) has shown that obtaining useful results requires substantial unanticipated effort, from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized any short-term savings in effort through automation in this area. Surprisingly, however, we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates on some of these benefits and on the important lessons we learned from preparing and applying text mining to large unstructured system artifacts at JPL, with the aim of benefiting future text mining (TM) applications in similar problem domains and in the hope that the lessons can be extended to broader application areas.
