Does Principal Component Analysis Improve Cluster-Based Analysis?

Researchers in dynamic program analysis have used cluster analysis extensively to address a variety of problems. Typically, the clustering techniques are applied to execution profiles of high dimensionality (i.e., involving a large number of profiling elements), sometimes on the order of thousands or even hundreds of thousands. Our concern is that such a large number of profiling elements might diminish the effectiveness of the clustering process, which led us to explore dimensionality reduction as a preprocessing step before clustering. Specifically, in this work we used Principal Component Analysis (PCA) as the dimensionality reduction technique and investigated its impact on two cluster-based analysis techniques: one that aims to identify coincidentally correct tests, and one that aims to minimize test suites. In other words, we tried to assess whether PCA improves cluster-based analysis. Our experimental results showed that the impact was positive for the first technique but inconclusive for the second, which calls for further investigation.
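As an illustration of the general pipeline described above (reduce the execution profiles with PCA, then cluster the reduced profiles), the following is a minimal sketch rather than the setup used in this work. The synthetic coverage data, the 95% retained-variance threshold, the use of k-means with a farthest-first initialization, and the function names `pca_reduce` and `kmeans` are all assumptions made for the example.

```python
import numpy as np


def pca_reduce(profiles, variance_to_keep=0.95):
    """Project profiles (rows = tests) onto the leading principal components.

    The retained-variance threshold is an assumption of this sketch.
    """
    centered = profiles - profiles.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    ratio = np.cumsum(explained) / explained.sum()
    # Smallest number of components whose cumulative variance reaches the threshold.
    k = min(int(np.searchsorted(ratio, variance_to_keep)) + 1, len(singular_values))
    return centered @ vt[:k].T


def kmeans(points, n_clusters, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-first initialization."""
    centers = [points[0]]
    for _ in range(1, n_clusters):
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[nearest.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical data: 40 tests x 500 profiling elements (1 = element covered).
# The first 20 tests mostly cover elements 0-249, the last 20 mostly 250-499.
rng = np.random.default_rng(0)
p_first = np.where(np.arange(500) < 250, 0.9, 0.1)
group_a = rng.random((20, 500)) < p_first
group_b = rng.random((20, 500)) < (1.0 - p_first)
profiles = np.vstack([group_a, group_b]).astype(float)

reduced = pca_reduce(profiles)          # far fewer columns than 500
labels = kmeans(reduced, n_clusters=2)  # cluster in the reduced space
print(profiles.shape, "->", reduced.shape)
print(labels)
```

With 40 tests, the centered data has rank at most 39, so the reduced profiles have at most 40 columns regardless of how many profiling elements were recorded. Whether clustering in the reduced space actually helps a downstream analysis is the empirical question this work examines.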
