Mining statistical correlations with applications to software analysis

Machine learning, data mining, and statistical methods work by representing real-world objects in terms of feature sets that best describe them. This thesis addresses problems related to inferring and analyzing correlations among such features. The contributions of this thesis are two-fold: we develop formulations and algorithms for addressing correlation mining problems, and we also provide novel applications of our methods to statistical software analysis domains. We consider problems related to analyzing correlations via unsupervised approaches, as well as algorithms that infer correlations using fully-supervised or semi-supervised information. In the context of correlation analysis, we propose the problem of correlation matrix clustering which employs a k-means style algorithm to group sets of correlations in an unsupervised manner. Fundamental to this algorithm is a measure for comparing correlations called the log-determinant (LogDet) divergence, and a primary contribution of this thesis is that of interpreting and analyzing this measure in the context of information theory and statistics. Additionally based on the LogDet divergence, we present a metric learning problem called Information-Theoretic Metric Learning which uses semi-supervised or fully-supervised data to infer correlations for parametrization of a Mahalanobis distance metric. We also consider the problem of learning Mahalanobis correlation matrices in the presence of high dimensions when the number of pairwise correlations can grow very large. In validating our correlation mining methods, we consider two in-depth and real-world statistical software analysis problems: software error reporting and unit test prioritization. In the context of Clarify, we investigate two types of correlation mining applications: metric learning for nearest neighbor software support, and decision trees for error classification. We show that our metric learning algorithms can learn program-specific similarity models for more accurate nearest neighbor comparisons. In the context of decision tree learning, we address the problem of learning correlations with associated feature costs, in particular, the overhead costs of software instrumentation. As our second application, we present a unit test ordering algorithm which uses clustering and nearest neighbor algorithms, along with a metric learning component, to efficiently search and execute large unit test suites.

[1]  Andrzej Stachurski,et al.  Parallel Optimization: Theory, Algorithms and Applications , 2000, Scalable Comput. Pract. Exp..

[2]  E. L. Lehmann,et al.  Theory of point estimation , 1950 .

[3]  Sheng Ma,et al.  Quickly Finding Known Software Problems via Automated Symptom Matching , 2005, Second International Conference on Autonomic Computing (ICAC'05).

[4]  Jong-Deok Choi,et al.  Accurate, efficient, and adaptive calling context profiling , 2006, PLDI '06.

[5]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[6]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[7]  Gregg Rothermel,et al.  An empirical investigation of the relationship between spectra differences and regression faults , 2000 .

[8]  Byron Dom,et al.  An Information-Theoretic External Cluster-Validity Measure , 2002, UAI.

[9]  Yoram Singer,et al.  Online and batch learning of pseudo-metrics , 2004, ICML.

[10]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[11]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[12]  James R. Larus,et al.  Efficient path profiling , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[13]  Vinod Ganapathy,et al.  HeapMD: identifying heap-based bugs using anomaly detection , 2006, ASPLOS XII.

[14]  Donald E. Porter,et al.  Improved error reporting for software that uses black-box components , 2007, PLDI '07.

[15]  Amir Globerson,et al.  Metric Learning by Collapsing Classes , 2005, NIPS.

[16]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[17]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[18]  Jason V. Davis,et al.  Cost-Sensitive Decision Tree Learning for Forensic Classification , 2006, ECML.

[19]  Michael I. Jordan,et al.  Scalable statistical bug isolation , 2005, PLDI '05.

[20]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[21]  James R. Larus,et al.  The use of program profiling for software maintenance with applications to the year 2000 problem , 1997, ESEC '97/FSE-5.

[22]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[23]  Inderjit S. Dhillon,et al.  Learning low-rank kernel matrices , 2006, ICML.

[24]  Carla E. Brodley,et al.  Pruning Decision Trees with Misclassification Costs , 1998, ECML.

[25]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[26]  Inderjit S. Dhillon,et al.  Information theoretic clustering of sparse cooccurrence data , 2003, Third IEEE International Conference on Data Mining.

[27]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[28]  Yoram Singer,et al.  Logistic Regression, AdaBoost and Bregman Distances , 2000, Machine Learning.

[29]  Chao Liu,et al.  Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs , 2005, SDM.

[30]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[31]  Thorsten Joachims,et al.  Learning a Distance Metric from Relative Comparisons , 2003, NIPS.

[32]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[33]  Thomas G. Dietterich,et al.  Pruning Improves Heuristic Search for Cost-Sensitive Learning , 2002, ICML.

[34]  Brad Calder,et al.  Using SimPoint for accurate and efficient simulation , 2003, SIGMETRICS '03.

[35]  Xin Guo,et al.  On the optimality of conditional expectation as a Bregman predictor , 2005, IEEE Trans. Inf. Theory.

[36]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[37]  Yoram Singer,et al.  Batch and On-Line Parameter Estimation of Gaussian Mixtures Based on the Joint Entropy , 1998, NIPS.

[38]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[39]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[40]  Michael I. Jordan,et al.  Bug isolation via remote program sampling , 2003, PLDI '03.

[41]  James M. Rehg,et al.  Active learning for automatic classification of software behavior , 2004, ISSTA '04.

[42]  Koby Crammer,et al.  Kernel Design Using Boosting , 2002, NIPS.

[43]  Gregg Rothermel,et al.  Empirical studies of test case prioritization in a JUnit testing environment , 2004, 15th International Symposium on Software Reliability Engineering.

[44]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[45]  Keiichi Tokuda,et al.  Speaker interpolation in HMM-based speech synthesis system , 1997, EUROSPEECH.

[46]  Eser Kandogan,et al.  Field studies of computer system administrators: analysis of system management tools and practices , 2004, CSCW.

[47]  Amitabh Srivastava,et al.  Effectively prioritizing tests in development environment , 2002, ISSTA '02.

[48]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[49]  Peter D. Turney Types of Cost in Inductive Concept Learning , 2002, ArXiv.

[50]  Stephanie Forrest,et al.  Automated response using system-call delays , 2000 .

[51]  Misha Pavel,et al.  Adjustment Learning and Relevant Component Analysis , 2002, ECCV.

[52]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[53]  Thomas J. Ostrand,et al.  Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria , 1994, Proceedings of 16th International Conference on Software Engineering.

[54]  Inderjit S. Dhillon,et al.  Structured metric learning for high dimensional problems , 2008, KDD.

[55]  Gregg Rothermel,et al.  Test case prioritization , 2004 .

[56]  Wei-Ying Ma,et al.  Automated known problem diagnosis with event traces , 2006, EuroSys.

[57]  Steven W. Norton Generating Better Decision Trees , 1989, IJCAI.

[58]  Guy Lebanon,et al.  Metric learning for text documents , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Frank K. Soong,et al.  On divergence based clustering of normal distributions and its application to HMM adaptation , 2003, INTERSPEECH.

[60]  James R. Larus,et al.  Cache-conscious structure layout , 1999, PLDI '99.

[61]  R. Muirhead Aspects of Multivariate Statistical Theory , 1982, Wiley Series in Probability and Statistics.

[62]  Gene H. Golub,et al.  Matrix computations , 1983 .

[63]  M. Lam,et al.  Tracking down software bugs using automatic anomaly detection , 2002, Proceedings of the 24th International Conference on Software Engineering. ICSE 2002.

[64]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[65]  Alessandro Orso,et al.  Regression test selection for Java software , 2001, OOPSLA '01.

[66]  Wei Hong,et al.  Model-based approximate querying in sensor networks , 2005, The VLDB Journal.

[67]  Yuriy Brun,et al.  Finding latent code errors via machine learning over program executions , 2004, Proceedings. 26th International Conference on Software Engineering.

[68]  Helen J. Wang,et al.  Automatic Misconfiguration Troubleshooting with PeerPressure , 2004, OSDI.

[69]  Gregg Rothermel,et al.  Test case prioritization: an empirical study , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[70]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[71]  Ming Tan,et al.  CSL: a cost-sensitive learning system for sensing and grasping objects , 1990, Proceedings., IEEE International Conference on Robotics and Automation.

[72]  Michael I. Jordan,et al.  Statistical Debugging of Sampled Programs , 2003, NIPS.

[73]  Charles Elkan,et al.  The Foundations of Cost-Sensitive Learning , 2001, IJCAI.

[74]  Evgeniy Gabrilovich,et al.  Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5 , 2004, ICML.

[75]  James R. Larus,et al.  Exploiting hardware performance counters with flow and context sensitive profiling , 1997, PLDI '97.

[76]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[77]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[78]  Wei-Ying Ma,et al.  Combining High Level Symptom Descriptions and Low Level State Information for Configuration Fault Diagnosis , 2004, LISA.

[79]  Bin Wang,et al.  Automated support for classifying software failure reports , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[80]  Robert Tibshirani,et al.  Discriminant Adaptive Nearest Neighbor Classification , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[81]  Geoffrey E. Hinton,et al.  Neighbourhood Components Analysis , 2004, NIPS.

[82]  Inderjit S. Dhillon,et al.  Differential Entropic Clustering of Multivariate Gaussians , 2006, NIPS.

[83]  Thomas Ball,et al.  On the limit of control flow analysis for regression test selection , 1998, ISSTA '98.

[84]  C. Stein,et al.  Estimation with Quadratic Loss , 1992 .

[85]  Gregory Tassey,et al.  Prepared for what , 2007 .

[86]  Peter D. Turney Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm , 1994, J. Artif. Intell. Res..