Scalable and practical probability density estimators for scientific anomaly detection

Originally, astronomers dealt with stars; later, with galaxies. Today, large-scale cosmological structures are so complex that they must first be reduced to more succinct representations. For example, a universe simulation containing millions of objects is characterized by its halo occupation distribution. This progression is typical of many scientific disciplines, and it even resonates in our daily lives: the easier it becomes to collect, store, and manage new data, the harder it becomes to keep up with what it all means. We therefore need tools capable of mining big data sets. This new generation of data-analysis tools must meet three requirements: they must be fast and scale well to large data; their output must be straightforward to understand and easy to visualize; and they must require minimal user input, ideally running completely autonomously once given the data. I focus on clustering, whose main advantage is its generality: separating data into groups of similar objects significantly reduces the perception problem. In this context, I propose new algorithms and tools to meet these challenges: an extremely fast spatial clustering algorithm that can also estimate the number of clusters; a novel and highly comprehensible mixture model; a sub-linear learner for dependency trees; and an active-learning framework that minimizes the burden on a human expert hunting for rare anomalies. I implemented these algorithms and applied them to very large data sets in a wide variety of domains, including astrophysics.
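
To make the first contribution concrete, the sketch below illustrates the model-selection idea behind estimating the number of clusters, in the spirit of X-means (Pelleg and Moore, 2000). It is not the thesis algorithm, which grows centroids through locally BIC-scored splits accelerated by kd-tree-cached sufficient statistics; here we simply fit plain k-means for each candidate k and keep the fit with the lowest spherical-Gaussian BIC. It assumes numpy and scikit-learn, and the helper names (bic_score, choose_k) are hypothetical.

# Minimal sketch: pick k by penalized likelihood, not the thesis's X-means.
import numpy as np
from sklearn.cluster import KMeans

def bic_score(X, labels, centers):
    """BIC of a k-means fit, modeled as spherical Gaussians with pooled variance."""
    n, d = X.shape
    k = centers.shape[0]
    sizes = np.bincount(labels, minlength=k)
    sse = ((X - centers[labels]) ** 2).sum()
    var = sse / (d * max(n - k, 1)) + 1e-12           # pooled ML variance estimate
    loglik = ((sizes * np.log(np.maximum(sizes, 1) / n)).sum()  # mixing weights
              - 0.5 * n * d * np.log(2.0 * np.pi * var)         # Gaussian term
              - 0.5 * d * (n - k))                              # residual term
    n_params = k * (d + 1)                            # centers plus mixing weights
    return -2.0 * loglik + n_params * np.log(n)

def choose_k(X, k_max=10, seed=0):
    """Fit k-means for k = 1..k_max and return the k minimizing BIC."""
    best_k, best_bic = 1, np.inf
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        bic = bic_score(X, km.labels_, km.cluster_centers_)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
                   for c in ([0, 0], [4, 4], [0, 4])])
    print("estimated k:", choose_k(X))                # should report 3

On the synthetic demo, three well-separated Gaussian blobs yield the minimum BIC at k = 3: the mixing-weight term penalizes needless splits while the variance term rewards tighter clusters. The real X-means gains its speed by reusing cached sufficient statistics across candidate splits rather than refitting from scratch at every k.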
