Surprise Detection in Multivariate Astronomical Data

Astronomers systematically study the sky with large sky surveys. Modern surveys produce from hundreds of terabytes (TB) to 100 or more petabytes (PB) of data, in both the image archive and the object catalogs. For example, the LSST will produce a 20–40 PB catalog database. Large sky surveys have enormous potential to enable countless astronomical discoveries. Such discoveries will span the full spectrum of statistics: from rare one-in-a-billion (or one-in-a-trillion) object types to complete statistical and astrophysical specifications of many classes of objects, based upon millions of instances of each class. This growth in data volume requires more effective algorithms for knowledge discovery and extraction, among them algorithms for outlier (novelty, surprise, or anomaly) detection. Outlier detection algorithms enable scientists to discover the most “interesting” scientific knowledge hidden within large, high-dimensional datasets: the “unknown unknowns”. Effective outlier detection is essential for the rapid discovery of potentially interesting or hazardous events. Unexpected conditions emerging in hardware, software, or network resources must be detected, characterized, and analyzed as soon as possible for system health and safety reasons. Likewise, emerging behaviors and variations in scientific targets should be detected and characterized promptly to enable rapid decision support in response to such events. We have developed a new algorithm for outlier detection, KNN-DD (K-Nearest Neighbor Data Distributions). We report results from preliminary experiments measuring the algorithm’s precision and recall for known outliers and its ability to distinguish between characteristically different data distributions among different classes of objects.
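As a minimal sketch of how precision and recall for known outliers could be computed, assuming a detector returns a list of flagged object identifiers (the function name `precision_recall` and the object IDs below are hypothetical and are not taken from the experiments):

```python
def precision_recall(flagged, known_outliers):
    """Return (precision, recall) of flagged objects against known outliers."""
    flagged = set(flagged)
    known = set(known_outliers)
    true_positives = len(flagged & known)
    # Precision: fraction of flagged objects that are known outliers.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: fraction of known outliers that were flagged.
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical catalog object IDs, for illustration only.
flagged_ids = [3, 7, 11, 42]
known_outlier_ids = [7, 42, 99]

p, r = precision_recall(flagged_ids, known_outlier_ids)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.50 recall=0.67"
```

Here two of the four flagged objects are known outliers (precision 0.5), and two of the three known outliers were flagged (recall about 0.67).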
