Distributed anomaly detection using 1‐class SVM for vertically partitioned data

The volume of sensor data collected for monitoring tasks has increased tremendously over the last decade. For example, petabytes of earth science data are collected from modern satellites, in situ sensors, and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These data sets need to be analyzed to find outliers. Extracting information from such rich data sources using advanced data mining methodologies is challenging, not only because of the massive volume of data but also because the data sets are physically stored at different geographical locations, with only a subset of features available at any one location. Moving these petabytes of data to a single location would consume a large amount of bandwidth. To solve this problem, we present a novel algorithm that identifies outliers in the entire data set without moving all the data to a single location. The proposed method centralizes only a very small sample from the data subsets held at the different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy relative to complete centralization at only a fraction of the communication cost. We show that the algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. Its performance is demonstrated on two large publicly available data sets: (i) the NASA MODIS satellite images and (ii) a simulated aviation data set generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS). © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 393–406, 2011 (A shorter version of this paper was published in NASA Conference on Intelligent Data Understanding 2010.)
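The communication pattern described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: a simple absolute z-score stands in for the local 1-class SVM, each site is assumed to hold a single feature column, and all names (`local_outlier_ids`, `centralized_ids`, `top_k`, `sample_size`) are hypothetical. It only shows how vertically partitioned sites might ship a small set of row ids to a central location instead of all their data.

```python
import random
import statistics


def local_outlier_ids(column, top_k):
    """Return the ids of the top_k rows with the largest |z-score|.

    Assumption: this scorer is a stand-in for a local 1-class SVM.
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column) or 1.0  # guard against zero spread
    scores = {i: abs(x - mean) / std for i, x in enumerate(column)}
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])


def centralized_ids(sites, top_k, sample_size, seed=0):
    """Row ids to ship to the central site: the union of each site's
    local candidates plus a small uniform random sample of rows."""
    n_rows = len(sites[0])
    candidates = set()
    for column in sites:
        candidates |= local_outlier_ids(column, top_k)
    rng = random.Random(seed)
    return candidates | set(rng.sample(range(n_rows), sample_size))


def communication_fraction(sites, shipped_ids):
    """Fraction of rows shipped, relative to centralizing every row."""
    return len(shipped_ids) / len(sites[0])


# Hypothetical data: 3 sites, 1000 aligned rows, one feature per site.
data_rng = random.Random(42)
sites = [[data_rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
sites[1][7] = 50.0  # plant an obvious outlier at row 7 of the second site

shipped = centralized_ids(sites, top_k=5, sample_size=20)
print(7 in shipped, communication_fraction(sites, shipped))
```

In this toy setting the central site receives at most 35 of the 1000 rows, and a global detector would then be trained on just those rows. How well such a model matches full centralization depends on the detector and the sampling scheme, which this sketch does not address.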
