SuRF: Identification of Interesting Data Regions with Surrogate Models

Several data mining tasks repeatedly inspect multidimensional data regions, each summarized by a statistic. The value of this statistic (e.g., region population size, order moments) is used to classify the region's interestingness. Such regions can be extracted naively by scanning the entire data space; however, this is extremely time-consuming and demanding of compute resources. This paper studies the reverse problem: analysts provide a cut-off value for a statistic of interest, and our proposed framework efficiently identifies the multidimensional regions whose statistic exceeds (or falls below) that cut-off, according to the user's needs. However, as data size and dimensionality increase, such a task inevitably becomes laborious and costly. To alleviate this cost, our solution, coined SuRF (SUrrogate Region Finder), leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest. It then uses evolutionary multi-modal optimization to identify regions of interest effectively and efficiently, regardless of data size and dimensionality. The accuracy, efficiency, and scalability of our approach are demonstrated through experiments on synthetic and real-world datasets and through comparison with other methods.
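The two-stage idea described above (fit a surrogate on past region evaluations, then search the surrogate for regions that pass the cut-off) can be illustrated with a minimal sketch. It is not the SuRF implementation: it assumes 1-D data, a region encoded as (center, half-width), a count statistic, a k-nearest-neighbour surrogate, and a simple mutate-and-keep-if-better loop as a crude stand-in for evolutionary multi-modal optimization. All names and constants are hypothetical.

```python
import random

def count_in_region(data, center, half_width):
    """True statistic: number of 1-D points in [center - hw, center + hw]."""
    lo, hi = center - half_width, center + half_width
    return sum(1 for x in data if lo <= x <= hi)

def knn_surrogate(history, k=5):
    """Build a predictor that averages the statistic of the k most similar
    past regions (an assumed stand-in for the paper's surrogate models)."""
    def predict(region):
        c, w = region
        nearest = sorted(
            history,
            key=lambda h: (h[0][0] - c) ** 2 + (h[0][1] - w) ** 2,
        )[:k]
        return sum(stat for _, stat in nearest) / len(nearest)
    return predict

def find_regions(predict, threshold, rng, pop_size=40, generations=30):
    """Each candidate competes only with its own mutated child, so separate
    candidates can settle on separate peaks of the surrogate."""
    pop = [(rng.uniform(0, 10), rng.uniform(0.2, 1.0)) for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = []
        for c, w in pop:
            child = (
                min(10.0, max(0.0, c + rng.gauss(0, 0.3))),
                min(1.0, max(0.2, w + rng.gauss(0, 0.05))),
            )
            # Only the surrogate is queried here; the data is never scanned.
            next_pop.append(child if predict(child) > predict((c, w)) else (c, w))
        pop = next_pop
    return [r for r in pop if predict(r) >= threshold]

rng = random.Random(0)
# Synthetic data with two dense areas, around 2 and around 7.
data = [rng.gauss(2, 0.3) for _ in range(200)] + [rng.gauss(7, 0.3) for _ in range(200)]

# "Historical region evaluations": random regions with their true statistic.
history = []
for _ in range(300):
    region = (rng.uniform(0, 10), rng.uniform(0.2, 1.0))
    history.append((region, count_in_region(data, *region)))

THRESHOLD = 50  # analyst-supplied cut-off on the region population
predict = knn_surrogate(history)
found = find_regions(predict, THRESHOLD, rng)

for c, w in found[:5]:
    print(f"center={c:.2f} half_width={w:.2f} "
          f"predicted={predict((c, w)):.1f} true={count_in_region(data, c, w)}")
```

The search loop touches only the surrogate, so its cost does not depend on the number of data points; the true statistic is computed just for the few reported regions, as a sanity check on the surrogate's predictions.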
