Big Holes in Big Data: A Monte Carlo Algorithm for Detecting Large Hyper-Rectangles in High Dimensional Data

We present the first algorithm for finding holes in high-dimensional data that runs in polynomial time with respect to the number of dimensions; previous algorithms are exponential. Finding large empty rectangles or boxes in a set of points in 2D and 3D space has been well studied, and efficient algorithms exist for identifying empty regions in these low-dimensional spaces. Unfortunately, such efficiency is lacking in higher dimensions, where the problem has been shown to be NP-complete when the number of dimensions is part of the input. Applications of algorithms that find large empty spaces include big data analysis, recommender systems, automated knowledge discovery, and query optimization. Our Monte Carlo-based algorithm discovers interesting maximal empty hyper-rectangles in cases where dimensionality and input size would otherwise make analysis impractical. Its run time is polynomial in both the size of the input and the number of dimensions. We apply the algorithm to a 39-dimensional data set of protein structures and discover interesting properties that we believe could not be inferred otherwise.
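To make the general idea of a randomized search for empty hyper-rectangles concrete, the following is a minimal illustrative sketch and not the algorithm of this paper, whose details the abstract does not give. It assumes a simple stand-in strategy: draw a random seed location, start from the full bounding box, and for every data point still strictly inside the box, shrink one randomly chosen side so that the point ends up on the boundary while the seed stays inside. All function names (`empty_box_around`, `largest_empty_box`, `volume`, `is_empty`) and parameters are hypothetical. Each trial costs O(n d) for n points in d dimensions, so the sketch is polynomial in both, but the boxes it returns are empty without being guaranteed maximal.

```python
import random

def empty_box_around(points, bounds, seed, rng):
    """Shrink the bounding box until no data point lies strictly inside it.

    Assumption: a simplified stand-in, not the paper's procedure.
    """
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    order = list(points)
    rng.shuffle(order)  # random processing order gives different boxes per trial
    for p in order:
        # Skip points that are already outside or on the boundary of the box.
        if not all(l < x < h for l, x, h in zip(lo, p, hi)):
            continue
        # Dimensions along which the point can be separated from the seed.
        dims = [k for k in range(len(bounds)) if p[k] != seed[k]]
        if not dims:
            return None  # the seed coincides with a data point
        k = rng.choice(dims)
        if p[k] < seed[k]:
            lo[k] = p[k]  # move the lower face up to the point
        else:
            hi[k] = p[k]  # move the upper face down to the point
    return lo, hi

def volume(box):
    """Product of the side lengths of an axis-parallel box."""
    lo, hi = box
    v = 1.0
    for l, h in zip(lo, hi):
        v *= h - l
    return v

def is_empty(box, points):
    """True if no point lies strictly inside the box."""
    lo, hi = box
    return not any(all(l < x < h for l, x, h in zip(lo, p, hi)) for p in points)

def largest_empty_box(points, bounds, trials=200, rng_seed=0):
    """Monte Carlo loop: keep the largest empty box found over many random seeds."""
    rng = random.Random(rng_seed)
    best = None
    for _ in range(trials):
        seed = [rng.uniform(a, b) for a, b in bounds]
        box = empty_box_around(points, bounds, seed, rng)
        if box is not None and (best is None or volume(box) > volume(best)):
            best = box
    return best
```

Because the work per trial grows linearly with the number of dimensions, a loop of this kind remains usable in settings such as the 39-dimensional case mentioned above, although the quality of the boxes found depends on the number of trials.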
