Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query through positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach must scan all the data to apply the classification model to every instance in the catalog, which makes it prohibitively expensive for large-scale databases that serve many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that lets users interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework answers such queries rapidly and at low cost without scanning the entire database. It is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be executed efficiently by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds on a single server, compared to the hours needed by classical scanning-based approaches.
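To make the central idea concrete, the following is a minimal sketch of how decision tree inference can be rewritten as range queries. It is not the authors' index-aware construction scheme: the toy tree `TREE`, the helper names (`positive_boxes`, `build_kdtree`, `range_query`), and the small in-memory k-d tree are all assumptions made for illustration. Each leaf of a decision tree corresponds to an axis-aligned box in feature space, so the objects classified as targets are exactly those that fall into the boxes of the positive leaves, and these boxes can be looked up in a multidimensional index instead of scanning every object.

```python
import math
import random

# Hypothetical toy decision tree, written by hand for illustration.
# Internal node: (feature, threshold, left, right); left holds x[feature] <= threshold.
# Leaf: predicted label, 1 (target) or 0 (non-target).
TREE = (0, 0.5,
        0,
        (1, 0.3,
         0,
         (0, 0.8, 1, 0)))


def predict(node, x):
    """Classical inference: walk one object from the root to a leaf."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def positive_boxes(node, lo, hi):
    """Return the boxes (lo, hi] of all leaves labelled 1 (lo exclusive, hi inclusive)."""
    if not isinstance(node, tuple):
        return [(tuple(lo), tuple(hi))] if node == 1 else []
    f, t, left, right = node
    left_hi = list(hi)
    left_hi[f] = min(hi[f], t)   # left branch: x[f] <= t
    right_lo = list(lo)
    right_lo[f] = max(lo[f], t)  # right branch: x[f] > t
    return positive_boxes(left, lo, left_hi) + positive_boxes(right, right_lo, hi)


def build_kdtree(points, depth=0):
    """Build a simple k-d tree over (point, object_id) pairs (stand-in for a real index)."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def range_query(kd, lo, hi, out):
    """Append ids of all points p with lo[d] < p[d] <= hi[d] in every dimension d."""
    if kd is None:
        return
    (p, pid), axis, left, right = kd
    if all(l < v <= h for l, v, h in zip(lo, p, hi)):
        out.append(pid)
    if p[axis] > lo[axis]:    # left subtree (values <= p[axis]) may still exceed lo
        range_query(left, lo, hi, out)
    if p[axis] <= hi[axis]:   # right subtree (values >= p[axis]) may still be <= hi
        range_query(right, lo, hi, out)


random.seed(0)
data = [((random.random(), random.random()), i) for i in range(1000)]
index = build_kdtree(data)  # built once, reused for every query

boxes = positive_boxes(TREE, [-math.inf] * 2, [math.inf] * 2)
hits = []
for lo, hi in boxes:
    range_query(index, lo, hi, hits)

# Reference answer obtained by scanning and classifying every object.
scan = [i for x, i in data if predict(TREE, x) == 1]
```

In this sketch the index-based answer `hits` contains the same object ids as the full scan `scan`, but the range query can prune subtrees of the index that cannot intersect a positive box. For a random forest, the same extraction would be applied to each tree and the per-tree results combined by voting; how the trees are constrained so that their boxes match the available indexes is the part specific to the proposed framework and is not reproduced here.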
