Performance analysis of a dual-tree algorithm for computing spatial distance histograms

Many scientific and engineering fields produce large volumes of spatiotemporal data. The storage, retrieval, and analysis of such data pose great challenges for database system design. Analysis of scientific spatiotemporal data often involves computing functions of all point-to-point interactions. One such analytic, the Spatial Distance Histogram (SDH), is of vital importance to scientific discovery. Recently, algorithms for efficient SDH processing in large-scale scientific databases have been proposed. These algorithms adopt a recursive tree-traversing strategy that processes the point-to-point distances between visited tree nodes in batches, and thus require less time than the brute-force approach, in which every pairwise distance must be computed. Despite the promising experimental results, the complexity of such algorithms has not been thoroughly studied. In this paper, we present an analysis of such algorithms based on a geometric modeling approach. The main technique is to transform the analysis of point counts into a problem of quantifying the area of the regions where pairwise distances can be processed in batches by the algorithm. From the analysis, we conclude that the number of pairwise distances left to be processed decreases exponentially as more levels of the tree are visited. This leads to a proof of a time complexity lower than the quadratic time needed by a brute-force algorithm and builds the foundation for a constant-time approximate algorithm. Our model is also general in that it works for a wide range of point spatial distributions, histogram types, and space-partitioning options in building the tree.
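To make the batching idea concrete, the following is a minimal Python sketch of a dual-tree SDH computation next to the brute-force baseline. It is an illustration under stated assumptions, not the implementation analyzed in the paper: it assumes 2D points in a square domain, a quadtree-style split into four equal cells, equal-width buckets, and a fixed maximum depth. All function and parameter names (`dual_tree_sdh`, `resolve_pair`, `max_depth`, and so on) are hypothetical.

```python
import math
from itertools import combinations, product

def bucket(d, width, n_buckets):
    # Index of the equal-width bucket containing distance d (last bucket is open-ended).
    return min(int(d // width), n_buckets - 1)

def brute_force_sdh(points, width, n_buckets):
    # Baseline: compute every pairwise distance individually (quadratic time).
    hist = [0] * n_buckets
    for (px, py), (qx, qy) in combinations(points, 2):
        hist[bucket(math.hypot(px - qx, py - qy), width, n_buckets)] += 1
    return hist

def split(cell):
    # Partition a square cell (x0, y0, size, points) into its non-empty quadrants.
    x0, y0, size, pts = cell
    half = size / 2
    kids = {(i, j): [] for i in (0, 1) for j in (0, 1)}
    for x, y in pts:
        kids[(int(x >= x0 + half), int(y >= y0 + half))].append((x, y))
    return [(x0 + i * half, y0 + j * half, half, p) for (i, j), p in kids.items() if p]

def distance_range(a, b):
    # Smallest and largest possible distance between points of two equal-size cells.
    (ax, ay, s, _), (bx, by, _, _) = a, b
    gap_x = max(0.0, ax - (bx + s), bx - (ax + s))
    gap_y = max(0.0, ay - (by + s), by - (ay + s))
    far_x = max(ax + s - bx, bx + s - ax)
    far_y = max(ay + s - by, by + s - ay)
    return math.hypot(gap_x, gap_y), math.hypot(far_x, far_y)

def resolve_pair(a, b, hist, width, n_buckets, depth):
    lo, hi = distance_range(a, b)
    if bucket(lo, width, n_buckets) == bucket(hi, width, n_buckets):
        # Batch step: every cross-cell distance falls into the same bucket.
        hist[bucket(lo, width, n_buckets)] += len(a[3]) * len(b[3])
    elif depth == 0:
        # Depth limit reached: fall back to individual distances.
        for (px, py), (qx, qy) in product(a[3], b[3]):
            hist[bucket(math.hypot(px - qx, py - qy), width, n_buckets)] += 1
    else:
        kids_b = split(b)
        for ca in split(a):
            for cb in kids_b:
                resolve_pair(ca, cb, hist, width, n_buckets, depth - 1)

def intra_cell(cell, hist, width, n_buckets, depth):
    # Handle pairs whose two points both lie inside the same cell.
    pts = cell[3]
    if depth == 0 or len(pts) < 2:
        for (px, py), (qx, qy) in combinations(pts, 2):
            hist[bucket(math.hypot(px - qx, py - qy), width, n_buckets)] += 1
        return
    kids = split(cell)
    for kid in kids:
        intra_cell(kid, hist, width, n_buckets, depth - 1)
    for ka, kb in combinations(kids, 2):
        resolve_pair(ka, kb, hist, width, n_buckets, depth - 1)

def dual_tree_sdh(points, domain_size, width, n_buckets, max_depth):
    # Points are assumed to lie in the square [0, domain_size) x [0, domain_size).
    hist = [0] * n_buckets
    intra_cell((0.0, 0.0, domain_size, list(points)), hist, width, n_buckets, max_depth)
    return hist
```

For example, `dual_tree_sdh(points, 1.0, 0.1, 15, max_depth=5)` should return the same histogram as `brute_force_sdh(points, 0.1, 15)` for points in the unit square. The sketch recomputes the cell splits on every call for brevity; a practical version would build the tree once and store per-node point counts.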
