A Comparative Study of Dual-Tree Algorithms for Computing Spatial Distance Histograms

The 2-body correlation function (2-BCF) is a group of statistical measurements that has found applications in many scientific domains. One type of 2-BCF, the Spatial Distance Histogram (SDH), is of vital importance in describing the physical features of natural systems. While a naïve way of computing SDH requires quadratic time, efficient algorithms based on resolving nodes in spatial trees have been developed. A key decision in the design of such algorithms is the choice of underlying data structure: our previous work uses a quad-tree (an oct-tree for 3-dimensional data), and in this paper we study a kd-tree-based solution. Although it is easy to see that both implementations have the same time complexity $O(N^{\frac{2d-1}{d}})$, where $N$ is the number of data points and $d$ is the number of dimensions of the dataset, we conduct a thorough comparison of their actual running time under different scenarios. In particular, we present an analytical model to rigorously quantify the running time of dual-tree algorithms. Our analysis suggests that the kd-tree-based implementation outperforms the quad-/oct-tree solution under a wide range of data sizes and query parameters. Specifically, it achieves a speedup of up to 1.23X over the quad-tree algorithm for 2D data and up to 1.39X over the oct-tree algorithm for 3D data. Results of extensive experiments run on synthetic and real datasets confirm our findings.
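
To make the quadratic-time baseline concrete, the following is a minimal sketch of a brute-force SDH computation in Python. It is not the dual-tree algorithm studied in the paper. The function name `naive_sdh`, its parameters, and the choice to clamp overly large distances into the last bucket are assumptions made for illustration only.

```python
import math
from itertools import combinations


def naive_sdh(points, bucket_width, num_buckets):
    """Brute-force SDH: visit every pair of points once, O(N^2) time.

    Assumption: distances beyond the histogram range are clamped
    into the last bucket.
    """
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        dist = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        idx = min(int(dist // bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist


# Four 2D points yield 4 * 3 / 2 = 6 pairwise distances.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
hist = naive_sdh(points, bucket_width=4.0, num_buckets=3)
print(hist)  # [1, 3, 2]
```

Every one of the N(N-1)/2 pairs is examined individually here. As the abstract states, the dual-tree approach instead resolves nodes in a spatial tree, which is where the choice between quad-/oct-tree and kd-tree comes into play.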
