Computing Spatial Distance Histograms for Large Scientific Data Sets On-the-Fly

This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). Computing an SDH query with a brute-force method takes time quadratic in the number of particles. Such queries are often executed continuously over a sequence of time periods, which further increases the total computation time. We propose a highly efficient approximate algorithm that computes the SDH over consecutive time periods with provable error bounds. The key idea of our algorithm is to derive the statistical distribution of distances from the spatial and temporal characteristics of the particles. After organizing the data into a Quad-tree based structure, we acquire the spatiotemporal characteristics of the particles in each node of the tree to determine the particles' spatial distribution as well as their temporal locality across consecutive time periods. We also report our efforts in implementing and optimizing the above algorithm on graphics processing units (GPUs) as a means to further improve efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical analysis and by the results of extensive experiments using data generated from real simulation studies.
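To make the quadratic cost of the brute-force baseline concrete, the following is a minimal sketch of a brute-force SDH computation. It is not the approximate algorithm proposed in the paper. The function name `brute_force_sdh`, the parameters `bucket_width` and `num_buckets`, and the choice to clamp out-of-range distances into the last bucket are all assumptions made for illustration.

```python
import math
from itertools import combinations


def brute_force_sdh(points, bucket_width, num_buckets):
    """Brute-force spatial distance histogram (illustrative sketch).

    points: sequence of equal-length coordinate tuples (2D or 3D).
    Visits all N*(N-1)/2 pairs, so the running time is quadratic in N.
    """
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)  # Euclidean distance between the two points
        # Assumption: distances beyond the last bucket are clamped into it.
        idx = min(int(d / bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist


if __name__ == "__main__":
    pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    print(brute_force_sdh(pts, bucket_width=4.0, num_buckets=3))
```

Because every pair of particles is visited once, doubling the number of particles roughly quadruples the work, and repeating the query for each time period multiplies that cost again. This is the overhead an approximate, tree-based method aims to avoid.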
