Efficient evaluation of threshold queries of derived fields in a numerical simulation database

In this paper, we present a method for the efficient evaluation of threshold queries of derived fields for large numerical simulation datasets stored in a cluster of relational databases. The datasets produced by these simulations are in the TB and even PB ranges. Data-intensive computations that examine entire time-steps of the simulation data are impractical to perform locally by the user, taking days or months to iterate over the entire dataset. The integrated method for the evaluation of threshold queries that we have developed achieves scalability through data-parallel execution of the computations on the nodes of an analysis database cluster. We extend the scientific analysis environment with the introduction of an application-aware cache for query results, building on the concept of semantic caching. The cache has little overhead and improves query performance by over an order of magnitude for queries that hit the cache. Caching the results of threshold queries preserves both the I/O and computation effort used to obtain them. In the case of computational turbulence, this allows scientists to quickly focus on the most intense events and interesting regions in any time-step or the dataset as a whole, which greatly speeds up the rate of scientific exploration and discovery.
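
The interaction between data-parallel evaluation and a semantic cache for threshold queries can be illustrated with a minimal Python sketch. It is not the implementation described in this paper: the class `ThresholdCache`, the function `scan_partition`, the in-memory partitions that stand in for database nodes, and the choice of the vector norm as the derived field are all assumptions made for illustration. The idea shown is that the result of a query with threshold t contains the result of any query with a threshold at or above t on the same field and time-step, so such queries can be answered from the cache without rescanning the data.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def norm(vector):
    """Hypothetical derived field: Euclidean norm of a stored vector."""
    return math.sqrt(sum(c * c for c in vector))


def scan_partition(partition, threshold):
    """Evaluate the derived field on one partition; keep points at or above threshold."""
    hits = []
    for point_id, vector in partition:
        value = norm(vector)
        if value >= threshold:
            hits.append((point_id, value))
    return hits


class ThresholdCache:
    """Illustrative semantic cache keyed by (field, timestep).

    Each entry stores the threshold that was evaluated and the points
    that satisfied it. A later query on the same key with an equal or
    higher threshold is a subset of that entry and is served by filtering.
    """

    def __init__(self):
        self._entries = {}  # (field, timestep) -> (threshold, results)
        self.hits = 0
        self.misses = 0

    def query(self, field, timestep, threshold, partitions):
        key = (field, timestep)
        entry = self._entries.get(key)
        if entry is not None and threshold >= entry[0]:
            # Cache hit: no scan and no recomputation of the derived field.
            self.hits += 1
            return [(p, v) for p, v in entry[1] if v >= threshold]

        # Cache miss: scan all partitions in parallel (threads stand in
        # for the nodes of a database cluster), then merge the results.
        self.misses += 1
        with ThreadPoolExecutor() as pool:
            parts = pool.map(lambda part: scan_partition(part, threshold), partitions)
            results = sorted(r for part in parts for r in part)
        self._entries[key] = (threshold, results)
        return results
```

In this sketch a query with a lower threshold than the cached one is treated as a miss and replaces the entry; a fuller design could instead compute only the missing remainder, which is the usual refinement in semantic caching.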
