Data-intensive spatial filtering in large numerical simulation datasets

We present a query processing framework for the efficient evaluation of spatial filters on large numerical simulation datasets stored in a data-intensive cluster. Previously, filtering of large numerical simulations stored in scientific databases has been impractical owing to the immense data requirements. Instead, filtering is done during simulation or by loading snapshots into the aggregate memory of an HPC cluster. Our system performs filtering within the database and supports large filter widths. We present two complementary methods of execution. I/O streaming computes a batch filter query in a single sequential pass using incremental evaluation of decomposable kernels. Summed volumes generates an intermediate dataset and evaluates each filtered value by accessing only eight points in that dataset. We dynamically choose between these methods depending on workload characteristics. The system allows us to apply filters to large datasets with little overhead: query performance scales with the cluster's aggregate I/O throughput.
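The eight-point lookup behind summed volumes can be illustrated with a small sketch. It is not the system's implementation. It assumes a box (top-hat) kernel, a non-periodic in-memory 3D grid held as nested lists, and half-open index ranges; the names `summed_volume`, `box_sum`, and `box_mean` are hypothetical. The intermediate dataset is a 3D prefix-sum table, the volumetric analogue of a summed-area table, and the sum over any axis-aligned box follows from its eight corner entries by inclusion-exclusion.

```python
from itertools import product


def summed_volume(field):
    """Build a 3D prefix-sum table S with a zero border (hypothetical helper).

    S[i][j][k] is the sum of field[a][b][c] over all a < i, b < j, c < k.
    """
    nx, ny, nz = len(field), len(field[0]), len(field[0][0])
    S = [[[0.0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                # Inclusion-exclusion over the seven already-computed neighbours.
                S[i + 1][j + 1][k + 1] = (
                    field[i][j][k]
                    + S[i][j + 1][k + 1] + S[i + 1][j][k + 1] + S[i + 1][j + 1][k]
                    - S[i][j][k + 1] - S[i][j + 1][k] - S[i + 1][j][k]
                    + S[i][j][k]
                )
    return S


def box_sum(S, lo, hi):
    """Sum of the field over lo[d] <= index < hi[d], using eight reads of S."""
    total = 0.0
    for picks in product((0, 1), repeat=3):
        # Each corner takes either the lower or the upper bound per axis.
        corner = [hi[d] if picks[d] else lo[d] for d in range(3)]
        # The sign alternates with the number of lower bounds chosen.
        sign = -1 if (3 - sum(picks)) % 2 else 1
        total += sign * S[corner[0]][corner[1]][corner[2]]
    return total


def box_mean(S, lo, hi):
    """Box-filtered value: the box sum divided by the number of cells."""
    volume = (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])
    return box_sum(S, lo, hi) / volume


if __name__ == "__main__":
    # A small synthetic 4 x 4 x 4 field stands in for a simulation snapshot.
    field = [[[float(i * 16 + j * 4 + k) for k in range(4)]
              for j in range(4)] for i in range(4)]
    S = summed_volume(field)
    print(box_mean(S, (1, 0, 2), (3, 4, 4)))
```

Once the table is built, the cost of each filtered value does not depend on the filter width, which is why this approach suits large filter widths. The price is one pass to build the table and the storage for it.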
