Paper information - Scalable in situ scientific data encoding for analytical query processing

Scalable in situ scientific data encoding for analytical query processing

The process of scientific data analysis in high-performance computing environments has been evolving along with the advancement of computing capabilities. With the onset of exascale computing, the increasing gap between compute performance and I/O bandwidth has rendered the traditional method of post-simulation processing a tedious process. Despite the challenges due to increased data production, there exists an opportunity to benefit from "cheap" computing power to perform query-driven exploration and visualization during simulation time. To accelerate such analyses, applications traditionally augment raw data with large indexes, post-simulation, which are then repeatedly utilized for data exploration. However, the generation of current state-of-the-art indexes involves compute- and memory-intensive processing, thus rendering them inapplicable in an in situ context. In this paper, we propose DIRAQ, a parallel in situ, in-network data encoding and reorganization technique that enables the transformation of simulation output into a query-efficient form, with negligible runtime overhead to the simulation run. DIRAQ begins with an effective core-local, precision-based encoding approach, which incorporates an embedded compressed index that is 3-6x smaller than current state-of-the-art indexing schemes. DIRAQ then applies an in-network index merging strategy, enabling the creation of aggregated indexes ideally suited for spatial-context querying that speed up query responses by up to 10x versus alternative techniques. We also employ a novel aggregation strategy that is topology-, data-, and memory-aware, resulting in efficient I/O and yielding overall end-to-end encoding and I/O time that is less than that required to write the raw data with MPI collective I/O.
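To make the idea of a precision-based encoding with an embedded index more concrete, the following is a minimal sketch and not DIRAQ's actual implementation. It assumes that values are binned by the high-order bytes of their IEEE 754 double representation, that each bin keeps an inverted list of element positions, and that all data values and query bounds are non-negative and finite. The names `bin_key`, `encode`, `merge`, and `range_query`, and the choice of two significant bytes, are hypothetical and chosen only for illustration.

```python
import struct
from collections import defaultdict


def bin_key(value, significant_bytes=2):
    """Return the top `significant_bytes` bytes of a double as an integer.

    Assumption: bins are defined by high-order bytes; for non-negative
    doubles this key is monotonically non-decreasing in the value.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> (8 * (8 - significant_bytes))


def encode(values, significant_bytes=2):
    """Build a core-local inverted index: bin key -> list of positions."""
    index = defaultdict(list)
    for pos, v in enumerate(values):
        index[bin_key(v, significant_bytes)].append(pos)
    return dict(index)


def merge(offset_index_pairs):
    """Merge local indexes into one, shifting positions by each offset.

    Hypothetical stand-in for index merging across cores: each pair is
    (global offset of the local block, local index).
    """
    merged = defaultdict(list)
    for offset, index in offset_index_pairs:
        for key, positions in index.items():
            merged[key].extend(p + offset for p in positions)
    return dict(merged)


def range_query(values, index, lo, hi, significant_bytes=2):
    """Return sorted positions p with lo <= values[p] <= hi (0 <= lo <= hi)."""
    k_lo = bin_key(lo, significant_bytes)
    k_hi = bin_key(hi, significant_bytes)
    hits = []
    for key, positions in index.items():
        if k_lo < key < k_hi:
            # Bin lies strictly inside the range: no value check needed.
            hits.extend(positions)
        elif key == k_lo or key == k_hi:
            # Boundary bin: check the actual values.
            hits.extend(p for p in positions if lo <= values[p] <= hi)
    return sorted(hits)
```

In this sketch, only the boundary bins require looking at the data values; every bin strictly between the two boundary keys is answered from the index alone, which is the kind of saving a bin-based index is meant to provide.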
