Paper Information - In-memory Query System for Scientific Datasets

In-memory Query System for Scientific Datasets

The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has made traditional post-simulation data processing a bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for minimizing data movement. By taking advantage of the growing memory capacity of supercomputers, we developed an in-memory query system for scientific data analysis. Our approach combines bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations using real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750 GB simulation dataset without transferring data to remote nodes or storage systems. Compared to traditional out-of-core solutions based on parallel file systems, we achieve significantly higher query performance.
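To make the bitmap indexing component more concrete, the following is a minimal single-node sketch of a binned bitmap index answering a range query. It is an illustration only and not the authors' implementation: the function names `build_bitmap_index` and `query_range`, the use of Python integers as uncompressed bitmaps, and the bin edges are all assumptions made for this example. Production systems typically compress the bitmaps and distribute them across nodes.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i of a bitmap is set when values[i] is in that bin.

    Bin b covers the half-open interval [bin_edges[b], bin_edges[b + 1]).
    Bitmaps are stored as plain Python integers (an assumption for brevity).
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for row, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if 0 <= b < len(bitmaps):  # values outside all bins are not indexed
            bitmaps[b] |= 1 << row
    return bitmaps


def query_range(values, bin_edges, bitmaps, lo, hi):
    """Return the sorted row ids whose value satisfies lo <= value < hi."""
    hits = []
    for b, bitmap in enumerate(bitmaps):
        left, right = bin_edges[b], bin_edges[b + 1]
        if right <= lo or left >= hi:
            continue  # this bin cannot contain any match
        # Bins fully inside the range need no look-up of the raw values;
        # boundary bins require a candidate check against the data.
        fully_inside = lo <= left and right <= hi
        row = 0
        while bitmap:
            if bitmap & 1 and (fully_inside or lo <= values[row] < hi):
                hits.append(row)
            bitmap >>= 1
            row += 1
    return sorted(hits)


if __name__ == "__main__":
    data = [0.5, 2.3, 1.7, 3.9, 2.9, 0.1]
    edges = [0, 1, 2, 3, 4]
    index = build_bitmap_index(data, edges)
    print(query_range(data, edges, index, 1.5, 3.0))  # prints [1, 2, 4]
```

The point of the sketch is that only boundary bins touch the raw data, so most of a range query can be answered from the index alone. This is one reason such indexes pair well with keeping data resident in memory.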
