In-memory Query System for Scientific Datasets

The growing gap between compute performance and I/O bandwidth, coupled with increasing data volumes, has turned traditional post-simulation data processing into a bottleneck. In-situ computing and query-driven data analysis are therefore important techniques for minimizing data movement. Taking advantage of the growing memory capacity of supercomputers, we developed an in-memory query system for scientific data analysis. Our approach combines bitmap indexing, spatial data layout reorganization, distributed shared memory, and location-aware parallel execution. Our evaluations on real scientific datasets showed that we can aggregate the memory capacity of thousands of compute nodes to analyze a 750GB simulation dataset without transferring data to remote nodes or storage systems. Compared with traditional solutions based on out-of-core parallel file systems, we achieve significantly higher query performance.
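To make the bitmap indexing component concrete, the following is a minimal single-node sketch of a binned bitmap index answering a range query. It is an illustration under stated assumptions, not the system described above: the function names (`build_bitmap_index`, `query_at_least`), the use of Python integers as uncompressed bitmaps, and the sample values are all hypothetical, and the sketch leaves out compression, layout reorganization, and distribution across nodes.

```python
from bisect import bisect_right

def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i of a bitmap is set when values[i] falls in that bin.

    bin_edges must be sorted. Bin k holds values v with bin_edges[k-1] <= v < bin_edges[k];
    bin 0 holds everything below the first edge, the last bin everything from the last edge up.
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for i, v in enumerate(values):
        bitmaps[bisect_right(bin_edges, v)] |= 1 << i
    return bitmaps

def query_at_least(values, bin_edges, bitmaps, threshold):
    """Return the row ids whose value is >= threshold (hypothetical helper)."""
    boundary = bisect_right(bin_edges, threshold)
    # Bins above the boundary bin qualify entirely: OR their bitmaps together.
    hits = 0
    for bitmap in bitmaps[boundary + 1:]:
        hits |= bitmap
    # The boundary bin may straddle the threshold, so its rows are
    # candidates that must be checked against the raw values.
    candidates = bitmaps[boundary]
    rows = []
    for i in range(len(values)):
        if (hits >> i) & 1:
            rows.append(i)
        elif (candidates >> i) & 1 and values[i] >= threshold:
            rows.append(i)
    return rows

# Made-up sample data, for illustration only.
energy = [0.5, 2.7, 1.2, 3.9, 2.1, 2.5]
edges = [1.0, 2.0, 3.0]
index = build_bitmap_index(energy, edges)
selected = query_at_least(energy, edges, index, 2.5)
```

In this sketch only the rows in the single boundary bin require a look at the raw values; every other row is resolved from the bitmaps alone, which is the property that makes such indexes attractive for query-driven analysis.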
