FlexAnalytics: A Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement

Increasingly large-scale applications are generating an unprecedented amount of data. However, the widening gap between computation and I/O capacity on High End Computing (HEC) machines creates a severe bottleneck for data analysis. Instead of moving data from its source to the output storage, in-situ analytics processes output data while simulations are running. However, in-situ data analysis contends with the simulations for computing resources, and this contention severely degrades simulation performance on HEC platforms. Since different data processing strategies have different impacts on performance and cost, there is a consequent need for flexibility in the location of data analytics. In this paper, we explore and analyze several potential data-analytics placement strategies along the I/O path. To find the strategy that best reduces data movement in a given situation, we propose a flexible data analytics (FlexAnalytics) framework. Based on this framework, a FlexAnalytics prototype system is developed for analytics placement. The FlexAnalytics system enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and visualization, as well as for large-scale data transfer. Two use cases, scientific data compression and remote visualization, are used in the study to verify the performance of FlexAnalytics. Experimental results demonstrate that the FlexAnalytics framework increases data transfer bandwidth and improves application end-to-end transfer performance.
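The trade-off behind analytics placement can be illustrated with a small sketch. It is not the FlexAnalytics implementation: the cost model (compression time plus transfer time plus decompression time), the function names `end_to_end_seconds` and `pick_strategy`, and all throughput figures are assumptions chosen only to show why the best strategy depends on the situation.

```python
def end_to_end_seconds(size_bytes, link_bw, ratio=1.0,
                       compress_bw=None, decompress_bw=None):
    """Estimate end-to-end transfer time under a simple additive cost model.

    Assumed (hypothetical) model, not taken from the paper:
      time = compress + transfer of reduced data + decompress
    All bandwidths are in bytes per second; ratio is reduced size / original size.
    Compression and decompression throughput are both measured against the
    original (uncompressed) size.
    """
    seconds = 0.0
    if compress_bw is not None:
        seconds += size_bytes / compress_bw
    seconds += size_bytes * ratio / link_bw
    if decompress_bw is not None:
        seconds += size_bytes / decompress_bw
    return seconds


def pick_strategy(size_bytes, link_bw, strategies):
    """Return the name of the strategy with the lowest estimated time.

    `strategies` maps a name to keyword arguments for end_to_end_seconds.
    """
    return min(
        strategies,
        key=lambda name: end_to_end_seconds(size_bytes, link_bw, **strategies[name]),
    )


# Illustrative numbers only.
STRATEGIES = {
    "send_raw": {},
    "compress_first": {"ratio": 0.5, "compress_bw": 5e8, "decompress_bw": 1e9},
}

if __name__ == "__main__":
    one_gb = 1e9
    for link_bw in (1e8, 1e9):  # slow link vs. fast link
        print(link_bw, pick_strategy(one_gb, link_bw, STRATEGIES))
```

With these assumed numbers, reducing the data before transfer pays off on the slow link but not on the fast one, which is the kind of situation-dependent choice a flexible placement framework is meant to address.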
