Performance Factor Analysis and Scope of Optimization for Big Data Processing on Cluster

The use of computational clusters for large-scale Big Data processing has attracted attention as a technology trend because of its time efficiency. A modern cluster, equipped with the latest multi-core and many-core distributed shared architecture, a high-speed interconnect and file system, delivers high performance through message-passing and multi-threading parallel approaches, and handles batch, micro-batch and stream processing of high-dimensional, massive datasets. However, running data-intensive Big Data applications on a compute-centric cluster challenges its performance because of several runtime overheads. To alleviate these bottlenecks and exploit the full potential of the cluster, this paper presents a state-of-the-practice, performance-oriented technical analysis covering all relevant aspects, in the context of terascale Big Data processing on the TeraFLOPS cluster PARAM-Kanchenjunga. It identifies the major factors influencing performance, that is, the sources of overhead related to computation, communication or IPC, memory, I/O contention, scheduling, load imbalance, synchronization, latency and network jitter, and determines their impact. Because existing approaches are found insufficient to achieve the possible speedup, advanced methods are offered for consideration as optimization alternatives, including RDMA-enabled libraries, PFS, MPI-integrated extensions, loop tiling and hybrid parallelization. The paper is intended to assist in preparing performance-aware design of experiments and performance modeling.
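As an illustration of one optimization named in the abstract, loop tiling, the following is a minimal sketch and not code from the paper. The function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical, chosen only for this example. Tiling restructures a loop nest so that it works on small blocks that fit in cache. Pure Python will not show the cache benefit, so the sketch demonstrates only the loop structure and that the result is unchanged.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A x B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, with all three loops split into blocks of size `tile`.

    `tile` is an illustrative parameter; in compiled code it would be
    tuned so that the working blocks of A, B and C fit in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over blocks; inner loops stay inside one block.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # min(...) handles edge blocks when a dimension is not
                # a multiple of the tile size.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions accumulate each output element in the same order of `k`, so they return identical results. Only the traversal order across elements differs, which is what improves data reuse in a compiled language.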
