Trusted Performance Analysis on Systems With a Shared Memory

With the increasing complexity of both data structures and computer architectures, applications need fine tuning to achieve the expected execution time. Performance tuning is traditionally based on the analysis of performance data. The analysis results may be inaccurate, depending on the quality of the data and the analysis approaches applied. Application developers may therefore ask: can we trust the analysis results? This paper presents our research on performance optimization of the memory system, with a focus on cache locality on shared memory systems and memory locality on distributed shared memory systems. The quality of the analysis is ensured by using both real performance data acquired at runtime and well-established data analysis algorithms from the fields of bioinformatics and data mining. We verified the proposed approaches by optimizing a set of benchmark applications. The experimental results show a significant performance gain.
