MIPT: Rapid exploration and evaluation for migrating sequential algorithms to multiprocessing systems with multi-port memories

Prior research has shown that memory load/store instructions account for a significant fraction of both execution time and energy consumption. Extracting the parallelism available at different granularities has therefore been a key approach to designing next-generation highly parallel systems. In this work, we present MIPT, an architecture exploration framework that extracts instruction-level parallelism among memory and ALU operations from a sequential algorithm's execution trace. MIPT's heuristics recommend memory port counts and issue-slot widths for memory and ALU operations. Its custom simulator executes the recommended parallel version of the execution trace and measures the performance improvement over a dual-port memory baseline. MIPT's architecture exploration criterion is to improve performance by utilizing systems with multi-port memories and multi-issue ALUs. Design-space exploration tools such as Multi2Sim [26] and Trimaran [13] already offer customization of multi-port memory architectures, but a designer's initial starting point is usually unclear; MIPT can suggest that initial starting point for customization in such systems. In addition, given two implementations of the same application, the MIPT simulator can compare their execution times.
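As a rough illustration of the kind of trace evaluation the abstract describes (not MIPT's actual interface, whose trace format and heuristics are not specified here), the sketch below greedily packs the operations of a sequential trace into cycles, subject to data dependencies and per-cycle limits on memory ports and ALU issue slots. All names and the trace encoding are hypothetical.

```python
# Hypothetical sketch of trace scheduling under memory-port and ALU-slot
# limits. Op.kind is "mem" or "alu"; Op.deps lists ids of earlier ops whose
# results this op consumes. One-cycle latency is assumed for every op.
from collections import namedtuple

Op = namedtuple("Op", ["id", "kind", "deps"])

def schedule(trace, mem_ports, alu_slots):
    """Greedy list scheduling of a trace; returns the total cycle count."""
    done_cycle = {}          # op id -> cycle in which the op completed
    pending = list(trace)
    cycle = 0
    while pending:
        mem_used = alu_used = 0
        issued = []
        for op in pending:
            # ready when every dependency completed in an earlier cycle
            if all(d in done_cycle and done_cycle[d] < cycle for d in op.deps):
                if op.kind == "mem" and mem_used < mem_ports:
                    mem_used += 1
                elif op.kind == "alu" and alu_used < alu_slots:
                    alu_used += 1
                else:
                    continue  # no free slot of this kind this cycle
                issued.append(op)
        for op in issued:
            done_cycle[op.id] = cycle
            pending.remove(op)
        cycle += 1
    return cycle

# Four independent loads feeding one dependent ALU op: with two memory
# ports the loads need two cycles; with four ports they issue together.
trace = [Op(0, "mem", []), Op(1, "mem", []), Op(2, "mem", []),
         Op(3, "mem", []), Op(4, "alu", [0, 1, 2, 3])]
dual_port = schedule(trace, mem_ports=2, alu_slots=1)
quad_port = schedule(trace, mem_ports=4, alu_slots=1)
```

Comparing `dual_port` and `quad_port` mirrors the paper's evaluation criterion: the same trace is replayed under different port counts and the cycle counts are compared against the dual-port baseline.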

[1] Monica S. Lam et al. Efficient and exact data dependence analysis, 1991, PLDI '91.

[2] Michael Stonebraker et al. A comparison of approaches to large-scale data analysis, 2009, SIGMOD Conference.

[3] Henk Corporaal et al. Exploring processor parallelism: Estimation methods and optimization strategies, 2013, 2013 IEEE 16th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS).

[4] William Pugh et al. The Omega test: A fast and practical integer programming algorithm for dependence analysis, 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[5] Joseph A. Fisher et al. Trace Scheduling: A Technique for Global Microcode Compaction, 1981, IEEE Transactions on Computers.

[6] Michael D. Smith et al. Boosting beyond static scheduling in a superscalar processor, 1990, Proceedings of the 17th Annual International Symposium on Computer Architecture.

[7] Wang Zuo et al. An Intelligent Multi-Port Memory, 2008, IITA 2008.

[8] Dong Yang et al. NativeTask: A Hadoop compatible framework for high performance, 2013, 2013 IEEE International Conference on Big Data.

[9] Onur Mutlu et al. A Case for MLP-Aware Cache Replacement, 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[10] Subramanian Ramaswamy et al. Data trace cache: an application specific cache architecture, 2006, SIGARCH Comput. Archit. News.

[11] Norman P. Jouppi et al. CACTI 3.0: an integrated cache timing, power, and area model, 2001.

[12] Harish Patil et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.

[13] Scott A. Mahlke et al. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism, 2004, LCPC.

[14] Zvi Drezner et al. An Efficient Genetic Algorithm for the p-Median Problem, 2003, Ann. Oper. Res.

[15] Edward S. Davidson et al. Highly concurrent scalar processing, 1986, ISCA 1986.

[16] Samuel Williams et al. The Landscape of Parallel Computing Research: A View from Berkeley, 2006.

[17] Vicki H. Allan et al. Software pipelining, 1995, CSUR.

[18] Christoforos E. Kozyrakis et al. Understanding sources of inefficiency in general-purpose chips, 2010, ISCA.

[19] J. Ramanujam et al. An Effective Solution to Task Scheduling and Memory Partitioning for Multiprocessor System-on-Chip, 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[20] Monica S. Lam et al. Efficient context-sensitive pointer analysis for C programs, 1995, PLDI '95.

[21] Indranil Gupta et al. Breaking the MapReduce stage barrier, 2010, 2010 IEEE International Conference on Cluster Computing.

[22] Fred G. Gustavson et al. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition, 1978, TOMS.

[23] Phillip Stanley-Marbell et al. Parallelism and data movement characterization of contemporary application classes, 2011, SPAA '11.

[24] Monica S. Lam et al. Retrospective: Software Pipelining: An Effective Scheduling Technique for VLIW Machines, 1998.

[25] Hsien-Hsin S. Lee et al. 3D-MAPS: 3D Massively parallel processor with stacked memory, 2012, 2012 IEEE International Solid-State Circuits Conference.

[26] Pedro López et al. Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors, 2007, 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07).

[27] Alan R. Earls et al. Digital Equipment Corporation, 2004, Analytical Chemistry.

[28] Jason N. Dale et al. Cell Broadband Engine Architecture and its first implementation - A performance view, 2007, IBM J. Res. Dev.

[29] Raphael Yuster et al. Fast sparse matrix multiplication, 2004, TALG.

[30] Hans Jurgen Mattausch et al. Fast quadratic increase of multiport-storage-cell area with port number, 1999.