Customizing VLIW processors from dynamically profiled execution traces

The design philosophy of VLIW processors is to maximize instruction-level parallelism (ILP) at every level, from the compiler and machine code all the way down to the memory and computational blocks. VLIW tailoring has therefore been an important research area, because non-tailored VLIWs cannot fully utilize the available hardware resources. This paper introduces a method that customizes VLIW processors by processing execution traces obtained through dynamic profiling. Our method distinguishes memory instructions from non-memory instructions while processing the traces. Customizing the VLIW multi-port memory according to the memory operations provides better memory utilization and higher performance. Moreover, exploration of the multi-port memory configuration is coupled with data-path exploration, namely the number and composition of execution units, for efficient extraction of ILP. We have designed a genetic algorithm to explore the large design space formed by the execution traces. Our experiments show that our method improves on state-of-the-art VLIW customization algorithms and finds more compact memory topologies. In addition, we compare the execution performance, power consumption, average parallelism, and area-delay product of our VLIW model with those of a RISC processor model on the evaluated benchmarks using our simulator framework.
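As a rough illustration of the idea described above, the following Python sketch explores a tiny design space of memory ports and execution units with a simple genetic algorithm. It is not the paper's algorithm: the trace format, the greedy scheduler, the area weights in `cost`, and all names (`TRACE`, `schedule_cycles`, `cost`, `explore`) are assumptions made only for this example.

```python
import random

# Toy execution trace: (operation kind, destination, sources).
# Hypothetical format chosen for this sketch, not the paper's trace format.
TRACE = [
    ("load", "a", ()), ("load", "b", ()), ("load", "c", ()), ("load", "d", ()),
    ("alu", "e", ("a", "b")), ("alu", "f", ("c", "d")),
    ("alu", "g", ("e", "f")), ("store", "m", ("g",)),
]

def schedule_cycles(trace, mem_ports, alus):
    """Greedy in-order scheduling; memory and non-memory ops use separate resources."""
    ready_at = {}   # value name -> first cycle in which it can be consumed
    used = {}       # (cycle, resource class) -> slots already occupied
    last = 0
    for kind, dest, srcs in trace:
        res = "mem" if kind in ("load", "store") else "alu"
        limit = mem_ports if res == "mem" else alus
        # Earliest cycle allowed by data dependencies (unit latency assumed).
        cycle = max((ready_at[s] for s in srcs), default=0)
        # Delay until a free port or execution unit is available.
        while used.get((cycle, res), 0) >= limit:
            cycle += 1
        used[(cycle, res)] = used.get((cycle, res), 0) + 1
        ready_at[dest] = cycle + 1
        last = max(last, cycle + 1)
    return last

def cost(cfg):
    """Area-delay style cost; the area weights are made up for illustration."""
    mem_ports, alus = cfg
    area = 4 + 2 * mem_ports + alus
    return schedule_cycles(TRACE, mem_ports, alus) * area

def explore(generations=20, pop_size=8, max_units=4, seed=0):
    """Small elitist genetic algorithm over (mem_ports, alus) configurations."""
    rng = random.Random(seed)
    # Seed the population with the minimal configuration as a baseline.
    pop = [(1, 1)] + [(rng.randint(1, max_units), rng.randint(1, max_units))
                      for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]          # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = [p[0], q[1]]                # crossover: one gene from each parent
            i = rng.randrange(2)                # mutate one gene by +/-1, clamped
            child[i] = min(max_units, max(1, child[i] + rng.choice((-1, 1))))
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=cost)

if __name__ == "__main__":
    best = explore()
    print("best (mem_ports, alus):", best, "cost:", cost(best))
```

Because the better half of the population is always kept, the returned configuration is never worse than the baseline `(1, 1)` under this toy cost. A realistic exploration would replace `schedule_cycles` and `cost` with trace-driven performance, area, and power models.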
