Design of massively parallel hardware multi-processors for highly-demanding embedded applications

Many new embedded applications require complex computations to be performed on tight schedules, while also demanding low energy consumption and low cost. Implementing these highly demanding applications requires highly optimized application-specific multi-processor systems-on-a-chip (MPSoCs) that include hardware multi-processors to execute the critical computations. The design of a multi-processor accelerator for such applications has to resolve several difficult issues. Since the processors' micro- and macro-architectures and the memory and communication architectures are strongly interrelated, they have to be designed in combination. Complex mutual tradeoffs have to be resolved among the processor micro- and macro-architectures and the corresponding memory and communication architectures, as well as among performance, power consumption, and area. Unfortunately, the design methods and tools published to date do not address most of the design issues of massively parallel hardware multi-processor accelerators. This paper discusses our novel quality-driven, model-based multi-processor accelerator design method, which addresses the architecture design issues of hardware multi-processors for modern highly demanding embedded applications. Using the design of LDPC decoders for the latest high-speed communication system standards as an example application, we performed extensive experimental research on the multi-processor design issues, and on our method and its design space exploration (DSE) framework. The experiments clearly demonstrated the existence of various complex architecture tradeoffs that could only be resolved through an adequate quality-driven combined design space exploration of the processors' micro- and macro-architectures and the corresponding memory and communication architectures, as delivered by our method.
