Scalable Computing: Practice and Experience

We present a task-parallel asynchronous API for numerical linear algebra that can use multiple CPUs, multiple GPUs, or a combination of both. We also present a MATLAB wrapper for this interface. Our API imposes only small overheads, scales perfectly to two processor cores, and performs even better when it also uses the computational resources of the GPU.
