GP-SIMD Processing-in-Memory

GP-SIMD, a novel hybrid general-purpose SIMD computer architecture, resolves the issue of data synchronization through in-memory computing, combining data storage and massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and a bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented and compared with models of the associative processor and of conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
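As a rough illustration of what "a bit-serial processing unit per memory row" implies, the following minimal Python sketch adds two operands stored in every row, one bit position per step. It is a toy model under stated assumptions, not the GP-SIMD microarchitecture: the function name `bit_serial_add`, the operand layout, and the single carry bit per row are all hypothetical choices made for the example.

```python
def bit_serial_add(a_vals, b_vals, width=8):
    """Toy model: add a_vals[r] + b_vals[r] in every row r, modulo 2**width.

    Each row is assumed to own a 1-bit processing unit with a carry register.
    The outer loop is serial over bit positions; the inner loop stands in for
    work that would happen in all rows at once in hardware.
    """
    rows = len(a_vals)
    carry = [0] * rows   # hypothetical per-row carry bit
    result = [0] * rows
    for bit in range(width):          # one step per bit position
        for r in range(rows):         # conceptually parallel across rows
            a = (a_vals[r] >> bit) & 1
            b = (b_vals[r] >> bit) & 1
            s = a ^ b ^ carry[r]                       # full-adder sum bit
            carry[r] = (a & b) | (carry[r] & (a ^ b))  # full-adder carry out
            result[r] |= s << bit
    return result


sums = bit_serial_add([1, 2, 255], [3, 4, 1], width=8)
```

In this model the number of sequential steps equals the word width and does not grow with the number of rows, which is the intuition behind pairing a simple processing unit with each memory row.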
