Paper Information - GP-SIMD Processing-in-Memory

GP-SIMD Processing-in-Memory

GP-SIMD is a novel hybrid general-purpose SIMD computer architecture that resolves the issue of data synchronization through in-memory computing, combining data storage with massively parallel processing. GP-SIMD employs a two-dimensional access memory with modified SRAM storage cells and one bit-serial processing unit per memory row. An analytic performance model of the GP-SIMD architecture is presented, comparing it to an associative processor and to conventional SIMD architectures. Cycle-accurate simulation of four workloads supports the analytical comparison. Assuming a moderate die area, the GP-SIMD architecture outperforms both the associative processor and conventional SIMD coprocessor architectures by almost an order of magnitude while consuming less power.
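To make the idea of one bit-serial processing unit per memory row more concrete, the following is a minimal sketch of generic bit-serial vector addition. It is not the GP-SIMD microarchitecture or instruction set, which the abstract does not describe. The function name `bit_serial_add`, the list-of-integers memory model, and the per-row carry register are all assumptions made for illustration.

```python
def bit_serial_add(a_rows, b_rows, width):
    """Element-wise addition of two vectors, one bit position per step.

    Hypothetical model: each list index stands for one memory row that
    owns a 1-bit processing unit with a private carry flip-flop.
    """
    n = len(a_rows)
    carry = [0] * n    # one carry bit per row (assumed register)
    result = [0] * n

    for bit in range(width):      # sequential: one step per bit position
        for row in range(n):      # conceptually parallel across all rows
            a = (a_rows[row] >> bit) & 1
            b = (b_rows[row] >> bit) & 1
            # 1-bit full adder
            s = a ^ b ^ carry[row]
            carry[row] = (a & b) | (carry[row] & (a ^ b))
            result[row] |= s << bit

    return result, carry


sums, carries = bit_serial_add([3, 5, 255], [4, 6, 1], width=8)
print(sums, carries)
```

In this sketch the inner loop over rows stands in for work that hardware of this kind would perform in all rows simultaneously, so the number of sequential steps grows with the operand width and not with the number of rows.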
