Data-Centric Computing Frontiers: A Survey On Processing-In-Memory

A major shift from compute-centric to data-centric computing systems is under way, as novel big data workloads such as cognitive computing and machine learning demand embarrassingly parallel, highly efficient processor architectures. With Moore's law coming to an end, innovative architectural concepts and technologies are urgently required to open a path toward exascale and beyond, even as current computing systems run into the instruction-level parallelism, power, memory, and bandwidth walls. Memory, a part of every computing system, is generally perceived as unreliable, power-hungry, and slow, which makes it a likely future bottleneck. The slowness stems from a pin limitation imposed by packaging constraints: a tremendous row bandwidth is available inside the memory but remains unexploited, because off-chip it shrinks to a bare minimum. Given the shift toward data-centric computing systems, the near-memory processing concept seems most promising: power efficiency and computing performance increase when tasks are co-located on bandwidth-rich in-memory processing units, and data movement is reduced by bypassing entire memory hierarchies. Considering near-data processing as the umbrella term for the urgently required breakthrough for future computing systems, this survey presents its variants with a special emphasis on Processing-In-Memory (PIM), highlighting historical achievements in technology and architecture and discussing its advantages and obstacles.
