Nectarios Koziris | Onur Mutlu | Vasileios Karakostas | Nandita Vijaykumar | Lois Orosa | Christina Giannoula | Georgios Goumas | Juan Gómez-Luna | Nikela Papadopoulou | Ivan Fernandez
[1] Onur Mutlu,et al. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[2] Rachata Ausavarungnirun,et al. Processing Data Where It Makes Sense: Enabling In-Memory Computation , 2019, Microprocess. Microsystems.
[3] Dan Alistarh,et al. The SprayList: a scalable relaxed priority queue , 2015, PPoPP.
[4] D. Lenoski,et al. The SGI Origin: A ccNUMA Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[5] Josep Torrellas,et al. WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication , 2016, ASPLOS.
[6] Sanjay J. Patel,et al. Cohesion: a hybrid memory model for accelerators , 2010, ISCA.
[7] Babak Falsafi,et al. The mondrian data engine , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[8] Zhimin Zhang,et al. Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software Co-Design Approach , 2019, MICRO.
[9] Dong Li,et al. Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[10] Stefanos Kaxiras,et al. Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[11] David R. Kaeli,et al. HQL: A Scalable Synchronization Mechanism for GPUs , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[12] Milos Prvulovic,et al. MiSAR: Minimalistic synchronization accelerator with resource overflow management , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[13] Norman P. Jouppi,et al. Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[14] Christoforos E. Kozyrakis,et al. ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.
[15] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.
[16] Onur Mutlu,et al. Data marshaling for multi-core architectures , 2010, ISCA.
[17] Ralph Grishman,et al. The NYU ultracomputer—designing a MIMD, shared-memory parallel machine , 2018, ISCA '98.
[18] Christoforos E. Kozyrakis,et al. GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[19] Mingyu Gao,et al. HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[20] Maya Gokhale,et al. Near memory data structure rearrangement , 2015, MEMSYS.
[21] Dimin Niu,et al. iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture , 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[22] Sander Stuijk,et al. NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling , 2020, 2020 30th International Conference on Field-Programmable Logic and Applications (FPL).
[23] William J. Dally,et al. The message-driven processor: a multicomputer processing node with efficient mechanisms , 1992, IEEE Micro.
[24] Stefanos Kaxiras,et al. Callback: Efficient synchronization without invalidation with a directory just for spin-waiting , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[25] Sander Stuijk,et al. NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning , 2019, 2019 56th ACM/IEEE Design Automation Conference (DAC).
[26] Rachata Ausavarungnirun,et al. A Modern Primer on Processing in Memory , 2020, ArXiv.
[27] Daniel Sánchez,et al. Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[28] Kunle Olukotun,et al. Simplifying Scalable Graph Processing with a Domain-Specific Language , 2014, CGO '14.
[29] Wenguang Chen,et al. pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing , 2019, ASPLOS.
[30] Matthias S. Müller,et al. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[31] Christoforos E. Kozyrakis,et al. Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[32] Sarita V. Adve,et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[33] Torsten Hoefler,et al. A Survey of Barrier Algorithms for Coarse Grained Supercomputers , 2005 .
[34] R. E. Kessler,et al. Cray T3D: a new dimension for Cray Research , 1993, Digest of Papers. Compcon Spring.
[35] Reena Panda,et al. Data partitioning strategies for graph workloads on heterogeneous clusters , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[36] Haibo Chen,et al. Scalable Adaptive NUMA-Aware Lock , 2017, IEEE Transactions on Parallel and Distributed Systems.
[37] Onur Mutlu,et al. Utility-based acceleration of multithreaded applications on asymmetric CMPs , 2013, ISCA.
[38] Allan Porterfield,et al. The Tera computer system , 1990, ICS '90.
[39] Henk Corporaal,et al. Fine-Grained Synchronizations and Dataflow Programming on GPUs , 2015, ICS.
[40] Nectarios Koziris,et al. Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures , 2019, SC.
[41] Anoop Gupta,et al. The Stanford Dash multiprocessor , 1992, Computer.
[42] Jaeha Kim,et al. Memory-centric system interconnect design with Hybrid Memory Cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[43] Tudor David,et al. Everything you always wanted to know about synchronization but were afraid to ask , 2013, SOSP.
[44] Kiyoung Choi,et al. A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[45] Anoop Gupta,et al. A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols , 2022 .
[46] Omer Khan,et al. CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores , 2015, 2015 IEEE International Symposium on Workload Characterization.
[47] Oscar Plata,et al. NATSA: A Near-Data Processing Accelerator for Time Series Analysis , 2020, 2020 IEEE 38th International Conference on Computer Design (ICCD).
[48] Nectarios Koziris,et al. An adaptive concurrent priority queue for NUMA architectures , 2019, CF.
[49] Larry Rudolph,et al. Dynamic decentralized cache schemes for MIMD parallel processors , 1984, ISCA '84.
[50] Sudhakar Yalamanchili,et al. Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).
[51] Sander Stuijk,et al. Near-Memory Computing: Past, Present, and Future , 2019, Microprocess. Microsystems.
[52] Jaejin Lee,et al. 25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).
[53] Maurice Herlihy,et al. Concurrent Data Structures for Near-Memory Computing , 2017, SPAA.
[54] Josep Torrellas,et al. Survive: Pointer-Based In-DRAM Incremental Checkpointing for Low-Cost Data Persistence and Rollback-Recovery , 2017, IEEE Computer Architecture Letters.
[55] Robert Tappan Morris,et al. An Analysis of Linux Scalability to Many Cores , 2010, OSDI.
[56] Onur Mutlu,et al. Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[57] Fabrice Devaux,et al. The true Processing In Memory accelerator , 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[58] Milos Prvulovic,et al. TLSync: Support for multiple fast barriers using on-chip transmission lines , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[59] P.T. Wolkotte,et al. Energy Model of Networks-on-Chip and a Bus , 2005, 2005 International Symposium on System-on-Chip.
[60] Vladimir Vlassov,et al. Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server , 2015, 2015 IEEE Fifth International Conference on Big Data and Cloud Computing.
[61] Dirk Grunwald,et al. Efficient barriers for distributed shared memory computers , 1994, Proceedings of 8th International Parallel Processing Symposium.
[62] Maurice Herlihy,et al. Concurrent Data Structures with Near-Data-Processing: an Architecture-Aware Implementation , 2019, SPAA.
[63] Michael L. Scott,et al. Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.
[64] Nectarios Koziris,et al. Combining HTM with RCU to Speed Up Graph Coloring on Multicore Platforms , 2018, ISC.
[65] Niraj K. Jha,et al. GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[66] Onur Mutlu,et al. Demystifying Complex Workload-DRAM Interactions: An Experimental Study , 2019, SIGMETRICS.
[67] Tejas Karkhanis,et al. Active Memory Cube: A processing-in-memory architecture for exascale systems , 2015, IBM J. Res. Dev.
[68] U. Narayan Bhat,et al. An Introduction to Queueing Theory: Modeling and Analysis in Applications , 2006 .
[69] Mary K. Vernon,et al. Efficient synchronization primitives for large-scale cache-coherent multiprocessors , 1989, ASPLOS III.
[70] Leslie Lamport. A New Solution of Dijkstra's Concurrent Programming Problem , 1974, Commun. ACM.
[71] Travis S. Craig,et al. Building FIFO and Priority-Queuing Spin Locks from Atomic Swap , 1993 .
[72] José Ignacio Benavides Benítez,et al. Performance Modeling of Atomic Additions on GPU Scratchpad Memory , 2013, IEEE Transactions on Parallel and Distributed Systems.
[73] Vivien Quéma,et al. Multicore Locks: The Case Is Not Closed Yet , 2016, USENIX Annual Technical Conference.
[74] Onur Mutlu,et al. Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).
[75] José L. Abellán,et al. GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[76] Onur Mutlu,et al. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[77] Kai Wang,et al. Fast Fine-Grained Global Synchronization on GPUs , 2019, ASPLOS.
[78] Stefanos Kaxiras,et al. SARC Coherence: Scaling Directory Cache Coherence in Performance and Power , 2010, IEEE Micro.
[79] Stefanos Kaxiras,et al. Turning Centralized Coherence and Distributed Critical-Section Execution on their Head: A New Approach for Scalable Distributed Shared Memory , 2015, HPDC.
[80] Onur Mutlu,et al. GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies , 2017, BMC Genomics.
[81] William J. Dally,et al. Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture.
[82] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[83] Feifei Li,et al. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[84] Vipin Kumar,et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput.
[85] Maleen Abeydeera,et al. Chronos: Efficient Speculative Parallelism for Accelerators , 2020, ASPLOS.
[86] Maurice Herlihy,et al. The Art of Multiprocessor Programming , 2008 .
[87] Mateo Valero,et al. Architectural Support for Fair Reader-Writer Locking , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[88] Rachid Guerraoui,et al. Optimistic concurrency with OPTIK , 2016, PPoPP.
[89] Liu Liu,et al. Leveraging 3D technologies for hardware security: Opportunities and challenges , 2016, 2016 International Great Lakes Symposium on VLSI (GLSVLSI).
[90] Zhen Fang,et al. Highly efficient synchronization based on active memory operations , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
[91] Debra Hensgen,et al. Two algorithms for barrier synchronization , 1988, International Journal of Parallel Programming.
[92] Tudor David,et al. Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures , 2015, ASPLOS.
[93] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[94] Leslie Lamport,et al. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.
[95] Guang R. Gao,et al. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures , 2007, ISCA '07.
[96] Sarita V. Adve,et al. DeNovoND: efficient hardware support for disciplined non-determinism , 2013, ASPLOS '13.
[97] Rachata Ausavarungnirun,et al. CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).
[98] Onur Mutlu,et al. Bottleneck identification and scheduling in multithreaded applications , 2012, ASPLOS XVII.
[99] Onur Mutlu,et al. Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.
[100] Erik Hagersten,et al. Queue locks on cache coherent multiprocessors , 1994, Proceedings of 8th International Parallel Processing Symposium.
[101] Gerard J. M. Smit,et al. Portable Memory Consistency for Software Managed Distributed Memory in Many-Core SoC , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[102] Nathan R. Tallent,et al. Analyzing lock contention in multithreaded applications , 2010, PPoPP '10.
[103] Donald Yeung,et al. The MIT Alewife machine: architecture and performance , 1995, ISCA '98.
[104] Tor M. Aamodt,et al. Warp Scheduling for Fine-Grained Synchronization , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[105] Onur Mutlu,et al. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations , 2019, MICRO.
[106] Nir Shavit,et al. Flat-combining NUMA locks , 2011, SPAA '11.
[107] Nathan Beckmann,et al. PHI: Architectural Support for Synchronization- and Bandwidth-Efficient Commutative Scatter Updates , 2019, MICRO.
[108] Nectarios Koziris,et al. RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[109] Thomas E. Anderson,et al. The Performance Implications of Spin-Waiting Alternatives for Shared-Memory Multiprocessors , 1989, ICPP.
[110] Eamonn J. Keogh,et al. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets , 2016, 2016 IEEE 16th International Conference on Data Mining (ICDM).
[111] Harry F. Jordan. Performance measurements on HEP - a pipelined MIMD computer , 1983, ISCA '83.
[112] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[113] Michael L. Scott,et al. Synchronization without contention , 1991, ASPLOS IV.
[114] Gu-Yeon Wei,et al. Co-designing accelerators and SoC interfaces using gem5-Aladdin , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[115] David A. Wood,et al. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[116] John M. Mellor-Crummey,et al. High performance locks for multi-level NUMA systems , 2015, PPoPP.
[117] Emmett Kilgariff,et al. Fermi GF100 GPU Architecture , 2011, IEEE Micro.
[118] Yanzhi Wang,et al. GraphQ: Scalable PIM-Based Graph Processing , 2019, MICRO.
[119] Eran Yahav,et al. Practical concurrent binary search trees via logical ordering , 2014, PPoPP '14.
[120] David R. Kaeli,et al. Profiling DNN Workloads on a Volta-based DGX-1 System , 2018, 2018 IEEE International Symposium on Workload Characterization (IISWC).
[121] Rachata Ausavarungnirun,et al. GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis , 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[122] Stefanos Kaxiras,et al. A new perspective for efficient virtual-cache coherence , 2013, ISCA.
[123] Michael L. Scott,et al. Non-blocking timeout in scalable queue-based spin locks , 2002, PODC '02.
[124] Steven L. Scott,et al. Synchronization and communication in the T3E multiprocessor , 1996, ASPLOS VII.
[125] Maged M. Michael,et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms , 1996, PODC '96.
[126] Chris Fallin,et al. Parallel application memory scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[127] David A. Wood,et al. A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.
[128] Onur Mutlu,et al. Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.
[129] B. J. Smith,et al. A pipelined, shared resource MIMD computer , 1986 .
[130] Steven Swanson,et al. Near-Data Processing: Insights from a MICRO-46 Workshop , 2014, IEEE Micro.
[131] W. Daniel Hillis,et al. The network architecture of the Connection Machine CM-5 (extended abstract) , 1992, SPAA '92.
[132] Rachata Ausavarungnirun,et al. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.
[133] Sanjay J. Patel,et al. WAYPOINT: scaling coherence to thousand-core architectures , 2010, PACT '10.
[134] Sudhakar Yalamanchili,et al. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[135] Jaehwan Lee,et al. A system-on-a-chip lock cache with task preemption support , 2001, CASES '01.
[136] Nir Shavit,et al. A Hierarchical CLH Queue Lock , 2006, Euro-Par.
[137] Ramyad Hadidi,et al. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[138] Onur Mutlu,et al. Processing-in-memory: A workload-driven perspective , 2019, IBM J. Res. Dev.
[139] Dominique Lavenier,et al. DNA mapping using Processor-in-Memory architecture , 2016, 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
[140] Onur Mutlu,et al. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory , 2017, IEEE Computer Architecture Letters.
[141] José L. Abellán,et al. A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs , 2010, 2010 39th International Conference on Parallel Processing.
[142] William Pugh,et al. Concurrent maintenance of skip lists , 1990 .
[143] Volker Lohweg,et al. Survey on time series motif discovery , 2017, WIREs Data Mining Knowl. Discov.
[144] Mahmut T. Kandemir,et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[145] Vladimir Vlassov,et al. Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters , 2016, 2016 IEEE/ACM 3rd International Conference on Big Data Computing Applications and Technologies (BDCAT).