On Improving Efficiency and Utilization of Last Level Cache in Multicore Systems

With the growing demand for computational power, the trend towards multicore processors is ubiquitous. Current on-chip architectures comprise multiple cores that usually share a last-level cache, which can be physically distributed across the chip. To provide system predictability, especially in real-time systems where quality of service (QoS) depends on minimal miss rates and low worst-case execution time (WCET) for applications running on different cores, efficient cache management techniques are required. Because the memory hierarchy and its management are key to overall system performance, and because each off-chip memory access costs many clock cycles and considerable power, it is important to limit off-chip accesses and to optimize on-chip accesses. Various techniques have been proposed to increase performance and energy efficiency. This article provides researchers with a critical review of state-of-the-art approaches that focus on data replication and cache partitioning techniques for the L3 cache. The existing literature is classified according to design and algorithm. Maintaining an energy-efficient system is a crucial challenge for multicore processors, so the article also discusses techniques that scale up performance without compromising energy efficiency. Finally, the article evaluates caches and/or various processors for high-performance applications such as bioinformatics, image and video processing, DSP, and IoT. DOI: http://dx.doi.org/10.5755/j01.itc.47.3.18433
