Measuring memory access latency for software objects in a NUMA system-on-chip architecture

We consider streaming applications modeled as a set of tasks communicating via channels. These channels are mapped to the on-chip memory of a multi-processor system-on-chip (MPSoC) with non-uniform memory access (NUMA). In complex applications such as advanced packet processing and video streaming, often only part of the data transits through the channels. Tasks also communicate via shared memory, which may require synchronization mechanisms such as locks and barriers. The effects of I/O on interconnect traffic must also be taken into account; together, these factors increase traffic to and from memory. Our clustered MPSoC architecture is modeled with SoCLib. SoCLib's design space exploration tool offers, among others, communication channels and shared memory for inter-task communication. Each consists of one or several software objects, which are mapped to on-chip memory. The difficulty when measuring latency is determining which (co-)processor issued a request for a particular software object. We intervene early in the design process by monitoring the transfers on the interconnection network caused by accesses to these software objects. We identify the software objects by name and trace the corresponding memory accesses. Despite the cycle-accurate, bit-accurate level of simulation, our method has little overhead and avoids distorting the performance results.
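The attribution step described above, in which each observed transfer is matched to a named software object and to the (co-)processor that issued it, can be illustrated with a minimal Python sketch. It is not the SoCLib implementation. The record layouts (`SoftwareObject`, `Transfer`), the `ObjectMap` class, and the `latency_by_object` function are hypothetical names introduced here, and the sketch assumes that the monitor already provides an initiator identifier, a target address, and issue and response cycle counts for every transfer.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareObject:
    """A named software object (channel, lock, barrier, ...) mapped to memory."""
    name: str
    base: int   # first byte address of the object (assumed known from the mapping)
    size: int   # size in bytes


@dataclass(frozen=True)
class Transfer:
    """One interconnect transfer as a monitor might log it (assumed format)."""
    initiator: str       # identifier of the issuing (co-)processor
    address: int         # target address of the request
    issue_cycle: int     # cycle at which the request was issued
    response_cycle: int  # cycle at which the response arrived


class ObjectMap:
    """Looks up the software object that owns an address, if any."""

    def __init__(self, objects):
        # Sort by base address so lookups can use binary search.
        self._objects = sorted(objects, key=lambda o: o.base)
        self._bases = [o.base for o in self._objects]

    def lookup(self, address):
        i = bisect_right(self._bases, address) - 1
        if i < 0:
            return None
        obj = self._objects[i]
        return obj if address < obj.base + obj.size else None


def latency_by_object(object_map, transfers):
    """Return the mean latency in cycles per (object name, initiator) pair."""
    samples = defaultdict(list)
    for t in transfers:
        obj = object_map.lookup(t.address)
        if obj is None:
            continue  # access does not target a monitored software object
        samples[(obj.name, t.initiator)].append(t.response_cycle - t.issue_cycle)
    return {key: sum(vals) / len(vals) for key, vals in samples.items()}


if __name__ == "__main__":
    # Illustrative addresses and cycle counts only; not measured data.
    objects = ObjectMap([
        SoftwareObject("fifo0", 0x1000, 0x100),
        SoftwareObject("lock0", 0x2000, 4),
    ])
    trace = [
        Transfer("cpu0", 0x1004, 10, 22),
        Transfer("cpu0", 0x1008, 20, 36),
        Transfer("cpu1", 0x2000, 5, 45),
        Transfer("cpu1", 0x3000, 50, 60),  # not inside any monitored object
    ]
    for (name, initiator), mean in sorted(latency_by_object(objects, trace).items()):
        print(f"{name} accessed by {initiator}: {mean:.1f} cycles on average")
```

Keying the results on both the object name and the initiator keeps the per-processor view that a NUMA analysis needs, since the same object can show very different latencies depending on which cluster issues the request.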
