Hardware Support for Concurrent Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architectures

Detecting concurrency bugs, such as data races, atomicity violations, and order violations, is a cumbersome task for programmers. The situation is further exacerbated by the increasing number of cores in a single machine and the prevalence of threaded programming models. Unfortunately, many existing software-based approaches incur high runtime overhead or accuracy loss, while most hardware-based proposals focus on a single type of bug and are therefore too inflexible to detect a variety of concurrency bugs. In this paper, we propose Hydra, an approach that leverages the massive parallelism and programmability of fused CPU-GPU architectures to simultaneously detect multiple concurrency bugs in threaded software, including data races, atomicity violations, and order violations. Hydra extends a contemporary fused CPU-GPU design with two modules: 1) a trace collecting module (TCM) that instruments and collects program behavior on the CPU; and 2) a trace preprocessing module (TPM) that processes the traces and transfers them to the GPU for bug detection. Furthermore, Hydra exploits three optimizations to improve speed and accuracy: 1) using a Bloom filter to filter out unnecessary traces; 2) avoiding eviction of shared traces; and 3) comparing only last-write traces for shared data under the happens-before relation. Hydra incurs low hardware complexity, requires no changes to internal critical-path processor components such as the cache and its coherence protocol, and adds about 1.1 percent hardware overhead under a 32-core configuration. Experimental results show that Hydra introduces only about 0.18 percent overhead on average for detecting one type of bug and 0.46 percent overhead for simultaneously detecting multiple bug types, while achieving detectability similar to that of a heavyweight software bug detector (e.g., Helgrind).
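Two of the optimizations above, filtering non-shared accesses with a Bloom filter and comparing only last-write traces under the happens-before relation, can be illustrated in software. The following minimal Python sketch is an illustration only, not Hydra's hardware design: all names (`BloomFilter`, `happens_before`, the sample addresses and vector clocks) are hypothetical, and the happens-before check uses standard Lamport-style vector clocks.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set-membership with possible false positives
    but never false negatives (Bloom, 1970)."""
    def __init__(self, nbits=1024, nhashes=3):
        self.bits = [False] * nbits
        self.nbits = nbits
        self.nhashes = nhashes

    def _positions(self, item):
        # Derive nhashes bit positions from salted SHA-256 digests.
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.nbits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

def happens_before(vc_a, vc_b):
    """vc_a happens-before vc_b iff vc_a <= vc_b componentwise and
    vc_a != vc_b (Lamport-style vector clocks)."""
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

# Filtering idea: a per-thread Bloom filter over accessed addresses lets
# another thread cheaply skip traces for addresses the first thread
# definitely never touched (a False answer is always safe to filter on).
addrs_touched_by_t0 = BloomFilter()
addrs_touched_by_t0.add(0x1000)  # thread 0 wrote address 0x1000

# Last-write idea: for an address both threads touched, compare only the
# last write against the current access under happens-before.
last_write_vc = (1, 0)       # thread 0's clock at its last write to 0x1000
ordered_read_vc = (2, 1)     # thread 1 read after synchronizing with thread 0
concurrent_read_vc = (0, 1)  # thread 1 read with no ordering vs. the write

assert happens_before(last_write_vc, ordered_read_vc)         # ordered: no race
assert not happens_before(last_write_vc, concurrent_read_vc)  # potential race
```

In this sketch a `False` from `might_contain` safely filters a trace out, since Bloom filters admit no false negatives; a `True` merely forwards the trace for the (more expensive) happens-before comparison, mirroring the role of the filter in the pipeline described above.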
