Optimizing Caching DSM for Distributed Software Speculation

Clusters with caching DSMs deliver programmability and performance by supporting shared-memory programming and tolerating remote I/O latencies via caching. The input to a data-parallel program is partitioned across the cluster while the DSM transparently fetches and caches remote data as needed. Irregular applications, however, are challenging to parallelize because input-dependent data dependences that manifest only at runtime require speculation for correct parallel execution. By speculating that no input-dependent cross-iteration dependences exist, the loop is parallelized so that each iteration processes private copies of the input, and the absence of dependences is validated before the computed results are committed. We show that while caching helps tolerate long communication latencies in irregular data-parallel applications, using a cached value in a computation can lead to misspeculation, so aggressive caching can degrade performance by increasing the misspeculation rate. We present optimizations for distributed speculation on caching-based DSMs that decrease the cost of misspeculation checks and speed up the re-execution of misspeculated computations. Optimized distributed speculation achieves speedups over unoptimized speculation of 2.24x for coloring, 1.71x for connected components, 1.88x for community detection, 1.32x for shortest path, and 1.74x for PageRank.

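The speculate-validate-commit pattern described above can be illustrated with a short sketch. The following C++ code is a minimal, hypothetical illustration and not the paper's DSM implementation; names such as Element, SpecRecord, and speculative_run are our own. Each iteration computes into a private record while logging the versions of the values it read, and a validate-and-commit pass re-executes any iteration whose logged reads were invalidated by an earlier commit.

```cpp
// Minimal sketch of loop-level software speculation over a toy shared array.
// Assumed names (Element, SpecRecord, speculative_run) are illustrative only.
#include <atomic>
#include <cstdio>
#include <utility>
#include <vector>

struct Element {
    double value = 0.0;
    std::atomic<long> version{0};   // bumped on every committed write
};

struct SpecRecord {
    int read_idx = 0;        // element this iteration read
    long read_version = 0;   // version observed at read time
    int write_idx = 0;       // element this iteration updates
    double new_value = 0.0;  // privately computed result
};

void speculative_run(std::vector<Element>& data,
                     const std::vector<std::pair<int, int>>& edges) {
    std::vector<SpecRecord> log(edges.size());

    // Speculative phase: each iteration computes into its private record.
    // (In a real system these iterations would run in parallel across the
    // cluster; they are serialized here only to keep the sketch short.)
    for (size_t i = 0; i < edges.size(); ++i) {
        int src = edges[i].first;
        int dst = edges[i].second;
        log[i] = {src, data[src].version.load(), dst,
                  data[src].value + 1.0};       // toy computation
    }

    // Validation and commit phase, in iteration order.
    for (size_t i = 0; i < edges.size(); ++i) {
        SpecRecord& r = log[i];
        if (data[r.read_idx].version.load() != r.read_version) {
            // Misspeculation: an earlier commit overwrote what we read;
            // re-execute the iteration against the now-committed value.
            r.new_value = data[r.read_idx].value + 1.0;
        }
        data[r.write_idx].value = r.new_value;
        data[r.write_idx].version.fetch_add(1);
    }
}

int main() {
    std::vector<Element> data(4);
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}, {0, 2}};
    speculative_run(data, edges);
    for (size_t i = 0; i < data.size(); ++i)
        std::printf("node %zu: %.1f\n", i, data[i].value);
    return 0;
}
```

In the example, the second iteration reads an element written by the first, so its logged version no longer matches at validation time and it is re-executed before committing; in a caching DSM, reading a stale cached copy instead of the committed value is exactly what raises this misspeculation rate.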