Finding the 'Needle': Locating Interesting Nodes Using the K-shortest Paths Algorithm in MapReduce

Understanding how nodes interconnect in large graphs is an important problem in many fields. We wish to find connecting nodes between two nodes or two groups of source nodes. In order to find these connecting nodes in huge graphs, we have devised a highly parallelized variant of a k-shortest path algorithm that levies the power of the Hadoop distributed computing system and HBase distributed key/value store. We show how our system enables previously unobtainable graph analysis by finding these connecting nodes in graphs as large as one billion nodes or more on modest commodity hardware in a time frame of just minutes.

[1]  Duncan J. Watts,et al.  Collective dynamics of ‘small-world’ networks , 1998, Nature.

[2]  P. Bonacich Power and Centrality: A Family of Measures , 1987, American Journal of Sociology.

[3]  Leonard M. Freeman,et al.  A set of measures of centrality based upon betweenness , 1977 .

[4]  David A. Bader,et al.  Techniques for Designing Efficient Parallel Graph Algorithms for SMPs and Multicore Processors , 2007, ISPA.

[5]  Jianling Sun,et al.  Scalable RDF store based on HBase and MapReduce , 2010, 2010 3rd International Conference on Advanced Computer Theory and Engineering(ICACTE).

[6]  M. C. Sinclair,et al.  A Comparative Study of k-Shortest Path Algorithms , 1996 .

[7]  Stuart E. Dreyfus,et al.  An Appraisal of Some Shortest-Path Algorithms , 1969, Oper. Res..

[8]  David Eppstein,et al.  Finding the k Shortest Paths , 1999, SIAM J. Comput..

[9]  Jimmy J. Lin,et al.  Design patterns for efficient graph algorithms in MapReduce , 2010, MLG '10.

[10]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[11]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[12]  Bhavani M. Thuraisingham,et al.  Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce , 2009, CloudCom.

[13]  Alekh Jindal,et al.  Hadoop++ , 2010 .

[14]  L. Freeman Centrality in social networks conceptual clarification , 1978 .

[15]  Jon M. Kleinberg,et al.  The small-world phenomenon: an algorithmic perspective , 2000, STOC '00.

[16]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[17]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[18]  Ana Paula Appel,et al.  HADI: Mining Radii of Large Graphs , 2011, TKDD.

[19]  Yon Dohn Chung,et al.  SPIDER: a system for scalable, parallel / distributed evaluation of large-scale RDF data , 2009, CIKM.

[20]  J. Y. Yen,et al.  Finding the K Shortest Loopless Paths in a Network , 2007 .