In-database connected component analysis

We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components of graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm is a linear-space, randomised algorithm that always terminates with the correct answer but has a stochastic running time: for any $\epsilon > 0$ and any input graph $G = \langle V, E\rangle$, it terminates after $O(\log |V|)$ SQL queries with probability of at least $1 - \epsilon$, which we show empirically to translate to a quasi-linear runtime in practice.
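The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of one generic randomised contraction scheme of the kind it describes, not necessarily the authors' method. It is written in Python rather than SQL so that it runs stand-alone. The function name `connected_components`, the `seed` parameter, and the per-round random ranking are assumptions made for illustration. Each round uses only operations with a direct relational counterpart (a grouped minimum, a join to relabel edges, and a de-duplicating filter), and the answer is always correct while the number of rounds depends on the random choices.

```python
import random


def connected_components(vertices, edges, seed=0):
    """Label every vertex with a representative of its connected component.

    Hypothetical sketch: each round, every node adopts the lowest-ranked node
    in its closed neighbourhood under a fresh random ranking, and the graph is
    contracted accordingly. Returns (labels, number_of_rounds).
    """
    rng = random.Random(seed)
    # Current representative of each original vertex.
    label = {v: v for v in vertices}
    # Working edge set over representatives, stored in both directions,
    # with self-loops dropped.
    work = set()
    for a, b in edges:
        if a != b:
            work.add((a, b))
            work.add((b, a))

    rounds = 0
    while work:
        rounds += 1
        nodes = {a for a, _ in work}
        # Fresh random order in every round (the source of the stochastic
        # running time).
        rank = {v: rng.random() for v in nodes}
        # Grouped minimum: each node picks the lowest-ranked node among
        # itself and its neighbours (SQL analogue: GROUP BY with MIN).
        rep = {v: v for v in nodes}
        for a, b in work:
            if rank[b] < rank[rep[a]]:
                rep[a] = b
        # Contraction: relabel both endpoints (JOIN), drop self-loops
        # (WHERE), and de-duplicate (DISTINCT, here via the set).
        work = {(rep[a], rep[b]) for a, b in work if rep[a] != rep[b]}
        # Compose this round's mapping into the overall vertex labelling.
        label = {v: rep.get(r, r) for v, r in label.items()}
    return label, rounds
```

Every round removes at least one node from each component that still has an edge, because the highest-ranked node of such a component is never chosen as a representative. The loop therefore always terminates, and two vertices end with the same label exactly when they lie in the same component.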
