Efficient and Scalable Algorithms for Network Motifs Discovery

Networks are a powerful representation for a multitude of natural and artificial systems. Networks that present substantial non-trivial topological features are called complex networks; they are ubiquitous in real-world systems and have received increasing attention in recent years. The concept of network motifs emerged as a way to understand their design principles. Motifs are recurrent, over-represented patterns of interconnections, conjectured to have some significance, that can be seen as basic building blocks of networks. Algorithmically, discovering network motifs is a hard problem related to graph isomorphism. The execution time needed grows exponentially as the size of the networks or of the motifs increases, which limits the applicability of motif analysis. Since motifs are a fundamental concept, detecting them more efficiently can lead to new insights in several areas of knowledge. Developing efficient and scalable algorithms for motif discovery is precisely the main aim of this thesis.

We provide a thorough survey of existing methods, complete with an associated chronology, taxonomy, algorithmic description, and empirical evaluation and comparison. We propose a novel data structure, the g-trie, designed to represent a collection of graphs. Akin to a prefix tree, it takes advantage of common substructures both to reduce the memory needed to store the graphs and to produce a new, more efficient sequential algorithm for computing their frequency as subgraphs of another, larger graph. We also introduce a sampling methodology for g-tries that successfully trades accuracy for faster execution times. We identify opportunities for parallelism in motif discovery and create an associated taxonomy. We expose the whole motif computation as a tree-based search and devise a general methodology for parallel execution with dynamic load balancing, including a novel strategy capable of efficiently stopping and dividing computation on the fly. In particular, we provide parallel algorithms for ESU and g-tries.

Finally, we extensively evaluate our algorithms on a diverse set of complex networks. We show that we are able to outperform all existing sequential algorithms and that our parallel algorithms scale almost linearly up to 128 processors. By combining the power of g-tries and parallelism, we speed up motif discovery by several orders of magnitude, thus effectively pushing the limits of its applicability.
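To make the underlying computation concrete, the following is a minimal sketch of a subgraph census, the counting step on which motif discovery rests. It enumerates every connected k-vertex induced subgraph of an undirected graph exactly once, in the style of the ESU enumeration named above, and groups the results by isomorphism class. The function names, the adjacency-dictionary representation, and the brute-force canonical labelling are assumptions made for illustration only; the sketch does not implement g-tries, sampling, or parallel execution, and it does not compare counts against random networks, which is what would be needed to decide whether a frequent subgraph is actually a motif.

```python
from collections import Counter
from itertools import permutations


def connected_subgraphs(adj, k):
    """Yield each connected k-vertex set of an undirected graph exactly once.

    adj maps every vertex to the set of its neighbours (hypothetical layout).
    """
    def extend(sub, ext, root):
        if len(sub) == k:
            yield frozenset(sub)
            return
        ext = set(ext)  # private copy, consumed by this call only
        while ext:
            w = ext.pop()
            # Vertices already in the subgraph or adjacent to it.
            covered = sub | {u for s in sub for u in adj[s]}
            # Exclusive neighbours of w: larger than the root and not covered.
            new = {u for u in adj[w] if u > root and u not in covered}
            yield from extend(sub | {w}, ext | new, root)

    for v in adj:
        yield from extend({v}, {u for u in adj[v] if u > v}, v)


def canonical_form(adj, nodes):
    """Brute-force canonical label: the largest upper-triangle bit tuple
    over all vertex orderings. Only practical for very small k."""
    best = None
    for perm in permutations(sorted(nodes)):
        bits = tuple(
            int(perm[j] in adj[perm[i]])
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
        )
        if best is None or bits > best:
            best = bits
    return best


def subgraph_census(adj, k):
    """Count connected k-vertex induced subgraphs per isomorphism class."""
    return Counter(canonical_form(adj, s) for s in connected_subgraphs(adj, k))


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
census = subgraph_census(adj, 3)
print(dict(census))  # one triangle and two 3-vertex paths
```

The brute-force labelling tries all k! vertex orderings for every occurrence found, so the cost of isomorphism handling quickly dominates. Reducing this kind of repeated work is the sort of saving the abstract attributes to sharing common substructures in a prefix-tree-like structure.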
