Data Mining in Complex Networks: Missing Link Prediction and Fuzzy Communities

This dissertation is devoted to networks: complex interconnected systems whose individual components are joined by binary links arranged in seemingly random but intrinsically structured patterns. Networks are used to model real-world phenomena ranging from protein interactions in living organisms to the large-scale organisation of human society and the structure of technological networks such as software systems or the Internet. The first part of the dissertation studies a stochastic graph model that can be regarded as an extension of Erdős–Rényi random graphs. I discuss some basic statistical properties of the model and devise methods for fitting it to a given network instance. I also demonstrate how the fitted model can be used to predict previously unknown connections in the network. The second part studies overlapping communities (i.e., dense subgraphs) in sparse networks. I introduce a method based on fuzzy partition matrices and vertex similarity that uncovers meaningful, possibly overlapping communities and identifies bridge vertices that belong significantly to more than one community. Finally, I present applications of the link prediction and community detection methods to real-world datasets.
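To make the notion of a fuzzy partition matrix concrete, the following minimal sketch stores one row per community and one column per vertex, with each column summing to 1. The function names, the example matrix, and the threshold rule for flagging bridge candidates are illustrative assumptions only; they are not the method developed in the dissertation.

```python
def is_fuzzy_partition(U, tol=1e-9):
    """Check that U (rows = communities, columns = vertices) is a fuzzy
    partition matrix: entries lie in [0, 1] and each column sums to 1."""
    n = len(U[0])
    in_range = all(-tol <= u <= 1 + tol for row in U for u in row)
    cols_ok = all(abs(sum(row[j] for row in U) - 1.0) <= tol for j in range(n))
    return in_range and cols_ok


def bridge_candidates(U, threshold=0.25):
    """Return indices of vertices whose second-largest membership degree is
    at least `threshold`. This rule is a hypothetical stand-in for
    'belongs significantly to more than one community'.
    Requires at least two communities (rows)."""
    n = len(U[0])
    result = []
    for j in range(n):
        memberships = sorted((row[j] for row in U), reverse=True)
        if memberships[1] >= threshold:
            result.append(j)
    return result


# Made-up example: two communities, five vertices.
# Vertex 2 is shared equally between the two communities.
U = [
    [0.95, 0.90, 0.50, 0.05, 0.10],
    [0.05, 0.10, 0.50, 0.95, 0.90],
]

if __name__ == "__main__":
    print(is_fuzzy_partition(U))
    print(bridge_candidates(U))
```

In this toy matrix only the vertex with evenly split membership passes the threshold; a different threshold or a different significance measure would change which vertices are flagged.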
