Graffiti: graph-based classification in heterogeneous networks

We address the problem of multi-label classification in heterogeneous graphs, where nodes belong to different types and each type has its own set of classification labels. We present an approach that classifies nodes based on their neighborhoods. We model the mutual influence of nodes as a random walk in which a random surfer distributes class labels to nodes while walking through the graph. Viewing class labels as "colors", the random surfer essentially sprays different node types with different color palettes; hence the name of our method, Graffiti. In contrast to previous work on topic-based random surfer models, our approach captures and exploits the mutual influence of nodes of the same type based on their connections to nodes of other types. We show important properties of our algorithm, such as convergence and scalability. We also confirm the practical viability of Graffiti in an experimental study on subsets of the popular social networks Flickr and LibraryThing, in which we compare it to three other state-of-the-art techniques for graph-based classification and find that it outperforms them.
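The random-surfer idea can be illustrated with a small, generic sketch. It is not the Graffiti algorithm itself, whose exact transition and update rules are not given in this abstract. It assumes a plain random walk with restart, run once per label, over two-hop paths that connect nodes of one type through nodes of another type. The function names (`two_hop_transitions`, `spray_labels`, `best_label`), the damping value `alpha=0.85`, the fixed iteration count, and the toy photo/user graph are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def two_hop_transitions(edges, node_type, target_type):
    """Walk probabilities between target-type nodes via neighbors of other types."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    trans = {}
    for u in list(nbrs):
        if node_type[u] != target_type:
            continue
        # First hop: pick a neighbor of a different type uniformly at random.
        bridges = [b for b in nbrs[u] if node_type[b] != target_type]
        row = defaultdict(float)
        for b in bridges:
            # Second hop: return uniformly to a target-type neighbor of the bridge.
            ends = [w for w in nbrs[b] if node_type[w] == target_type]
            for w in ends:
                row[w] += 1.0 / (len(bridges) * len(ends))
        trans[u] = dict(row)
    return trans

def spray_labels(trans, seeds, alpha=0.85, iters=50):
    """Random walk with restart per label (assumed scheme, not the paper's).

    seeds maps each label to the set of nodes known to carry it.
    Returns {label: {node: score}}.
    """
    scores = {}
    for label, seed_nodes in seeds.items():
        restart = {v: 1.0 / len(seed_nodes) for v in seed_nodes}
        s = dict(restart)
        for _ in range(iters):
            nxt = defaultdict(float)
            for v, mass in restart.items():
                nxt[v] += (1 - alpha) * mass      # jump back to a seed node
            for u, mass in s.items():
                for w, p in trans.get(u, {}).items():
                    nxt[w] += alpha * mass * p    # follow a two-hop path
            s = dict(nxt)
        scores[label] = s
    return scores

# Toy data (hypothetical): four photos, two users, two labeled photos.
edges = [("p1", "u1"), ("p2", "u1"), ("p2", "u2"), ("p3", "u2"), ("p4", "u2")]
node_type = {"p1": "photo", "p2": "photo", "p3": "photo", "p4": "photo",
             "u1": "user", "u2": "user"}
seeds = {"beach": {"p1"}, "city": {"p3"}}

trans = two_hop_transitions(edges, node_type, "photo")
scores = spray_labels(trans, seeds)

def best_label(node):
    """Highest-scoring label for a node; a multi-label variant would rank all labels."""
    return max(scores, key=lambda lab: scores[lab].get(node, 0.0))
```

In this toy graph, `p2` shares user `u1` only with the "beach" seed `p1`, so "beach" ends up as its strongest label, while `p4` reaches the "city" seed `p3` directly through `u2`. Because the restart probability `1 - alpha` is positive, each per-label iteration is a contraction and settles to a fixed point; for multi-label output one would keep the full per-label scores rather than only the maximum.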
