Mining Heterogeneous Information Networks: Principles and Methodologies

Real-world physical and abstract data objects are interconnected, forming gigantic, interconnected networks. By structuring these data objects and interactions between these objects into multiple types, such networks become semi-structured heterogeneous information networks. Most real-world applications that handle big data, including interconnected social media and social networks, scientific, engineering, or medical information systems, online e-commerce systems, and most database systems, can be structured into heterogeneous information networks. Therefore, effective analysis of large-scale heterogeneous information networks poses an interesting but critical challenge. In this book, we investigate the principles and methodologies of mining heterogeneous information networks. Departing from many existing network models that view interconnected data as homogeneous graphs or networks, our semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and uncovers surprisingly rich knowledge from the network. This semi-structured heterogeneous network modeling leads to a series of new principles and powerful methodologies for mining interconnected data, including: (1) rank-based clustering and classification; (2) meta-path-based similarity search and mining; (3) relation strength-aware mining, and many other potential developments. This book introduces this new research frontier and points out some promising research directions. Table of Contents: Introduction / Ranking-Based Clustering / Classification of Heterogeneous Information Networks / Meta-Path-Based Similarity Search / Meta-Path-Based Relationship Prediction / Relation Strength-Aware Clustering with Incomplete Attributes / User-Guided Clustering via Meta-Path Selection / Research Frontiers

[1]  David Liben-Nowell,et al.  The link-prediction problem for social networks , 2007 .

[2]  Jiawei Han,et al.  Mining advisor-advisee relationships from research publication networks , 2010, KDD.

[3]  Ben Taskar,et al.  Discriminative Probabilistic Models for Relational Data , 2002, UAI.

[4]  Philip S. Yu,et al.  Integrating meta-path selection with user-guided object clustering in heterogeneous information networks , 2012, KDD.

[5]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[6]  ChengXiang Zhai,et al.  Statistical Language Models for Information Retrieval , 2008, NAACL.

[7]  Yizhou Sun,et al.  RankClus: integrating clustering with ranking for heterogeneous information network analysis , 2009, EDBT '09.

[8]  Charu C. Aggarwal,et al.  Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes , 2012, Proc. VLDB Endow..

[9]  Rui Li,et al.  Exploring social tagging graph for web object classification , 2009, KDD.

[10]  Ichigaku Takigawa,et al.  A spectral clustering approach to optimally combining numericalvectors with a modular network , 2007, KDD '07.

[11]  Leo Katz,et al.  A new status index derived from sociometric analysis , 1953 .

[12]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[13]  Joydeep Ghosh,et al.  CONSENSUS-BASED ENSEMBLES OF SOFT CLUSTERINGS , 2008, MLMTA.

[14]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[15]  Nitesh V. Chawla,et al.  New perspectives and methods in link prediction , 2010, KDD.

[16]  Ben Taskar,et al.  Probabilistic Classification and Clustering in Relational Data , 2001, IJCAI.

[17]  Charu C. Aggarwal,et al.  Social Network Data Analytics , 2011 .

[18]  M. Newman Clustering and preferential attachment in growing networks. , 2001, Physical review. E, Statistical, nonlinear, and soft matter physics.

[19]  Philip S. Yu,et al.  Object Distinction: Distinguishing Objects with Identical Names , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[20]  Wei-Ying Ma,et al.  Object-level ranking: bringing order to Web objects , 2005, WWW '05.

[21]  Lyle H. Ungar,et al.  Statistical Relational Learning for Link Prediction , 2003 .

[22]  Bo Zhao,et al.  A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration , 2012, Proc. VLDB Endow..

[23]  CHENGXIANG ZHAI,et al.  A study of smoothing methods for language models applied to information retrieval , 2004, TOIS.

[24]  Philip S. Yu,et al.  Spectral clustering for multi-type relational data , 2006, ICML.

[25]  Foster Provost,et al.  A Simple Relational Classifier , 2003 .

[26]  David Cohn,et al.  Learning to Probabilistically Identify Authoritative Documents , 2000, ICML.

[27]  Mohammad Al Hasan,et al.  Link prediction using supervised learning , 2006 .

[28]  Charu C. Aggarwal,et al.  When will it happen?: relationship prediction in heterogeneous information networks , 2012, WSDM '12.

[29]  Chris Clifton,et al.  Knowledge discovery from transportation network data , 2005, 21st International Conference on Data Engineering (ICDE'05).

[30]  A. Barabasi,et al.  Evolution of the social network of scientific collaborations , 2001, cond-mat/0104162.

[31]  E. Rogers Diffusion of Innovations , 1962 .

[32]  Fan Chung,et al.  Spectral Graph Theory , 1996 .

[33]  Bo Zhao,et al.  Community evolution detection in dynamic heterogeneous information networks , 2010, MLG '10.

[34]  Foster J. Provost,et al.  Classification in Networked Data: a Toolkit and a Univariate Case Study , 2007, J. Mach. Learn. Res..

[35]  Philip S. Yu,et al.  PathSim , 2011, Proc. VLDB Endow..

[36]  Edoardo M. Airoldi,et al.  Mixed Membership Stochastic Blockmodels , 2007, NIPS.

[37]  Zoubin Ghahramani,et al.  Learning from labeled and unlabeled data with label propagation , 2002 .

[38]  Oren Etzioni,et al.  Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[39]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[40]  Charu C. Aggarwal,et al.  Co-author Relationship Prediction in Heterogeneous Bibliographic Networks , 2011, 2011 International Conference on Advances in Social Networks Analysis and Mining.

[41]  Francesco Bonchi,et al.  Cold start link prediction , 2010, KDD.

[42]  Jiawei Han,et al.  Ranking-based classification of heterogeneous information networks , 2011, KDD.

[43]  Bernhard Schölkopf,et al.  Ranking on Data Manifolds , 2003, NIPS.

[44]  Annette J. Dobson,et al.  An introduction to generalized linear models , 1991 .

[45]  Berthier A. Ribeiro-Neto,et al.  Combining link-based and content-based methods for web document classification , 2003, CIKM '03.

[46]  Yizhou Sun,et al.  Heterogeneous source consensus learning via decision propagation and negotiation , 2009, KDD.

[47]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[48]  Hector Garcia-Molina,et al.  Combating Web Spam with TrustRank , 2004, VLDB.

[49]  Albert,et al.  Emergence of scaling in random networks , 1999, Science.

[50]  Peter J. Cameron,et al.  Rank three permutation groups with rank three subconstituents , 1985, J. Comb. Theory, Ser. B.

[51]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[52]  Philip S. Yu,et al.  A probabilistic framework for relational clustering , 2007, KDD '07.

[53]  Yizhou Sun,et al.  Query-driven discovery of semantically similar substructures in heterogeneous networks , 2012, KDD.

[54]  Lise Getoor,et al.  Link-Based Classification , 2003, Encyclopedia of Machine Learning and Data Mining.

[55]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[56]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[57]  Hong Cheng,et al.  Graph Clustering Based on Structural/Attribute Similarities , 2009, Proc. VLDB Endow..

[58]  Jennifer Widom,et al.  Scaling personalized web search , 2003, WWW '03.

[59]  Yizhou Sun,et al.  Graph Regularized Transductive Classification on Heterogeneous Information Networks , 2010, ECML/PKDD.

[60]  China Chengdu,et al.  Health and Medical , 2012 .

[61]  Inderjit S. Dhillon,et al.  Semi-supervised graph clustering: a kernel approach , 2005, ICML '05.

[62]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[63]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[64]  Yun Chi,et al.  Combining link and content for community detection: a discriminative approach , 2009, KDD.

[65]  Philip S. Yu,et al.  Graph OLAP: Towards Online Analytical Processing on Graphs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[66]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[67]  Jitendra Malik,et al.  Normalized Cuts and Image Segmentation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[68]  Jennifer Neville,et al.  Relational Dependency Networks , 2007, J. Mach. Learn. Res..

[69]  Philip S. Yu,et al.  Relevance search in heterogeneous networks , 2012, EDBT '12.

[70]  Michalis Faloutsos,et al.  On power-law relationships of the Internet topology , 1999, SIGCOMM '99.

[71]  N. Christakis,et al.  The Spread of Obesity in a Large Social Network Over 32 Years , 2007, The New England journal of medicine.

[72]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[73]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[74]  M E J Newman Assortative mixing in networks. , 2002, Physical review letters.

[75]  Yizhou Sun,et al.  Ranking-based clustering of heterogeneous information networks with star network schema , 2009, KDD.

[76]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[77]  Jennifer Widom,et al.  SimRank: a measure of structural-context similarity , 2002, KDD.

[78]  Srinivasan Parthasarathy,et al.  Local Probabilistic Models for Link Prediction , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[79]  Margaret Werner-Washburne,et al.  Integrative Construction and Analysis of Condition-specific Biological Networks , 2008, AAAI.

[80]  Deng Cai,et al.  Topic modeling with network regularization , 2008, WWW.

[81]  Jiawei Han,et al.  Graph cube: on warehousing and OLAP multidimensional networks , 2011, SIGMOD '11.

[82]  Cyrus Shahabi,et al.  Voronoi-Based K Nearest Neighbor Search for Spatial Network Databases , 2004, VLDB.

[83]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[84]  Jignesh M. Patel,et al.  Efficient aggregation for graph summarization , 2008, SIGMOD Conference.

[85]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[86]  C. Lee Giles The Future of CiteSeer: CiteSeerx , 2006, PKDD.

[87]  Jaana Kekäläinen,et al.  Cumulated gain-based evaluation of IR techniques , 2002, TOIS.

[88]  Katharina Morik,et al.  Local Pattern Detection, International Seminar, Dagstuhl Castle, Germany, April 12-16, 2004, Revised Selected Papers , 2005, Local Pattern Detection.

[89]  Yizhou Sun,et al.  iTopicModel: Information Network-Integrated Topic Modeling , 2009, 2009 Ninth IEEE International Conference on Data Mining.