Mining knowledge from databases: an information network analysis approach

Most people consider a database to be merely a data repository that supports data storage and retrieval. In fact, a database contains rich, interrelated, multi-typed data and information, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived from such information networks if we systematically develop effective and scalable database-oriented information network analysis technology. In this tutorial, we introduce database-oriented information network analysis methods and demonstrate how information networks can be used to improve data quality and consistency, facilitate data integration, and generate interesting knowledge. The tutorial presents an organized picture of how to turn a database into one or a set of organized heterogeneous information networks; how information networks can be used for data cleaning, data consolidation, and data quality improvement; how to discover various kinds of knowledge from information networks; how to perform OLAP in information networks; and how to transform database data into knowledge by information network analysis. Moreover, we present case studies on real datasets, including DBLP and Flickr, and show how interesting and organized knowledge can be generated from database-oriented information networks.
