Synthesis Lectures on Data Mining and Knowledge Discovery

The past decade has witnessed the emergence of the participatory Web and social media, bringing people together in many creative ways. Millions of users are playing, tagging, working, and socializing online, demonstrating new forms of collaboration, communication, and intelligence that were hardly imaginable just a short time ago. Social media also helps reshape business models, sways opinions and emotions, and opens up numerous possibilities to study human interaction and collective behavior at an unparalleled scale.

This lecture, from a data mining perspective, introduces characteristics of social media, reviews representative tasks of computing with social media, and illustrates associated challenges. It introduces basic concepts, presents state-of-the-art algorithms with easy-to-understand examples, and recommends effective evaluation methods. In particular, we discuss graph-based community detection techniques and many important extensions that handle dynamic, heterogeneous networks in social media. We also demonstrate how discovered patterns of communities can be used for social media mining. The concepts, algorithms, and methods presented in this lecture can help harness the power of social media and support building socially intelligent systems.

This book is an accessible introduction to the study of community detection and mining in social media. It is essential reading for students, researchers, and practitioners in disciplines and applications where social media is a key source of data that piques our curiosity to understand, manage, innovate, and excel. The book is supported by additional materials, including lecture slides, the complete set of figures, key references, some toy data sets used in the book, and the source code of representative algorithms. Readers are encouraged to visit the book website for the latest information: http://dmml.asu.edu/cdm/
