Synthesis Lectures on Data Mining and Knowledge Discovery

This book offers a comprehensive overview of the concepts and research issues surrounding blogs, or weblogs. It introduces techniques and approaches, tools and applications, and evaluation methodologies, with examples and case studies. Blogs allow people to express their thoughts, voice their opinions, and share their experiences and ideas. Blogs also facilitate interactions among individuals, creating a network with unique characteristics. Through these interactions, individuals experience a sense of community. We elaborate on approaches that extract communities and cluster blogs based on information about the bloggers. Open standards and a low barrier to publication in the Blogosphere have transformed information consumers into producers, generating an overwhelming and ever-increasing amount of knowledge about the members, their environment, and their symbiosis. We elaborate on approaches that sift through very large blog data sources to identify influential and trustworthy bloggers by leveraging content and network information. Spam blogs, or "splogs", are an increasing concern in the Blogosphere and are discussed in detail, along with approaches that leverage supervised machine learning algorithms and interaction patterns. We elaborate on data collection procedures, provide resources for blog data repositories, mention various visualization and analysis tools for the Blogosphere, and explain conventional and novel evaluation methodologies to help readers perform research in the Blogosphere. The book is supported by additional material, including lecture slides and the complete set of figures used in the book, and the reader is encouraged to visit the book website for the latest information: http://tinyurl.com/mcp-agarwal

Table of Contents: Modeling Blogosphere / Blog Clustering and Community Discovery / Influence and Trust / Spam Filtering in Blogosphere / Data Collection and Evaluation
