A dynamic bibliometric model for identifying online communities

Predictive modelling of online dynamic user-interaction recordings, and community identification from such data, is becoming increasingly important with the widespread use of online communication technologies. Despite the time-dependent nature of the problem, existing approaches to community identification are based on static or fully observed network connections. Here we present a new, dynamic generative model for the inference of communities from a sequence of temporal events produced through online computer-mediated interactions. The distinctive feature of our approach is that it models the process more realistically by accounting for possible random temporal delays between the intended connections. The inference of these delays from the data forms an integral part of our state-clustering methodology, so that the most likely communities are found on the basis of the likely intended connections rather than just the observed ones. We derive a maximum likelihood estimation algorithm for the identification of our model. The algorithm is computationally efficient for the analysis of historical data and scales linearly with the number of non-zero observed (L + 1)-grams, where L is the Markov memory length. We also derive an incremental version of the algorithm, which could be used for real-time analysis. Results obtained on both synthetic and real-world data sets demonstrate that the approach is flexible and able to reveal novel and insightful structural aspects of online interactions. In particular, the analysis of a full day's worth of synchronous Internet Relay Chat participation sequences reveals the formation of an extremely clear community structure.
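To make the scaling statement concrete, the following minimal sketch counts the distinct (L + 1)-grams that actually occur in an event sequence. It is an illustration only and not the estimation algorithm itself: the function name `count_ngrams`, the toy participant names, and the choice L = 1 are assumptions introduced here.

```python
from collections import Counter


def count_ngrams(events, memory_length):
    """Count each (L + 1)-gram observed in a sequence of events.

    `memory_length` plays the role of the Markov memory length L.
    Only (L + 1)-grams that actually occur receive an entry, so the
    size of the result is the number of non-zero observed (L + 1)-grams.
    """
    n = memory_length + 1
    return Counter(
        tuple(events[i:i + n]) for i in range(len(events) - n + 1)
    )


# Hypothetical toy sequence of participants posting in turn.
events = ["ann", "bob", "ann", "bob", "cat", "ann", "bob"]
counts = count_ngrams(events, memory_length=1)

# Number of non-zero observed 2-grams versus all possible 2-grams.
num_observed = len(counts)
num_possible = len(set(events)) ** 2
```

In this toy sequence only 4 of the 9 possible 2-grams are observed. A procedure whose cost grows with the observed count, rather than with the full set of possible (L + 1)-grams, benefits from this kind of sparsity.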
