Scalable Event-Based Clustering of Social Media Via Record Linkage Techniques

We tackle the problem of grouping content available in social media applications such as Flickr, YouTube, and Panoramio into clusters of documents describing the same event. This task has previously been referred to as event identification. We present a new formalization of the event identification task as a record linkage problem and show that this formulation leads to a principled and highly efficient solution. We report results on two datasets derived from Flickr (last.fm and upcoming), evaluated in terms of Normalized Mutual Information and F-Measure, and show that the record linkage approach outperforms several baselines as well as a state-of-the-art system. We also demonstrate that our approach scales to large amounts of data, reducing processing time considerably compared to a state-of-the-art approach. This scalability is achieved by applying an appropriate blocking strategy and by relying on a Single Linkage clustering algorithm, which avoids the exhaustive computation of pairwise similarities.
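To illustrate how blocking and Single Linkage clustering can work together, the following is a minimal sketch and not the implementation used in the paper. It assumes hypothetical document fields (`id`, `day`, `tags`), uses the upload day as a stand-in blocking key, and uses Jaccard similarity over tags with an arbitrary threshold in place of a learned similarity measure. Documents are only compared within a block, and pairs that are already in the same cluster are skipped, so not every pairwise similarity has to be computed.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two tag collections (illustrative similarity only)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(docs, threshold=0.5):
    """Group documents into events via blocking plus single-linkage merging.

    docs: list of dicts with hypothetical keys "id", "day", "tags".
    threshold: assumed similarity cut-off for linking two documents.
    """
    # Blocking: only documents sharing a blocking key are candidate pairs.
    blocks = defaultdict(list)
    for d in docs:
        blocks[d["day"]].append(d)

    parent = {d["id"]: d["id"] for d in docs}
    for block in blocks.values():
        for a, b in combinations(block, 2):
            ra, rb = find(parent, a["id"]), find(parent, b["id"])
            if ra == rb:
                continue  # already in one cluster: skip the similarity computation
            # Single linkage: one sufficiently similar pair merges two clusters.
            if jaccard(a["tags"], b["tags"]) >= threshold:
                parent[ra] = rb

    clusters = defaultdict(list)
    for d in docs:
        clusters[find(parent, d["id"])].append(d["id"])
    return sorted(sorted(c) for c in clusters.values())
```

In this sketch the blocking key is exact, so two documents of the same event uploaded on different days would never be compared; a practical blocking strategy would need overlapping or windowed keys to avoid that loss of recall.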
