Event-based classification of social media streams

Events play a prominent role in our lives, and many social media documents describe or relate to some event. Organizing social media documents by event is therefore a promising way to manage the ever-increasing amount of content in social media applications. The challenge is to automate this process so that incoming documents are assigned to their corresponding event without user intervention. We present a system that classifies a stream of social media data into a growing and evolving set of events. In doing so, we address two key problems that arise in this context: i) scaling to the data sizes and rates encountered in social media applications, and ii) new event detection, i.e. determining whether an incoming data item belongs to a new or a known event. We address the first by including a candidate retrieval step that retrieves a set of events the incoming data item is likely to belong to, and the second by including a function, trained using machine learning techniques, that determines whether the incoming data item belongs to the top-scoring candidate or to a new event. We show that our system addresses both problems and that it outperforms other state-of-the-art approaches in terms of quality and scalability.
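The two-step idea in the abstract (retrieve a few candidate events, then decide between the top candidate and a new event) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `EventStream`, the bag-of-words representation, the inverted index used for candidate retrieval, cosine similarity as the score, and above all the fixed `threshold`, which only stands in for the function that the paper trains with machine learning techniques.

```python
from collections import Counter, defaultdict
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EventStream:
    """Hypothetical stream classifier: candidate retrieval + new-event decision."""

    def __init__(self, k: int = 5, threshold: float = 0.3):
        self.k = k                  # number of candidates to keep (assumed value)
        self.threshold = threshold  # placeholder for the trained decision function
        self.events = []            # event id -> aggregated term counts
        self.index = defaultdict(set)  # term -> ids of events containing it

    def candidates(self, doc: Counter):
        """Retrieve up to k events sharing a term with doc, best score first."""
        ids = set()
        for term in doc:
            ids |= self.index.get(term, set())
        scored = sorted(((cosine(doc, self.events[i]), i) for i in ids), reverse=True)
        return scored[: self.k]

    def is_new_event(self, top_score: float) -> bool:
        """Stand-in decision rule; the paper uses a learned function instead."""
        return top_score < self.threshold

    def assign(self, terms) -> int:
        """Assign one incoming document to a known event or open a new one."""
        doc = Counter(terms)
        cands = self.candidates(doc)
        if not cands or self.is_new_event(cands[0][0]):
            event_id = len(self.events)
            self.events.append(Counter())
        else:
            event_id = cands[0][1]
        self.events[event_id].update(doc)
        for term in doc:
            self.index[term].add(event_id)
        return event_id


stream = EventStream()
first = stream.assign(["concert", "berlin", "stage"])
second = stream.assign(["concert", "berlin", "crowd"])
third = stream.assign(["marathon", "london", "runners"])
```

In this sketch the inverted index limits scoring to events that share at least one term with the incoming document, which is one simple way to avoid comparing against every known event. The second document joins the first event and the third opens a new one.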
