Event-based classification of Web 2.0 text streams

Web 2.0 applications like Twitter or Facebook create a continuous stream of information. This demands new ways of analysis that offer insight into the stream at the moment the information is created, because much of this data is relevant only for a short period of time. To address this problem, real-time search engines have recently received increased attention. They treat the continuous flow of information differently from traditional web search by incorporating temporal and social features that describe the context of the information at the time of its creation. Standard approaches, in which data is first stored and then processed from persistent storage, suffer from latency. We address the fluid and rapid nature of text streams with an event-based approach that analyses the stream of information directly. In a first step, we define the difference between real-time search and traditional search to clarify the demands of modern text filtering. In a second step, we show how event-based features can support the tasks of real-time search engines. Using Twitter as an example, we present a way to combine an event-based approach with text mining and information filtering concepts in order to classify incoming information based on stream features. We calculate stream-dependent features and feed them into a neural network in order to classify the text streams. We show the discriminative power of event-based features as the foundation for a real-time search engine.
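
To make the idea of stream-dependent features concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a sliding time window over incoming posts and three illustrative features (arrival rate, term overlap with the window, and a retweet flag). The names `Post`, `StreamFeatureExtractor` and `sigmoid_unit`, the 60-second window, and the choice of features are assumptions made for illustration only. A single sigmoid unit stands in for the neural network, and its weights would have to be learned from labelled data.

```python
import math
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class Post:
    timestamp: float  # seconds since some reference point
    author: str
    text: str


class StreamFeatureExtractor:
    """Derives features for each incoming post from a sliding time window.

    Hypothetical helper: the feature set is illustrative, not the paper's.
    """

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.window = deque()          # posts currently inside the window
        self.term_counts = Counter()   # number of window posts containing each term

    @staticmethod
    def _terms(text: str) -> set:
        return set(text.lower().split())

    def on_event(self, post: Post) -> list:
        # Evict posts that have fallen out of the time window.
        cutoff = post.timestamp - self.window_seconds
        while self.window and self.window[0].timestamp < cutoff:
            old = self.window.popleft()
            for term in self._terms(old.text):
                self.term_counts[term] -= 1
                if self.term_counts[term] == 0:
                    del self.term_counts[term]

        terms = self._terms(post.text)
        # Features are computed against the window *before* the post is added.
        rate = len(self.window) / self.window_seconds
        seen = sum(1 for term in terms if self.term_counts[term] > 0)
        overlap = seen / len(terms) if terms else 0.0
        is_retweet = 1.0 if post.text.startswith("RT ") else 0.0

        # Update the window state with the current post.
        self.window.append(post)
        for term in terms:
            self.term_counts[term] += 1
        return [rate, overlap, is_retweet]


def sigmoid_unit(features, weights, bias):
    """A single sigmoid neuron; a stand-in for a trained network."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    extractor = StreamFeatureExtractor(window_seconds=60.0)
    stream = [
        Post(0.0, "a", "earthquake in chile"),
        Post(10.0, "b", "RT earthquake chile now"),
        Post(100.0, "c", "lunch time"),
    ]
    weights, bias = [1.0, 2.0, 0.5], -1.0  # placeholder values, not learned
    for post in stream:
        features = extractor.on_event(post)
        print(post.text, features, sigmoid_unit(features, weights, bias))
```

Because each post is handled as it arrives and only the window state is kept in memory, no round trip to persistent storage is needed before a feature vector is available for classification.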
