An Introduction to the Novel Challenges in Information Retrieval for Social Media

The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the many services available online. This has caused a massive change in the kinds of documents being accessed and retrieved. In this article we study how Information Retrieval models should change to reflect the changes in the documents being processed. We analyse the properties of online user-generated documents from some of the most established Internet services (e.g. Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (e.g. Wall Street Journal, Associated Press, Financial Times). We study the statistical properties of these collections (e.g. Zipf’s Law and Heaps’ Law) and investigate other important features, such as document similarity, term burstiness, emoticons and part-of-speech analysis. We highlight the applicability and limits of traditional content analysis techniques for the new online user-generated documents and show the need for specific processing of those documents in order to provide effective content analysis.
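
As an illustration of the statistical properties mentioned above, the following minimal sketch shows one way the rank-frequency curve (Zipf's Law) and the vocabulary growth curve (Heaps' Law) of a collection might be measured. It is not the procedure used in the article: the whitespace tokenizer, the function names and the least-squares fit in log-log space are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting; real user-generated text
    # (emoticons, URLs, hashtags) would need a more careful tokenizer.
    return text.lower().split()


def rank_frequency(tokens):
    # Zipf's Law view: (rank, frequency) pairs, most frequent term first.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def vocabulary_growth(tokens):
    # Heaps' Law view: (tokens seen so far, distinct terms so far) pairs.
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def loglog_slope(pairs):
    # Least-squares slope of log(y) against log(x). For a Zipf-like
    # rank-frequency curve this slope is negative; for a Heaps-like
    # growth curve it estimates the growth exponent.
    xs = [math.log(x) for x, _ in pairs]
    ys = [math.log(y) for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    tokens = tokenize("the cat saw the dog and the dog saw the bird")
    print(rank_frequency(tokens))
    print(vocabulary_growth(tokens)[-1])
```

Comparing the fitted slopes across collections (for example, a chat log against a newswire collection) is one simple way to see whether user-generated documents deviate from the behaviour expected of standard text.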
