Mining User-Generated Content for Social Research and Other Applications

User-generated content is becoming a valuable means of sensing and measuring real-world variables of interest to many actors in society: politicians, government departments, security agencies, marketing researchers, service providers, and others. In response, institutions and organizations around the world are investing substantial research effort in the so-called “social media” phenomenon, with many different objectives and across a diverse range of fields and disciplines. As a consequence, new technologies and applications are emerging that build on human participation, interaction, and behavior on the Internet. The main objective of this chapter is to present a general overview of the most relevant applications of text mining and natural language processing technologies emerging around the Web 2.0 phenomenon (such as automatic categorization, document summarization, question answering, dialogue management, opinion mining, sentiment analysis, outlier identification, misbehavior detection, and social estimation and forecasting), along with the main challenges and new research opportunities that derive, directly or indirectly, from them.
