Information Extraction From Microblogs: A Survey

Microblogging (e.g., Twitter, http://twitter.com) is a new form of online communication in which users talk about their daily lives, publish opinions, or share information through short posts. It has become one of the most popular social networking services today, making it a potentially large information base that is attracting increasing attention from researchers in knowledge discovery and data mining. In this paper, we survey existing research on information extraction from microblogging services and its applications, and then discuss some promising directions for future work. We specifically analyze three types of information: personal, social, and travel information.
