A keyword extraction method from Twitter messages represented as graphs

Abstract: Twitter is a microblogging service that generates a huge amount of textual content daily. Exploring this content calls for text mining, natural language processing, information retrieval, and related techniques, and automatic keyword extraction is a particularly useful task in this context. A fundamental step in text mining is building a model for text representation. The vector space model (VSM) is the best known and most widely used such model. However, limitations of the VSM, such as scalability and sparsity, motivate alternative approaches. This paper proposes a keyword extraction method for tweet collections, referred to as TKG, that represents texts as graphs and applies centrality measures to find the relevant vertices (keywords). To assess its performance, three sets of experiments are performed. The first applies TKG to a text from Time magazine and compares its performance with results reported in the literature. The second takes tweets from three different TV shows, applies TKG, and compares it with TF-IDF and KEA, using human classifications as benchmarks. Finally, the three algorithms are applied to tweet sets of increasing size, and their running times are measured and compared. Together, these experiments give a general overview of how TKG can be used in practice, how it performs against other standard approaches, and how it scales to larger data instances. The results show that TKG is a novel and robust approach for extracting keywords from texts, particularly short messages such as tweets.
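The general idea of representing tweets as a graph and ranking vertices by centrality can be illustrated with a minimal Python sketch. This is not the TKG algorithm itself: the abstract does not specify how edges are assigned or which centrality measures are used, so the choices here (linking adjacent tokens within a tweet, scoring by degree centrality, a tiny hypothetical stopword list, and the names `tokenize`, `build_graph`, `degree_centrality`, and `top_keywords`) are assumptions made only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "on", "for", "rt"}


def tokenize(tweet):
    """Lowercase the tweet, strip URLs and @mentions, and drop stopwords."""
    cleaned = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return [t for t in re.findall(r"[a-z0-9']+", cleaned) if t not in STOPWORDS]


def build_graph(tweets):
    """Build an undirected graph: one vertex per token, one edge per pair of
    tokens that appear next to each other in some tweet (assumed edge rule)."""
    adj = defaultdict(set)
    for tweet in tweets:
        tokens = tokenize(tweet)
        for tok in tokens:
            adj[tok]  # touching the key registers isolated tokens as vertices
        for a, b in zip(tokens, tokens[1:]):
            if a != b:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def degree_centrality(adj):
    """Degree of each vertex divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def top_keywords(tweets, k=3):
    """Return the k most central tokens, breaking ties alphabetically."""
    scores = degree_centrality(build_graph(tweets))
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


if __name__ == "__main__":
    sample = [
        "Great finale tonight",
        "finale was great",
        "loved the finale music",
    ]
    print(top_keywords(sample, k=3))
```

In this sketch the whole tweet collection contributes to a single graph, so a token that co-occurs with many different neighbours across tweets rises to the top. Other centrality measures, such as closeness, could be substituted for `degree_centrality` without changing the rest of the pipeline.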
