Developing insights from social media using semantic lexical chains to mine short text structures

Abstract

Individuals and organizations increasingly use social media to communicate, and the vast amounts of publicly available data it stores are a rich source of information and insights. Human readers can usually infer meaning from short text such as microblogs and Facebook posts because they understand the context and terminology used. Automated data mining can be effective for gaining insights from text data, but accurately inferring meaning from the text of a single social media account remains a significant challenge. Social media communication uses very short, or sparse, text, which yields a relatively small sample of usable words for analysis, and interpreting contextual meaning from so few words is difficult. This research proposes a methodology that extracts semantic lexical chains from the frequently occurring words in a single social media account and uses these chains to mine short text structures and infer the user's overall themes. The methodology is based on a proposed clustering algorithm and is illustrated with examples from Facebook posts. The algorithm is evaluated by comparing it with existing work and by applying it to a variety of news posts. The methodology could be useful for gaining decision-making insights from social media or from other online sources with short or sparse text.
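As a rough illustration of the idea summarized above, the following Python sketch counts the frequently occurring words in a handful of posts and groups them into chains of related words. It is not the clustering algorithm proposed in this work. The `RELATED` table is a hypothetical, hand-written stand-in for a lexical resource such as WordNet, and the stop-word list, the `min_count` threshold, and the function names `frequent_words` and `build_chains` are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy relatedness table (an assumption); a real system would
# query a lexical resource such as WordNet instead.
RELATED = {
    frozenset(("election", "vote")),
    frozenset(("vote", "ballot")),
    frozenset(("storm", "flood")),
}

# Minimal illustrative stop-word list (an assumption).
STOPWORDS = {"the", "a", "to", "in", "and", "of", "on", "is", "for", "after"}


def frequent_words(posts, min_count=2):
    """Return non-stop-words that occur at least min_count times across posts."""
    counts = Counter(
        word
        for post in posts
        for word in re.findall(r"[a-z]+", post.lower())
        if word not in STOPWORDS
    )
    return [word for word, n in counts.most_common() if n >= min_count]


def build_chains(words):
    """Greedily group words into chains; a word joins every chain it is related to."""
    chains = []  # list of sets of words
    for word in words:
        linked = [
            chain for chain in chains
            if any(frozenset((word, member)) in RELATED for member in chain)
        ]
        merged = {word}
        for chain in linked:
            merged |= chain  # merge all chains that this word connects
        chains = [chain for chain in chains if all(chain is not c for c in linked)]
        chains.append(merged)
    # Sort for a deterministic, readable result.
    return sorted(sorted(chain) for chain in chains)


if __name__ == "__main__":
    sample_posts = [
        "Election vote today",
        "Vote early, ballot count",
        "Election ballot results",
        "Storm and flood warning",
        "Flood after storm",
    ]
    print(build_chains(frequent_words(sample_posts)))
```

Each resulting chain can be read as a candidate theme for the account. Restricting the input to frequent words is one simple way to cope with sparse text, since rare words offer little evidence for linking.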
