Constructing Multiple Domain Taxonomy for Text Processing Tasks

In recent years, large volumes of short text data have become easy to collect from platforms such as microblogs and product review sites. Such data often spans several domains, and because these domains are difficult to tell apart within the text, effective multi-domain text processing is challenging. The concept of a multiple domain taxonomy (MDT) has shown promising performance in processing multi-domain text data. However, an MDT has to be constructed manually, which is time consuming and requires considerable expert knowledge of the relevant domains. To address these issues, in this paper we introduce a semi-automatic method for constructing an MDT that requires only a small amount of manual input, in combination with an unsupervised method for ranking multi-domain concepts based on semantic relationships learned from unlabeled data. We show that an MDT constructed iteratively with our semi-automatic method achieves higher accuracy than existing methods in domain classification, improving accuracy by up to 11%.
