Task-oriented keyphrase extraction from social media

Keyphrase extraction from social media is a crucial and challenging task. Previous studies usually focus on extracting keyphrases that provide the summary of a corpus. However, they do not take users’ specific needs into consideration. In this paper, we propose a novel three-stage model to learn a keyphrase set that represents or related to a particular topic. Firstly, a phrase mining algorithm is applied to segment the documents into human-interpretable phrases. Secondly, we propose a weakly supervised model to extract candidate keyphrases, which uses a few pre-specific seed keyphrases to guide the model. The model consequently makes the extracted keyphrases more specific and related to the seed keyphrases (which reflect the user’s needs). Finally, to further identify the implicitly related phrases, the PMI-IR algorithm is employed to obtain the synonyms of the extracted candidate keyphrases. We conducted experiments on two publicly available datasets from news and Twitter. The experimental results demonstrate that our approach outperforms the state-of-the-art baselines and has the potential to extract high-quality task-oriented keyphrases.

[1]  Brian Lott,et al.  Survey of Keyword Extraction Techniques , 2012 .

[2]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[3]  Qing Xie,et al.  Exploiting link structure for web page genre identification , 2015, Data Mining and Knowledge Discovery.

[4]  Balaraman Ravindran,et al.  Latent dirichlet allocation based multi-document summarization , 2008, AND '08.

[5]  Nicu Sebe,et al.  The Many Shades of Negativity , 2017, IEEE Transactions on Multimedia.

[6]  Qiang Yang,et al.  Diverse Topic Phrase Extraction through Latent Semantic Analysis , 2006, Sixth International Conference on Data Mining (ICDM'06).

[7]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[8]  Qiang Yang,et al.  Diverse Topic Phrase Extraction from Text Collection , 2006 .

[9]  Lei Zhu,et al.  Unsupervised Visual Hashing with Semantic Assistant for Content-Based Image Retrieval , 2017, IEEE Transactions on Knowledge and Data Engineering.

[10]  Xiaojun Chang,et al.  Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[11]  Claire Cardie,et al.  Adapting a Polarity Lexicon using Integer Linear Programming for Domain-Specific Sentiment Classification , 2009, EMNLP.

[12]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[13]  K. P. Chow,et al.  An Information Extraction Framework for Digital Forensic Investigations , 2015, IFIP Int. Conf. Digital Forensics.

[14]  Xiaojun Chang,et al.  Feature Interaction Augmented Sparse Learning for Fast Kinect Motion Detection , 2017, IEEE Transactions on Image Processing.

[15]  Peter D. Turney Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL , 2001, ECML.

[16]  Heng Ji,et al.  A Language-Independent Neural Network for Event Detection , 2016, ACL 2016.

[17]  Xin Liu,et al.  Generic text summarization using relevance measure and latent semantic analysis , 2001, SIGIR '01.

[18]  Chengzhi Zhang,et al.  Automatic Keyword Extraction from Documents Using Conditional Random Fields , 2008 .

[19]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[20]  Min Yang,et al.  Ordering-Sensitive and Semantic-Aware Topic Modeling , 2015, AAAI.

[21]  Clare R. Voss,et al.  Scalable Topical Phrase Mining from Text Corpora , 2014, Proc. VLDB Endow..

[22]  Carl Gutwin,et al.  Domain-Specific Keyphrase Extraction , 1999, IJCAI.

[23]  Feiping Nie,et al.  Compound Rank- $k$ Projections for Bilinear Analysis , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[24]  Peter D. Turney Learning Algorithms for Keyphrase Extraction , 2000, Information Retrieval.

[25]  David A. Shamma,et al.  Tweet the debates: understanding community annotation of uncollected sources , 2009, WSM@MM.

[26]  Peter D. Turney Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews , 2002, ACL.

[27]  Dingju Zhu,et al.  Title A topic model for building fine-grained domain-specific emotionlexicon , 2014 .

[28]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[29]  Alex A. Freitas,et al.  Document Clustering and Text Summarization , 2000 .

[30]  Kuo Zhang,et al.  Keyword extraction based on tf/idf for Chinese news document , 2007, Wuhan University Journal of Natural Sciences.

[31]  Xueqin Lin,et al.  An examination of on-line machine learning approaches for pseudo-random generated data , 2016, Cluster Computing.

[32]  Yi Yang,et al.  Bi-Level Semantic Representation Analysis for Multimedia Event Detection , 2017, IEEE Transactions on Cybernetics.

[33]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Nikos Mamoulis,et al.  Real-time Detection and Sorting of News on Microblogging Platforms , 2015, PACLIC.

[35]  K. P. Chow,et al.  Learning Domain-specific Sentiment Lexicon with Supervised Sentiment-aware LDA , 2014, ECAI.

[36]  Lee-Feng Chien,et al.  PAT-tree-based keyword extraction for Chinese information retrieval , 1997, SIGIR '97.