Online Detection of Domain-Specific New Words in Text Streams

With the rapid development of the Internet, many domain-specific new words appear in media text streams such as forums, Sina Weibo, and WeChat. These new words are often important within their domains and matter for NLP tasks. Most existing models are either time-consuming or cannot handle out-of-vocabulary (OOV) words in streaming and online settings. In this paper, we propose an unsupervised method, D-TopWords with Gaussian LDA, for the online detection of domain-specific new words. Unlike traditional new word detection models, our method is a joint statistical model based on a finite word dictionary and requires no handcrafted features. By further introducing Gaussian LDA into the model, we address the problem of OOV words arriving in new text streams. Experimental results show that our method successfully extracts domain-specific new words and outperforms some state-of-the-art methods on the online detection task.
