Research on Open-Domain Named Entity Recognition Based on Chinese Query Logs

Search engine query logs contain large numbers of named entities. Named entity recognition is a basic task in information extraction, but traditional methods can only extract entities of specific, predefined categories, which makes them difficult to apply directly to query logs. This paper proposes a novel approach to extracting named entities from user query logs. To avoid dependence on a large manually tagged corpus, we annotate the training data automatically using distant supervision, which removes the need for human annotation effort. Open-domain named entities are then extracted from the query logs with a conditional random field (CRF) model. Evaluation on user query logs shows that the approach is effective at extracting named entities in the open domain.
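The abstract does not say how the distant supervision step is carried out, so the following is only a minimal sketch of one common way to do it. It assumes a seed dictionary of known entity strings, character-level tokens, and a B/I/O tagging scheme. The names `distant_label`, `char_features`, and `seed_entities` are hypothetical and do not come from the paper.

```python
def distant_label(query, seed_entities):
    """Tag each character of `query` as B (entity start), I (inside), or O (outside).

    Labels come from greedy longest-match lookup in `seed_entities`,
    an assumed dictionary of known entity strings.
    """
    tags = ["O"] * len(query)
    max_len = max((len(e) for e in seed_entities), default=0)
    i = 0
    while i < len(query):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(query) - i), 0, -1):
            if query[i:i + n] in seed_entities:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def char_features(query, i):
    """Build a small feature dict for position i (illustrative features only)."""
    return {
        "char": query[i],
        "prev": query[i - 1] if i > 0 else "<BOS>",
        "next": query[i + 1] if i < len(query) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    seeds = {"刘德华"}          # assumed seed entity dictionary
    query = "刘德华的电影"      # example query
    labels = distant_label(query, seeds)
    features = [char_features(query, k) for k in range(len(query))]
    for ch, tag in zip(query, labels):
        print(ch, tag)
```

Pairs of feature sequences and tag sequences produced this way could serve as automatically annotated training data for a CRF sequence labeler. Labels from dictionary matching are noisy, so a real pipeline would likely need filtering that this sketch leaves out.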
