Large-Scale Chinese Cross-Document Entity Disambiguation and Information Fusion

Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities; it arises in information fusion and automated knowledge base construction. In this paper, we describe a Chinese Information Extraction (IE) and fusion system built on the Hadoop framework. The system performs document-level IE and corpus-level IE using a pipelined, multi-level modular approach to named entity recognition (EDR), entity relationship extraction, and information fusion. In document-level IE, the information associated with each mention of a name is merged into a rich entity profile by our co-reference and alias modules. In corpus-level IE, entity disambiguation is performed by agglomerative hierarchical clustering implemented with MapReduce. Visualized results of the entity-centric information graph are demonstrated.
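The corpus-level step can be illustrated with a minimal single-machine sketch. It is not the system's implementation: the bag-of-features profile representation, cosine similarity, average linkage, the 0.3 stopping threshold, and all function names are assumptions made for illustration, and the grouping by name only imitates what a MapReduce job would do with the name as the key.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster_profiles(vectors, threshold=0.3):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches the (assumed) similarity threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        score, x, y = max(
            (average_link(clusters[x], clusters[y], vectors), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        clusters[x] += clusters[y]  # x < y, so index x stays valid
        del clusters[y]
    return clusters


def disambiguate(profiles, threshold=0.3):
    """profiles: iterable of (doc_id, name, feature_list) tuples, one per
    document-level entity profile (a hypothetical input format)."""
    # "Map" step: key each profile by its name string.
    by_name = defaultdict(list)
    for doc_id, name, features in profiles:
        by_name[name].append((doc_id, Counter(features)))
    # "Reduce" step: cluster the profiles that share a name.
    result = {}
    for name, items in by_name.items():
        vectors = [vec for _, vec in items]
        result[name] = [
            sorted(items[i][0] for i in cluster)
            for cluster in cluster_profiles(vectors, threshold)
        ]
    return result
```

Given three profiles that share one name, two of which have overlapping context features, `disambiguate` groups the overlapping pair into one entity and leaves the third as a separate entity. Each name is clustered independently, which is what makes the per-key partitioning of a MapReduce-style job a natural fit for this step.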
