Unsupervised Active Learning of CRF Model for Cross Lingual Information Extraction

Manual annotation of the training data of information extraction models is a time consuming and expensive process but necessary for the building of information extraction systems. Active learning has been proven to be effective in reducing manual annotation efforts for supervised learning tasks where a human judge is asked to annotate the most informative examples with respect to a given model. However, in most cases reliable human judges are not available for all languages. In this paper, we propose a cross-lingual unsupervised active learning paradigm (XLADA) that generates high-quality automatically annotated training data from a word-aligned parallel corpus. To evaluate our paradigm, we applied XLADA on English-French and English-Chinese bilingual corpora then we trained French and Chinese information extraction models. The experimental results show that XLADA can produce effective models without manually-annotated training data.

[1]  Andrea Esuli,et al.  Sentence-Based Active Learning Strategies for Information Extraction , 2010, IIR.

[2]  Heng Ji,et al.  Joint bilingual name tagging for parallel corpora , 2012, CIKM '12.

[3]  Rong Pan,et al.  Bi-Weighting Domain Adaptation for Cross-Language Text Classification , 2011, IJCAI.

[4]  Benno Stein,et al.  Cross-Lingual Adaptation Using Structural Correspondence Learning , 2010, TIST.

[5]  Min Xiao,et al.  Transductive Representation Learning for Cross-Lingual Text Classification , 2012, 2012 IEEE 12th International Conference on Data Mining.

[6]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[7]  Ion Muslea,et al.  Active Learning with Multiple Views , 2009, Encyclopedia of Data Warehousing and Mining.

[8]  Joel Nothman,et al.  Transforming Wikipedia into Named Entity Training Data , 2008, ALTA.

[9]  Kristina Toutanova,et al.  Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia , 2012, ACL.

[10]  Patrick Schone,et al.  Mining Wiki Resources for Multilingual Named Entity Recognition , 2008, ACL.

[11]  David Yarowsky,et al.  Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora , 2001, HLT.

[12]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[13]  Xiaodong He Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation , 2007, WMT@ACL.

[14]  Ting Liu,et al.  Generating Chinese Named Entity Data from a Parallel Corpus , 2011, IJCNLP.

[15]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[16]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.