Entropy-based Training Data Selection for Domain Adaptation

Training data selection is a common method for domain adaptation. Its goal is to choose a subset of the training data that works well for a given test set, and it has been shown to be effective for tasks such as machine translation and parsing. In this paper, we propose several entropy-based measures for training data selection and test their effectiveness on two tasks: Chinese word segmentation and part-of-speech tagging. Experimental results on the Chinese Penn Treebank indicate that some of the measures yield a statistically significant improvement over random selection on both tasks.
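The abstract does not define the proposed measures, so the following is only an illustrative sketch of what an entropy-style selection criterion can look like, not the paper's method. It assumes a character unigram model with add-alpha smoothing estimated from the test set, and ranks training sentences by their per-character cross-entropy under that model. The function names (`char_unigram_model`, `cross_entropy`, `select_by_entropy`) and the smoothing choice are hypothetical.

```python
import math
from collections import Counter


def char_unigram_model(texts, alpha=1.0):
    """Return a log2-probability function for characters, estimated from texts.

    Assumption: add-alpha smoothing with one extra vocabulary slot
    reserved for characters never seen in texts.
    """
    counts = Counter(ch for text in texts for ch in text)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 for the unseen-character slot

    def logprob(ch):
        return math.log2((counts.get(ch, 0) + alpha) / (total + alpha * vocab_size))

    return logprob


def cross_entropy(sentence, logprob):
    """Average negative log2-probability per character (bits per character)."""
    if not sentence:
        return float("inf")  # empty sentences are ranked last
    return -sum(logprob(ch) for ch in sentence) / len(sentence)


def select_by_entropy(train_sentences, test_sentences, k):
    """Keep the k training sentences with the lowest cross-entropy
    under a model estimated from the test sentences."""
    logprob = char_unigram_model(test_sentences)
    ranked = sorted(train_sentences, key=lambda s: cross_entropy(s, logprob))
    return ranked[:k]


if __name__ == "__main__":
    test_set = ["aaab", "aab"]
    train_pool = ["aaaa", "zzzz", "abab"]
    print(select_by_entropy(train_pool, test_set, k=2))
```

In this toy run, the sentence made of characters unseen in the test set receives the highest cross-entropy and is dropped. A realistic setup would likely use a stronger language model and compare the selected subset against a random subset of the same size, as the abstract describes.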
