Paper Information - Entropy-based Training Data Selection for Domain Adaptation

Training data selection is a common method for domain adaptation; its goal is to choose a subset of the training data that works well for a given test set. It has been shown to be effective for tasks such as machine translation and parsing. In this paper, we propose several entropy-based measures for training data selection and test their effectiveness on two tasks: Chinese word segmentation and part-of-speech tagging. The experimental results on the Chinese Penn Treebank indicate that some of the measures provide a statistically significant improvement over random selection for both tasks.
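The abstract does not spell out the proposed measures, so the following is only a minimal sketch of the general idea of entropy-based selection, not the paper's method. It assumes a character-level unigram model with add-one smoothing estimated from the target (test) text, and ranks candidate training sentences by their per-character cross-entropy under that model. The function names `train_unigram`, `cross_entropy`, and `select_training_data` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return an add-one-smoothed character unigram model (an assumed model choice)."""
    counts = Counter(ch for s in sentences for ch in s)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen characters

    def prob(ch):
        return (counts.get(ch, 0) + 1) / (total + vocab)

    return prob


def cross_entropy(sentence, prob):
    """Average negative log2 probability per character; lower means closer to the target."""
    if not sentence:
        return float("inf")  # empty sentences carry no evidence, so rank them last
    return -sum(math.log2(prob(ch)) for ch in sentence) / len(sentence)


def select_training_data(pool, target, k):
    """Keep the k pool sentences with the lowest cross-entropy under the target model."""
    prob = train_unigram(target)
    ranked = sorted(pool, key=lambda s: cross_entropy(s, prob))
    return ranked[:k]


if __name__ == "__main__":
    # Toy data, purely illustrative.
    target = ["abab", "abba"]
    pool = ["zzzz", "abab", "azaz"]
    print(select_training_data(pool, target, 2))
```

A random-selection baseline, which the abstract uses as the point of comparison, would replace the sort with something like `random.sample(pool, k)`. Characters are used as the modeling unit here because word boundaries are not available before segmentation; that is a design choice of this sketch rather than something stated in the abstract.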
