Transliteration Extraction from Classical Chinese Buddhist Literature Using Conditional Random Fields

Extracting plausible transliterations from historical literature is a key issue in historical linguistics and other research fields. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different language preferences among translators. To assist researchers in historical linguistics and the digital humanities, this paper proposes a transliteration extraction method based on conditional random fields (CRFs), with features derived from the characteristics of the Chinese characters used in transliterations, which are well suited to identifying transliteration characters. To evaluate our method, we compiled an evaluation set from two Buddhist texts, the Samyuktagama and the Lotus Sutra. We also constructed a baseline approach that combines suffix-array-based extraction with a phonetic similarity measure. Our method substantially outperforms the baseline, achieving a recall of 0.9561 and a precision of 0.9444. The results show that our method is highly effective at extracting transliterations from classical Chinese texts.
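
To make the sequence-labeling framing concrete, the following is a minimal sketch of how character-level features for a CRF tagger and the decoding of its output might look. It is an illustration only: the abstract does not list the actual features, so the `TRANSLIT_CHARS` set, the feature names, the B/I/O tag scheme, and the helper functions `char_features` and `extract_spans` are all assumptions rather than the method used in the paper.

```python
from typing import Dict, List

# Hypothetical set of characters that frequently occur in transliterations.
# This is an assumption for illustration, not the paper's feature lexicon.
TRANSLIT_CHARS = set("阿彌陀婆羅迦摩那難")


def char_features(sentence: str, i: int) -> Dict[str, object]:
    """Build a feature dict for the character at position i.

    The features (current character, neighbours, membership in a
    transliteration-character set) are assumed; a CRF toolkit would
    consume one such dict per character.
    """
    ch = sentence[i]
    return {
        "char": ch,
        "is_translit_char": ch in TRANSLIT_CHARS,
        "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
        "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


def extract_spans(sentence: str, tags: List[str]) -> List[str]:
    """Collect the substrings labeled B, I, I, ... as transliteration candidates."""
    spans: List[str] = []
    current: List[str] = []
    for ch, tag in zip(sentence, tags):
        if tag == "B":
            # A new span starts; close any span still open.
            if current:
                spans.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            # Continue the open span.
            current.append(ch)
        else:
            # "O" (or a stray "I" with no open span) ends the current span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


sentence = "佛告阿難陀"
features = [char_features(sentence, i) for i in range(len(sentence))]
# Tags as a trained CRF might predict them (hand-written here).
predicted_tags = ["O", "O", "B", "I", "I"]
candidates = extract_spans(sentence, predicted_tags)
```

In a full pipeline, the per-character feature dicts would be passed to a CRF toolkit for training and prediction, and `extract_spans` would turn the predicted tag sequence back into candidate transliterations for evaluation against the gold set.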
