A Quantitative Analysis and Sentence Alignment for Parallel Corpora of ShiJi

Abstract: We conducted quantitative and qualitative analyses of parallel corpora of ShiJi (Records of the Grand Historian), which pair the Ancient Chinese text with its Contemporary Chinese translation. Our research reveals that the basic word order of the two texts remains similar. Long sentences in the Ancient Chinese text tend to be translated into long sentences in the Contemporary Chinese version, and short sentences into short ones. The evaluation function δ of paragraph length and sentence length in the two texts is consistent with a normal distribution. A considerable number of identical Chinese characters appear in both source and target sentences. The alignment mode of sentences and clauses is mainly 1-to-1. A maximum entropy model combines sentence/clause length, alignment mode, and co-occurring Chinese characters to align sentences and clauses in the parallel corpora of ShiJi. Precision and recall are higher for clause alignment than for sentence alignment.
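As a rough illustration of two of the features named in the abstract, the sketch below computes a length-based score and a character co-occurrence score for a candidate sentence pair. It is not the authors' implementation. It assumes that δ takes the form used in Gale and Church's length-based aligner, δ = (l_t − c·l_s) / sqrt(l_s·s²), which the abstract does not confirm. The function names, the parameters `c` and `s2`, and the choice of a Dice coefficient for character overlap are all hypothetical.

```python
import math
from collections import Counter


def length_delta(src_len, tgt_len, c=1.0, s2=1.0):
    """Normalized length difference (assumed Gale-Church-style form).

    c  : assumed expected number of target characters per source character
    s2 : assumed variance of that ratio
    Both would have to be estimated from aligned data; the defaults are placeholders.
    """
    if src_len <= 0:
        raise ValueError("src_len must be positive")
    return (tgt_len - c * src_len) / math.sqrt(src_len * s2)


def char_overlap(src, tgt):
    """Dice coefficient over the character multisets of two strings."""
    a, b = Counter(src), Counter(tgt)
    shared = sum((a & b).values())  # characters counted with multiplicity
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 0.0
```

In a maximum entropy aligner, values like these would enter as features alongside the alignment mode (1-to-1, 1-to-2, and so on). A δ close to zero and a high overlap score would both count as evidence for pairing two sentences or clauses.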
