Detecting Content-Heavy Sentences: A Cross-Language Case Study

The information conveyed by some sentences would be more easily understood by a reader if it were expressed in multiple sentences. We call such sentences content-heavy: they may be grammatical, but they are cumbersome and difficult to comprehend. In this paper, we introduce the task of detecting content-heavy sentences in a cross-lingual context. Specifically, we develop methods to identify sentences in Chinese for which English speakers would prefer translations consisting of more than one sentence. We base our analysis and definitions on evidence from multiple human translations and on reader preferences for flow and understandability. We show that machine translation quality on content-heavy sentences is markedly worse than overall quality and that this type of sentence is fairly common in Chinese news. We demonstrate that sentence length and punctuation usage in Chinese are not sufficient clues for accurately detecting heavy sentences, and we present a richer classification model that accurately identifies these sentences.
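
To make the surface clues concrete, here is a minimal sketch of the kind of length-and-punctuation baseline that the abstract describes as insufficient on its own. It is not the paper's implementation: the function names, the punctuation set, and both thresholds are assumptions chosen only for illustration.

```python
# Hypothetical baseline: flag a Chinese sentence as possibly content-heavy
# using only its character length and a count of clause-level punctuation.
# The punctuation set and the thresholds are illustrative assumptions.
CLAUSE_PUNCT = set("，、；：")  # full-width comma, enumeration comma, semicolon, colon


def surface_features(sentence: str) -> dict:
    """Return the two surface clues: length in characters and punctuation count."""
    chars = [c for c in sentence if not c.isspace()]
    return {
        "length": len(chars),
        "clause_punct": sum(1 for c in chars if c in CLAUSE_PUNCT),
    }


def baseline_is_heavy(sentence: str, min_length: int = 40, min_punct: int = 3) -> bool:
    """Flag a sentence only if it is both long and punctuation-rich."""
    feats = surface_features(sentence)
    return feats["length"] >= min_length and feats["clause_punct"] >= min_punct


if __name__ == "__main__":
    example = "我们去公园，然后回家。"
    print(surface_features(example))
    print(baseline_is_heavy(example))
```

A rule like this sees only surface form, which is consistent with the abstract's claim that a richer classification model is needed to identify heavy sentences accurately.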
