Modeling Comma Placement in Chinese Text for Better Readability using Linguistic Features and Gaze Information

Comma placement in Chinese text is relatively arbitrary, although some syntactic guidelines exist. In this research, we attempt to improve the readability of text by optimizing comma placement, integrating linguistic features of the text with gaze features of readers. We design a comma predictor for general Chinese text based on conditional random field models with linguistic features. We then build a rule-based filter that categorizes commas according to their contribution to readability, based on an analysis of the gaze of people reading text with and without commas. The experimental results show that our predictor reproduces the comma distribution in the Penn Chinese Treebank with an F1-score of 78.41, and that the commas chosen by our filter smooth certain gaze behaviors.
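As a rough illustration of how comma prediction can be framed as sequence labeling and scored with F1, the following minimal sketch strips commas from a word-segmented sentence, labels each remaining token by whether a comma followed it, and compares gold labels with predicted ones. The function names, the toy sentence, and the feature set (`word`, `prev_word`, `next_word`) are assumptions made for illustration only; the abstract does not specify the linguistic features or the CRF toolkit actually used, and no CRF is trained here.

```python
from typing import Dict, List, Tuple

COMMA = "，"  # full-width Chinese comma


def to_labeled_sequence(tokens: List[str]) -> List[Tuple[str, int]]:
    """Drop commas and label each remaining token 1 if a comma followed it."""
    labeled: List[Tuple[str, int]] = []
    for tok in tokens:
        if tok == COMMA:
            if labeled:  # ignore a comma with no preceding token
                word, _ = labeled[-1]
                labeled[-1] = (word, 1)
        else:
            labeled.append((tok, 0))
    return labeled


def token_features(words: List[str], i: int) -> Dict[str, str]:
    """Hypothetical per-token features of the kind a CRF could consume."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
    }


def f1_score(gold: List[int], pred: List[int]) -> float:
    """F1 over the positive class (a comma follows the token)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy segmented sentence (an assumption, not taken from the Penn Chinese Treebank).
tokens = ["昨天", "下雨", "，", "我", "没有", "出门", "。"]
labeled = to_labeled_sequence(tokens)
words = [w for w, _ in labeled]
gold = [y for _, y in labeled]
features = [token_features(words, i) for i in range(len(words))]
```

In a full pipeline, the per-token feature dictionaries and gold labels would be passed to a CRF trainer, and `f1_score` would compare the trained model's predicted labels against the gold comma positions.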
