CRIE: An automated analyzer for Chinese texts

Textual analysis has been applied in various fields, such as discourse analysis, corpus studies, text leveling, and automated essay evaluation. Several tools have been developed for analyzing texts written in alphabetic languages such as English and Spanish. However, no tool is currently available for analyzing Chinese-language texts. This article introduces a tool for the automated analysis of simplified and traditional Chinese texts, called the Chinese Readability Index Explorer (CRIE). CRIE comprises four subsystems, incorporates 82 multilevel linguistic features, and performs the major tasks of segmentation, syntactic parsing, and feature extraction. Furthermore, by integrating these linguistic features with machine learning models, CRIE provides leveling and diagnostic information for texts in language arts, texts for learning Chinese as a foreign language, and texts with domain knowledge. The usage and validation of CRIE's functions are also described.
