A Tagging Approach to Identify Complex Constituents for Text Simplification

Syntactic phenomena such as coordination and subordination are characteristic of long, complex sentences. Text simplification systems need to detect and categorise the constituents involved in order to generate simpler sentences. These constituents are typically bounded or linked by signs of syntactic complexity, which include conjunctions, complementisers, wh-words, and punctuation marks. This paper proposes a supervised tagging approach to classify these signs in accordance with their linking and bounding functions. The performance of the approach is evaluated both intrinsically, using an annotated corpus covering three different genres, and extrinsically, by evaluating the impact of classification errors on an automatic text simplification system. The results are encouraging.
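As a rough illustration of the step that precedes classification, the sketch below locates candidate signs of syntactic complexity in a sentence and assigns each a coarse surface type. It is not the paper's method: the word lists, the function name `find_signs`, and the regex tokeniser are all assumptions made for the example, and the supervised tagger that would assign linking or bounding functions to each sign is not shown.

```python
import re

# Illustrative, deliberately incomplete inventories (assumptions for this
# sketch, not the inventories used in the paper).
CONJUNCTIONS = {"and", "but", "or"}
COMPLEMENTISERS = {"that"}
WH_WORDS = {"who", "which", "what", "when", "where"}
PUNCTUATION = {",", ";", ":"}


def find_signs(sentence):
    """Return (token_index, token, coarse_type) for each candidate sign."""
    # Split into word tokens and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    signs = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in CONJUNCTIONS:
            kind = "conjunction"
        elif low in COMPLEMENTISERS:
            kind = "complementiser"
        elif low in WH_WORDS:
            kind = "wh-word"
        elif tok in PUNCTUATION:
            kind = "punctuation"
        else:
            continue  # not a candidate sign
        signs.append((i, tok, kind))
    return signs


if __name__ == "__main__":
    for sign in find_signs("The man, who was tired, said that he left and slept."):
        print(sign)
```

Each sign found this way would then be passed, with features drawn from its context, to a sequence tagger that predicts its linking or bounding function.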
