Shallow Processing with Unification and Typed Feature Structures - Foundations and Applications

There is a growing trend toward deploying lightweight linguistic analysis to convert the vast bulk of raw textual information held in digital data repositories into structured and valuable knowledge. Recent advances in information extraction, text mining, and textual question answering demonstrate the benefit of applying shallow text processing (STP) techniques, which are assumed to be considerably less time-consuming and more robust than deep processing systems, yet still sufficient to cover a broad range of linguistic phenomena. This article gives a walkthrough of the foundations and applications of SProUT (Shallow Processing with Unification and Typed feature structures), a novel platform for the development of multilingual STP systems. It consists of several linguistic processing resources that can be coupled flexibly to build higher-level linguistic engines, and it provides an integrated grammar development and testing environment. The motivation for developing SProUT comes from the need for a system that (i) allows flexible integration of different processing modules and (ii) strikes a good trade-off between processing efficiency and expressiveness of the formalism. On the one hand, very efficient finite-state (FS) devices have been successfully applied to real-world applications. On the other hand, unification-based grammars (UBGs) are designed to capture fine-grained syntactic and semantic constraints, resulting in better descriptions of natural language phenomena. In contrast to FS devices, unification-based grammars are also assumed to be more transparent and more easily modifiable. The idea of SProUT is to take the best of these two worlds by having an FS machine that operates on typed feature structures (TFSs).
That is, transduction rules in SProUT do not rely on simple atomic symbols but on TFSs: the left-hand side (LHS) of a rule is a regular expression over TFSs, representing the recognition pattern, and the right-hand side (RHS) is a TFS, specifying the output structure. Consequently, equality of atomic symbols is replaced by unifiability of TFSs, and the output is constructed using TFS unification with respect to a type hierarchy.
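
The replacement of symbol equality by unifiability can be illustrated with a small Python sketch. It is not SProUT's implementation or rule syntax: the type hierarchy, the dictionary encoding of TFSs, and the function names (`glb`, `unify`, `match_sequence`) are all assumptions made for illustration. The hierarchy is assumed to be tree-shaped, the pattern is a plain sequence (the simplest regular expression), and coreferences between LHS and RHS are left out.

```python
# Hypothetical toy type hierarchy: child -> parent ("top" is the root).
PARENT = {"token": "top", "word": "token", "number": "token",
          "noun": "word", "verb": "word"}


def ancestors(t):
    """Return t followed by all of its supertypes up to the root."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain


def glb(t1, t2):
    """Most specific common subtype in a tree-shaped hierarchy, else None."""
    if t1 in ancestors(t2):
        return t2
    if t2 in ancestors(t1):
        return t1
    return None


def unify(a, b):
    """Unify two TFSs encoded as dicts with a 'type' key.

    Atomic values are plain strings and unify only if equal.
    Returns the unified structure, or None if unification fails.
    """
    if isinstance(a, str) or isinstance(b, str):
        return a if a == b else None
    t = glb(a["type"], b["type"])
    if t is None:
        return None
    out = {"type": t}
    for feat in (a.keys() | b.keys()) - {"type"}:
        if feat in a and feat in b:
            val = unify(a[feat], b[feat])
            if val is None:
                return None
            out[feat] = val
        else:
            # Feature present on one side only: carry it over unchanged.
            out[feat] = a.get(feat, b.get(feat))
    return out


def match_sequence(pattern, tokens):
    """Match a sequence of TFS patterns against the leading tokens.

    Each pattern element must unify with the token at the same position.
    Returns the list of unified structures, or None if any position fails.
    """
    if len(tokens) < len(pattern):
        return None
    unified = [unify(p, t) for p, t in zip(pattern, tokens)]
    return None if any(u is None for u in unified) else unified


tokens = [{"type": "noun", "surface": "Berlin"},
          {"type": "number", "surface": "2004"}]
pattern = [{"type": "word"}, {"type": "number"}]
result = match_sequence(pattern, tokens)
```

Here the pattern element of type `word` matches the `noun` token because `noun` is a subtype of `word`, even though the two symbols are not equal. A pattern element of type `verb` would fail at the same position, since `verb` and `noun` have no common subtype in this toy hierarchy.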
