Practical NLP-Based Text Indexing

We consider a set of natural language processing techniques, based on finite-state technology, that can be used to analyze large volumes of text. These techniques include an advanced tokenizer, a part-of-speech tagger that can handle ambiguous streams of words, a system that conflates words by means of derivational mechanisms, and a shallow parser that extracts syntactic dependency pairs. We propose using these techniques to improve the performance of standard indexing engines.
