Paper information - Modélisation du prétraitement des textes

Modélisation du prétraitement des textes

In this article, we define a model for the text preprocessing step in the context of text mining and, more generally, of information extraction from texts. The article does not cover implementation details. The objective is to obtain a generic model for normalizing raw texts. The motivation is to generalize the existing work on this preprocessing step, which is little known and highly specialized. Yet this step is unavoidable, and the quality of the analyses obtained at all subsequent stages depends heavily on it.

Thomas Heitz

[1] Pasi Tapanainen, et al. What is a word, What is a sentence? Problems of Tokenization, 1994.

[2] Martti Juhola, et al. Stemming and lemmatization in the clustering of finnish text documents, 2004, CIKM '04.

[3] Gilles Adda, et al. Towards tokenization evaluation, 1998, LREC.

[4] Andrei Mikheev, et al. Document centered approach to text normalization, 2000, SIGIR '00.

[5] José Gabriel Pereira Lopes, et al. Extraction automatique d'associations textuelles à partir de corpora non traités, 2000.

[6] Lori Lamel, et al. Text normalization and speech recognition in French, 1997, EUROSPEECH.

[7] David A. Hull. Stemming Algorithms: A Case Study for Detailed Evaluation, 1996, J. Am. Soc. Inf. Sci.

[8] Thomas Heitz, et al. From the Texts to the Contexts They Contain: A Chain of Linguistic Treatments, 2004, TREC.

[9] Tong Zhang, et al. Updating an NLP system to fit new domains: an empirical study on the sentence segmentation problem, 2003, CoNLL.

[10] Stephen Tomlinson, et al. Lexical and Algorithmic Stemming Compared for 9 European Languages with Hummingbird SearchServer™ at CLEF 2003, 2003, CLEF.

[11] Éric Villemonte de la Clergerie, et al. MAF: a Morphosyntactic Annotation Framework, 2005.