Recognition of comparability criteria in a specialized multilingual corpus

Our goal is to automate the compilation of specialized comparable corpora from the Web. Comparability is defined at three levels: domain, topic, and type of discourse. Domain and topic can be filtered using the keywords of the web search. This paper presents the automatic detection of the type of discourse in specialized French and Japanese documents, a task that requires in-depth linguistic analysis. A contrastive analysis of the documents lets us determine which information is discriminating. Drawing on classical work in information retrieval, we create a robust and linguistically motivated typology based on three levels of analysis: structural, modal, and lexical. This typology is used to learn classification models using shallow parsing. The models give good results, which demonstrates the effectiveness of the typology.
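To make the three analysis levels more concrete, here is a minimal sketch of how one feature per level could feed a discourse-type classifier. Everything specific in it is an assumption for illustration: the abstract does not list the actual features, the learner, or the class labels. The cue lists, the labels "scientific" and "popular", the English toy texts (the study itself works on French and Japanese), and the nearest-centroid classifier are all hypothetical stand-ins.

```python
import re
from statistics import mean

# Hypothetical cue lists: the source does not enumerate its features.
MODAL_CUES = {"you", "we", "must", "should", "can"}
LEXICAL_CUES = ("et al", "p-value", "cohort", "bibliography")


def features(text):
    """Return one illustrative feature per level: structural, modal, lexical."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z\-]+", lowered)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = max(len(tokens), 1)
    # Structural: mean sentence length in tokens.
    structural = len(tokens) / max(len(sentences), 1)
    # Modal: share of tokens that address the reader or express obligation.
    modal = sum(t in MODAL_CUES for t in tokens) / n_tokens
    # Lexical: density of specialized markers (substring counts).
    lexical = sum(lowered.count(cue) for cue in LEXICAL_CUES) / n_tokens
    return [structural, modal, lexical]


def train(labelled_texts):
    """Average the feature vectors of each label (nearest-centroid model)."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(features(text))
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }


def classify(model, text):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    x = features(text)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])),
    )


# Toy training data with assumed labels, for illustration only.
TRAINING = [
    ("The cohort study reported a significant p-value across the observed "
     "population under controlled conditions. Results follow Smith et al "
     "and the bibliography lists related work.", "scientific"),
    ("You should eat well. You can walk daily. We must rest.", "popular"),
]
MODEL = train(TRAINING)
```

Calling `classify(MODEL, some_text)` returns one of the two assumed labels. The features are left unscaled here, so sentence length dominates the distance; a real system would normalize the features and use a trained learner over a much richer feature set.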
