Combining statistical data analysis techniques to extract topical keyword classes from corpora

We present an unsupervised method for extracting sets of keywords from a textual corpus, where a keyword is a word whose occurrences in a text are strongly connected with the presence of a given topic. Each of these sets, or classes, is associated with one of the main topics of the corpus and can be used to detect the presence of that topic in any of its paragraphs by a simple keyword co-occurrence criterion. The classes are extracted from the textual data fully automatically, without requiring any a priori linguistic knowledge or making any assumptions about the topics to search for. Our algorithms yield satisfactory and directly usable results despite the noise inherent in textual data, thanks to a combination of several data analysis techniques. On a corpus of archives from the French monthly newspaper Le Monde Diplomatique, we obtain 40 classes of about 30 words each that accurately characterize precise topics and allow us to detect their occurrences with a precision of 85% and a recall of 65%.
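The keyword co-occurrence criterion mentioned above can be illustrated with a minimal Python sketch. The abstract does not specify the exact criterion, so everything here is an assumption made for illustration: the function names, the naive tokenizer, the toy keyword classes, and the rule that a topic is detected when at least `min_hits` distinct keywords of its class occur in the same paragraph. The `precision_recall` helper shows how detection quality of the kind reported above could be scored against a hand-labelled paragraph.

```python
import re

def tokenize(paragraph):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", paragraph.lower())

def detect_topics(paragraph, keyword_classes, min_hits=2):
    """Return the topics whose keyword class co-occurs in the paragraph.

    Assumption: a topic counts as present when at least `min_hits`
    distinct keywords of its class appear in the paragraph. The
    threshold is hypothetical, not taken from the paper.
    """
    words = set(tokenize(paragraph))
    detected = set()
    for topic, keywords in keyword_classes.items():
        if len(words & keywords) >= min_hits:
            detected.add(topic)
    return detected

def precision_recall(predicted, gold):
    """Precision and recall of a predicted topic set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy keyword classes, invented for this example only.
keyword_classes = {
    "economy": {"inflation", "bank", "currency", "debt"},
    "ecology": {"climate", "forest", "pollution", "emissions"},
}

paragraph = "The central bank raised rates to fight inflation and stabilise the currency."
topics = detect_topics(paragraph, keyword_classes)
```

Requiring several distinct keywords rather than a single one is one plausible way to limit false detections caused by an isolated ambiguous word; raising `min_hits` would be expected to trade recall for precision.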
