A Linguistic and Mathematical Method for Mapping Thematic Trends from Texts

We present a novel method for mapping thematic trends called "Classification by Preferential Clustered Link" (CPCL). This method clusters relevant textual units (terms) from a corpus of texts, based on meaningful linguistic relations (syntactic variations) identified amongst the units. Terms related through syntactic variations are represented in the form of a graph and are first clustered into connected components using the subset of variation relations affecting the modifier word(s) in a term. The connected components are in turn clustered into classes using the subset of variation relations affecting the head word in a term. Through a chronological analysis of the terms, the method pinpoints the evolution of research topics. The CPCL method differs from classical data analysis methods in that it integrates n meaningful linguistic relations as classification criteria. The method also avoids the bias caused by fixing class size before classification, which splits classes artificially during clustering. The graph formalism, the theoretical model underlying the CPCL method, offers a powerful means of representing the linguistic relations between terms.

INTRODUCTION

Thematic trends mapping is an aspect of scientific and technological watch concerned with detecting the evolution of research topics in a given field. It involves collecting written data from databases or from the Internet and analysing them automatically. The results obtained can be used for decision support. Until recently, methods used to analyse textual data were based solely on statistical techniques [1] that do not take into consideration the linguistic nature of the data. Although some researchers have come to consider the linguistic properties of textual data [2, 3], they still base classification on statistical criteria, namely occurrences or co-occurrences of textual units (words, noun phrases, keywords or terms).
We present a novel method, called "Classification by Preferential Clustered Link" (CPCL), for clustering relevant textual units using meaningful linguistic relations. Classification is used here in the data analysis sense, which means gathering together things that are close. The test data used in this study consists of scientific abstracts and titles in English, making up approximately 29,000 words and representing 187 publications that appeared between 1981 and 1993 in the field of plant biotechnology. The corpus, constituted in 1993, covered the publications of the four most productive authors in this field, who belong to four different research laboratories worldwide. From these texts, 3159 terms appearing as noun phrases were extracted after prior morphological and syntactic analyses.

Terms are meaningful textual units representing concepts or objects in a given knowledge field. As such, they are important in applications like automatic indexing, computer-aided translation, terminology updating, information retrieval and technology watch. Variations are changes affecting the structure and the form of a term, thus producing another textual unit close to the initial one. Variations may be signs of terminological evolution and thus of the evolution of the concepts represented by the terms in the domain studied. For example, given the two terms "dna amplification" and "dna amplification fingerprinting", a shift in topic is apparent in the second term, where "amplification" moves from the head to the modifier position. The head is the subject of the noun phrase or the "noun focus". It is usually the last noun in a compound structure (dna amplification fingerprinting) and the last noun preceding the preposition in a syntagmatic structure (amplification fingerprinting of dna). The modifiers are all the other nouns and adjectives in the noun phrase.
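The head and modifier definitions above can be illustrated with a minimal sketch. It assumes whitespace-tokenised, lower-cased terms, a small hand-picked preposition list, and no part-of-speech tagging, so every word other than the head and the prepositions is treated as a modifier. The function name `split_head_modifiers` and the `PREPOSITIONS` set are hypothetical and are not taken from the CPCL implementation.

```python
from typing import List, Tuple

# Assumption: a small illustrative preposition list, not the one used in the study.
PREPOSITIONS = {"of", "in", "for", "on", "by", "with", "from", "to"}


def split_head_modifiers(term: str) -> Tuple[str, List[str]]:
    """Return (head, modifiers) for a noun phrase given as a plain string."""
    words = term.lower().split()
    if not words:
        raise ValueError("empty term")
    # Position of the first preposition, if the term is syntagmatic.
    prep_idx = next((i for i, w in enumerate(words) if w in PREPOSITIONS), None)
    if prep_idx == 0:
        raise ValueError("term cannot start with a preposition")
    # Compound structure: the head is the last word.
    # Syntagmatic structure: the head is the word just before the preposition.
    head_idx = len(words) - 1 if prep_idx is None else prep_idx - 1
    modifiers = [
        w for i, w in enumerate(words)
        if i != head_idx and w not in PREPOSITIONS
    ]
    return words[head_idx], modifiers


print(split_head_modifiers("dna amplification"))
print(split_head_modifiers("dna amplification fingerprinting"))
print(split_head_modifiers("amplification fingerprinting of dna"))
```

Under these assumptions, "amplification" is returned as the head of the first term and as a modifier of the other two, which is the head-to-modifier shift described in the example.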
Past studies have focused on the extraction of terms from large textual corpora [4, 5] or on the updating of terminological databases from a corpus of specialised texts through the extraction of syntactic term variants [6]. However, these studies did not address the use of linguistic knowledge gained from terminological variation for tracking trends in a scientific or technical domain. Here, we show how syntactic variations become the key information for detecting the evolution of research topics. The CPCL method bases term classification on these syntactic relations and uses the graph theory formalism to represent terms as well as the relations between them. It thus combines linguistic and mathematical models to map out thematic trends from a textual corpus and, through a chronological analysis, enables us to trace their evolution. The rest of the paper is organised as follows: §1 briefly outlines the syntactic variations identified amongst terms; §2 presents the classification method and §3 illustrates its application to terms. The morpho-syntactic phases of analysis leading to the extraction of terms are detailed in [7] and thus will not be presented here.

1. SYNTACTIC VARIATIONS AMONGST TERMS

We studied three types of syntactic variations occurring frequently in English texts: permutation, substitution and expansion. A further distinction is made amongst these types depending on the grammatical function of the word affected by the variation: head or modifier.

1.1 Permutation

Permutation marks the transformation of a term from a syntagmatic structure (with a prepositional phrase) to a compound one and occurs in the following context:

    t1 = A N M1 h p m M2
    t2 = A m M2 N M1 h

where t1 and t2 are terms found in the corpus, A is a string of adjectives, N is a noun, M1 and M2 are strings of words, p is a preposition, h is a head noun and m is a word. A, N and M may be empty. Thus "azolla-anabaena accession" is the permutation of "accession of azolla-anabaena".
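The permutation context can be sketched in a few lines of code. This is only an illustration under stated assumptions: terms are whitespace-tokenised and lower-cased, the preposition list is a hand-picked sample, only the first preposition is considered, and, since no part-of-speech tagging is performed, the length of the adjective string A must be supplied by the caller. The names `to_compound`, `is_permutation`, `leading_adjectives` and `PREPOSITIONS` are hypothetical.

```python
from typing import List

# Assumption: a small illustrative preposition list, not the one used in the study.
PREPOSITIONS = {"of", "in", "for", "on", "by", "with", "from", "to"}


def to_compound(term: str, leading_adjectives: int = 0) -> str:
    """Rewrite t1 = A N M1 h p m M2 as t2 = A m M2 N M1 h.

    `leading_adjectives` is the number of words forming A (hypothetical
    parameter; a real system would obtain it from syntactic analysis).
    Terms without an inner preposition are returned unchanged.
    """
    words: List[str] = term.lower().split()
    a, rest = words[:leading_adjectives], words[leading_adjectives:]
    for i, w in enumerate(rest):
        # Split at the first preposition that has words on both sides.
        if w in PREPOSITIONS and 0 < i < len(rest) - 1:
            left, right = rest[:i], rest[i + 1:]  # "N M1 h" and "m M2"
            return " ".join(a + right + left)     # "A m M2 N M1 h"
    return " ".join(words)


def is_permutation(t1: str, t2: str) -> bool:
    """True if two distinct terms share the same compound form."""
    n1, n2 = " ".join(t1.lower().split()), " ".join(t2.lower().split())
    return n1 != n2 and to_compound(n1) == to_compound(n2)


print(to_compound("accession of azolla-anabaena"))
print(is_permutation("accession of azolla-anabaena", "azolla-anabaena accession"))
```

Normalising every syntagmatic term to its compound form in this way means that two terms can be compared position by position, whichever structure they originally appeared in.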
Thirty-seven terms were related by this variation type. Permutation is the pivot variation phenomenon which guides the identification of the other syntactic variations, since it concerns terms in both syntactic structures, compound and syntagmatic. In fact, in order to identify substitution or expansion variants, it suffices to transform terms in a syntagmatic structure into their compound version. For example, given the two terms "nitrogenase activity" and "nitrogenase activity of cv. bragg", transforming the latter into its compound version "cv. bragg nitrogenase activity" allows us to detect immediately that the variation type involved is left-expansion (see 1.3 below).

* University of Le Havre, Institute of Technology, Dept. of Information Communication, Place Robert Schuman, B.P. 4006, F-76610 Le Havre, FRANCE. E-mail: fidelia@iut.univ-lehavre.fr