Paper information - Recent developments in the statistical processing of textual data

Statisticians are accustomed to processing numerical, ordinal or nominal data. In many circumstances, such as socio-economic or epidemiological sample surveys and documentary databases, such data are juxtaposed with textual data (for example, responses to open questions in surveys). This article presents a series of language-independent procedures based on applying multivariate techniques (such as correspondence analysis and clustering) to sets of generalized lexical profiles. The generalized lexical profile of a text is a vector whose components are the frequencies of each word (graphical form) or ‘repeated segment’ (a sequence of words appearing with a significant frequency in the text). Processing such large (and often sparse) vectors and matrices requires special algorithms. The main outputs are the following: (1) printouts of the characteristic words and characteristic responses for each category of respondent (these categories are generally derived from available nominal variables); (2) graphical displays of the proximities between words or segments and categories of respondents; (3) when analysing a combination of several texts, graphical displays of the proximities between words or segments and each text, or between words or segments and groupings of texts. The systematic use of ‘repeated segments’ provides valuable help in interpreting the results from a semantic point of view.
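As an illustration of the generalized lexical profile described in the abstract, the following minimal Python sketch builds a texts-by-units frequency table whose columns are words and repeated segments. It is not the authors' implementation. The function names (`tokenize`, `repeated_segments`, `lexical_profiles`), the regular-expression tokenizer, the segment lengths of 2 to 3 words, and the plain count threshold `min_count` used in place of a "significant frequency" test are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a graphical form is a lowercased run of word characters.
    return re.findall(r"\w+", text.lower())


def repeated_segments(token_lists, min_len=2, max_len=3, min_count=2):
    # Count every word sequence of length min_len..max_len over the corpus
    # and keep those seen at least min_count times. A raw count threshold
    # is a stand-in (assumption) for a proper significance criterion.
    counts = Counter()
    for tokens in token_lists:
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seg for seg, c in counts.items() if c >= min_count}


def lexical_profiles(texts, min_count=2):
    # Returns (vocabulary, matrix): matrix[t][j] is the frequency of
    # unit vocabulary[j] (word or repeated segment) in text t.
    token_lists = [tokenize(t) for t in texts]
    segments = repeated_segments(token_lists, min_count=min_count)
    profiles = []
    for tokens in token_lists:
        profile = Counter(tokens)  # word frequencies
        for n in {len(seg) for seg in segments}:
            for i in range(len(tokens) - n + 1):
                seg = tuple(tokens[i:i + n])
                if seg in segments:
                    profile[" ".join(seg)] += 1  # segment frequencies
        profiles.append(profile)
    vocabulary = sorted(set().union(*profiles))
    matrix = [[p[unit] for unit in vocabulary] for p in profiles]
    return vocabulary, matrix


if __name__ == "__main__":
    responses = [
        "The cost of living is too high",
        "Cost of living and taxes",
        "I like my job",
    ]
    vocab, table = lexical_profiles(responses)
    for unit, column in zip(vocab, zip(*table)):
        print(unit, column)
```

A table of this shape is the kind of input to which correspondence analysis or clustering could then be applied. For a real corpus, a sparse representation would likely be preferable to the dense list of lists used here.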

Ludovic Lebart | André Salem | Lisette Berry
