Statisticians are accustomed to processing numerical, ordinal or nominal data. In many settings, such as socio-economic or epidemiological sample surveys and documentary databases, such data are accompanied by textual data (for example, responses to open questions in surveys). This article presents a series of language-independent procedures based on applying multivariate techniques (such as correspondence analysis and clustering) to sets of generalized lexical profiles. The generalized lexical profile of a text is a vector whose components are the frequencies of each word (graphical form) or ‘repeated segment’ (a sequence of words appearing with a significant frequency in the text). Processing such large (and often sparse) vectors and matrices requires special algorithms. The main outputs are the following: (1) printouts of the characteristic words and characteristic responses for each category of respondent (these categories are generally derived from the available nominal variables); (2) graphical displays of the proximities between words or segments and categories of respondents; (3) when several texts are analysed together, graphical displays of the proximities between words or segments and each text, or between words or segments and groupings of texts. The systematic use of ‘repeated segments’ is a valuable aid in interpreting the results from a semantic point of view.
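As an illustration of the definition above, the following is a minimal sketch of how generalized lexical profiles might be built and aggregated by category of respondent. It is not the article's algorithm: the function names, the toy responses, the category labels, and the parameters `max_len` and `min_count` are all hypothetical, and a plain count threshold stands in for the "significant frequency" criterion, which the abstract does not specify.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: a graphical form is any maximal run of word characters,
    # lowercased. No language-specific processing is applied.
    return re.findall(r"\w+", text.lower())


def repeated_segments(token_lists, max_len=3, min_count=2):
    # Count every word sequence of length 2..max_len across the corpus and
    # keep those seen at least min_count times (a stand-in threshold).
    counts = Counter()
    for tokens in token_lists:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {seg for seg, c in counts.items() if c >= min_count}


def lexical_profile(tokens, segments, max_len=3):
    # Frequencies of single words plus frequencies of retained segments.
    profile = Counter(tokens)
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            seg = tuple(tokens[i:i + n])
            if seg in segments:
                profile[" ".join(seg)] += 1
    return profile


def profiles_by_category(responses, max_len=3, min_count=2):
    # responses: iterable of (category, text) pairs. Returns a sparse
    # category -> Counter table, i.e. a contingency table stored by row.
    tokenized = [(cat, tokenize(text)) for cat, text in responses]
    segments = repeated_segments([t for _, t in tokenized], max_len, min_count)
    table = {}
    for cat, tokens in tokenized:
        table.setdefault(cat, Counter()).update(
            lexical_profile(tokens, segments, max_len))
    return table


# Invented toy responses to an open question, with a hypothetical age category.
responses = [
    ("under_30", "I worry about the cost of living"),
    ("under_30", "the cost of living and housing"),
    ("over_60", "health care and the cost of care"),
]
table = profiles_by_category(responses)
```

Storing each row as a `Counter` keeps the table sparse, which matters because most words and segments never occur in most categories. A table of this kind, categories by words and segments, is the sort of input to which correspondence analysis or clustering would then be applied.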