Recent developments in the statistical processing of textual data

Statisticians are accustomed to processing numerical, ordinal or nominal data. In many settings, such as socio-economic or epidemiological sample surveys and documentary databases, these data are accompanied by textual data (for example, responses to open questions in surveys). This article presents a series of language-independent procedures that apply multivariate techniques (such as correspondence analysis and clustering) to sets of generalized lexical profiles. The generalized lexical profile of a text is a vector whose components are the frequencies of each word (graphical form) or ‘repeated segment’ (a sequence of words appearing with a significant frequency in the text). Processing such large (and often sparse) vectors and matrices requires special algorithms. The main outputs are the following: (1) printouts of the characteristic words and characteristic responses for each category of respondent (these categories are generally derived from available nominal variables); (2) graphical displays of the proximities between words or segments and categories of respondents; (3) when several texts are analysed together, graphical displays of the proximities between words or segments and each text, or between words or segments and groupings of texts. The systematic use of ‘repeated segments’ is a valuable aid in interpreting the results from a semantic point of view.
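
As an illustration of the generalized lexical profile defined above, the following is a minimal Python sketch, not the authors' implementation. It makes several assumptions that are not in the article: the function names (`tokenize`, `repeated_segments`, `lexical_profiles`) are hypothetical, graphical forms are taken to be lower-cased whitespace-separated tokens, and "significant frequency" is approximated by a plain count threshold (`min_count`) rather than a statistical test. Each profile is stored as a `Counter`, which acts as a sparse vector: only non-zero components are kept.

```python
from collections import Counter


def tokenize(text):
    """Split a text into graphical forms (assumed: lower-cased tokens)."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if word:
            tokens.append(word)
    return tokens


def segments_of(tokens, max_len):
    """Yield every sequence of 2..max_len consecutive words as a string."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def repeated_segments(token_lists, max_len=3, min_count=2):
    """Return the segments occurring at least min_count times in the corpus.

    The count threshold is an assumed stand-in for 'significant frequency'.
    """
    counts = Counter()
    for tokens in token_lists:
        counts.update(segments_of(tokens, max_len))
    return {seg for seg, c in counts.items() if c >= min_count}


def lexical_profiles(texts, max_len=3, min_count=2):
    """Build one sparse profile (word and repeated-segment counts) per text."""
    token_lists = [tokenize(t) for t in texts]
    kept = repeated_segments(token_lists, max_len, min_count)
    profiles = []
    for tokens in token_lists:
        profile = Counter(tokens)  # word frequencies
        # Add frequencies of the segments that are repeated in the corpus.
        profile.update(s for s in segments_of(tokens, max_len) if s in kept)
        profiles.append(profile)
    return profiles


if __name__ == "__main__":
    responses = [
        "The cost of living is high.",
        "The cost of living keeps rising.",
        "Health matters most.",
    ]
    for profile in lexical_profiles(responses):
        print(dict(profile))
```

Stacking such profiles as rows, with one column per word or segment, gives the kind of sparse contingency table to which correspondence analysis or clustering could then be applied.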