Validation Techniques in Text Mining (with Application to the Processing of Open-ended Questions)

Clustering methods and principal axes techniques both play a major role in the computerized exploration of textual corpora. However, most of the outputs of these unsupervised procedures are difficult to assess. We focus on two issues: external validation, which involves external data and allows for classical statistical tests; and internal validation, which is based on resampling techniques such as the bootstrap and other Monte Carlo methods. In the domain of textual data, these techniques can efficiently tackle the difficult problem of the plurality of statistical units (words, lemmas, segments, sentences, respondents).
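
As an illustration of internal validation by resampling, the following is a minimal sketch, not the procedure of the paper itself. It assumes that the statistical unit is the respondent (one row of a respondents-by-words count table), resamples rows with replacement, and reports a percentile bootstrap interval for the share of variance carried by the first principal axis. The function names, the toy table, and the choice of statistic are all hypothetical.

```python
import numpy as np

def first_axis_share(X):
    """Share of total variance carried by the first principal axis of X."""
    Xc = X - X.mean(axis=0)                   # center each column (word)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    total = np.sum(s ** 2)
    return float(s[0] ** 2 / total) if total > 0 else 0.0

def bootstrap_interval(X, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling rows (assumed: respondents)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    shares = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # draw n rows with replacement
        shares[b] = first_axis_share(X[idx])
    lo, hi = np.quantile(shares, [alpha / 2, 1 - alpha / 2])
    return first_axis_share(X), float(lo), float(hi)

if __name__ == "__main__":
    # Toy table: 40 hypothetical respondents, counts of 6 hypothetical words.
    X = np.random.default_rng(1).poisson(2.0, size=(40, 6)).astype(float)
    estimate, lo, hi = bootstrap_interval(X)
    print(f"first-axis share: {estimate:.3f}, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Resampling a different unit (words, segments, sentences) would amount to changing what a row of the table represents; the sketch does not address how to choose among them.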
