Discrepancy Analysis of Complex Objects Using Dissimilarities

In this article we consider objects for which a matrix of pairwise dissimilarities is available, and we study how they relate to covariates. We focus on state sequences, for which pairwise dissimilarities can be obtained, for instance, from edit distances. The methods discussed nevertheless apply to any kind of object and any dissimilarity measure. We start with a generalization of the analysis of variance (ANOVA) to assess the link between complex objects (e.g. sequences) and a given categorical variable. The key is to show that the discrepancy among objects can be derived from the pairwise dissimilarities alone, which then makes it possible to identify the factors that most reduce this discrepancy. We present a general statistical test and introduce an original way of rendering the results for state sequences. We then generalize the method to the case of more than one factor and discuss its advantages and limitations, especially regarding interpretation. Finally, we introduce a new tree method for analyzing the discrepancy of complex objects that uses the former test as its splitting criterion. We demonstrate the scope of the methods through a study of the factors that most discriminate Swiss occupational trajectories. All methods presented are freely available in our TraMineR package for the R statistical environment.
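To make the idea of deriving discrepancy from pairwise dissimilarities concrete, here is a minimal sketch in Python. It is not the TraMineR implementation. It assumes that each dissimilarity plays the role of a squared distance, so that a sum of squares can be written as (1/n) times the sum of the dissimilarities over all pairs, and it uses a simple label-permutation test. The function names (`sum_of_squares`, `pseudo_f`, `permutation_p_value`) and the toy data are hypothetical and chosen for illustration only.

```python
import random

def sum_of_squares(d, idx):
    """Discrepancy of the objects in idx: (1/n) * sum of d[i][j] over pairs i < j.
    Assumption: d[i][j] is treated like a squared distance."""
    n = len(idx)
    if n == 0:
        return 0.0
    pair_sum = sum(d[i][j] for a, i in enumerate(idx) for j in idx[a + 1:])
    return pair_sum / n

def pseudo_f(d, groups):
    """Return (pseudo F, pseudo R2) for one categorical factor."""
    n = len(groups)
    labels = sorted(set(groups))
    m = len(labels)
    ss_total = sum_of_squares(d, list(range(n)))
    ss_within = sum(
        sum_of_squares(d, [i for i in range(n) if groups[i] == g])
        for g in labels
    )
    ss_between = ss_total - ss_within  # discrepancy explained by the factor
    f = (ss_between / (m - 1)) / (ss_within / (n - m))
    return f, ss_between / ss_total

def permutation_p_value(d, groups, n_perm=999, seed=0):
    """Share of label permutations whose pseudo F reaches the observed one."""
    rng = random.Random(seed)
    f_obs, _ = pseudo_f(d, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(d, shuffled)[0] >= f_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): squared differences between six numbers.
y = [0, 1, 2, 10, 11, 12]
d = [[(a - b) ** 2 for b in y] for a in y]
groups = ["a", "a", "a", "b", "b", "b"]

f, r2 = pseudo_f(d, groups)
p = permutation_p_value(d, groups)
print(f, r2, p)
```

With real sequence data, `d` would be replaced by a matrix of edit distances; the rest of the computation is unchanged, since it only ever reads the pairwise dissimilarities.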
