Archetypal Analysis With Missing Data: See All Samples by Looking at a Few Based on Extreme Profiles

Abstract: In this article, we propose several methodologies for handling missing or incomplete data in archetypal analysis (AA) and archetypoid analysis (ADA). AA seeks to find archetypes, which are convex combinations of data points, and to approximate the samples as mixtures of those archetypes. In ADA, the representative archetypal data belong to the sample, that is, they are actual data points. With the proposed procedures, missing data are neither discarded nor filled in beforehand by imputation, and the theoretical properties regarding the location of the archetypes are guaranteed, unlike in previous approaches. The new procedures adapt the AA algorithm either by considering the missing values in the computation of the solution or by skipping them. In the first case, the solutions of previous approaches are modified to fulfill the theory, and a new procedure is proposed in which the missing values are updated by the fitted values. In the second case, the procedure is based on estimating the dissimilarities between samples and projecting these dissimilarities into a new space, where AA or ADA is applied; those results are then used to provide a solution in the original space. A comparative analysis is carried out in a simulation study, with favorable results. The methodology is also applied to two real datasets: a well-known climate dataset and a global development dataset. We illustrate how these unsupervised methodologies allow complex data to be understood, even by nonexperts. Supplementary materials for this article are available online.
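To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the article's algorithm. It assumes a data matrix `X` with samples in rows, a row-stochastic matrix `B` that builds archetypes as convex combinations of data points, and a row-stochastic matrix `A` that mixes archetypes to approximate each sample. The function names, the toy data, the starting value for the missing entry, and the fixed choices of `A` and `B` are all hypothetical; a real fit would optimize `A` and `B` iteratively.

```python
import numpy as np


def archetypes(X, B):
    """Archetypes as convex combinations of the rows of X (Z = B X)."""
    return B @ X


def reconstruct(A, Z):
    """Approximate each sample as a mixture of the archetypes (X_hat = A Z)."""
    return A @ Z


def observed_rss(X, X_hat, observed):
    """Residual sum of squares restricted to the observed entries of X."""
    diff = np.where(observed, X - X_hat, 0.0)
    return float(np.sum(diff ** 2))


def refill_missing(X, X_hat, observed):
    """Keep observed entries and replace missing ones by the fitted values."""
    return np.where(observed, X, X_hat)


# Toy data (hypothetical): four points in the plane, one missing entry.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.5, np.nan]])
observed = ~np.isnan(X)

# Working copy with an arbitrary starting value in the missing cell.
X_work = np.where(observed, X, 0.25)

# Fixed coefficients for illustration only; every row lies on the simplex.
# Each row of B selects one actual data point, as archetypoids would.
B = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.3, 0.5, 0.2]])

Z = archetypes(X_work, B)                    # 3 archetypes in the plane
X_hat = reconstruct(A, Z)                    # fitted values for all entries
rss = observed_rss(X, X_hat, observed)       # error on observed cells only
X_work = refill_missing(X, X_hat, observed)  # missing cell takes fitted value
```

In this sketch the missing cell never contributes to the error, and after one pass it holds the fitted value rather than the arbitrary starting value. Repeating the fit and refill steps until the coefficients stabilize is one plausible reading of "updated by the fitted values", stated here as an assumption.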
