Robust multivariate analysis for mixed-type data: Novel algorithm and its practical application in socio-economic research

Abstract We propose a novel method and algorithm for analysing and clustering mixed-type data, using a hierarchical approach based on the Forward Search. Groups are identified by detecting similar trajectories, which are then linked to intuitive two-dimensional maps. The algorithm can use different distance measures for mixed-type data, such as Gower’s metric and Related metric scaling. A key feature of the algorithm is its ability to discard redundant information from a given set of variables. We illustrate its practical usefulness through two applications of high relevance for empirical economic research. The first compares different indicators of environmental policy stringency across countries. The second identifies clusters of countries based on their institutional characteristics.
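
To make the notion of a mixed-type distance concrete, the following is a minimal sketch of a Gower-style dissimilarity, assuming the common formulation in which each numeric variable contributes its range-normalised absolute difference, each categorical variable contributes a 0/1 mismatch, and the contributions are averaged with equal weights. The function name `gower_distance` and the toy data are illustrative assumptions, not taken from the paper, and the sketch does not handle missing values or reproduce the Forward Search procedure itself.

```python
import numpy as np


def gower_distance(numeric, categorical):
    """Pairwise Gower-style dissimilarity for mixed-type data (illustrative sketch).

    numeric:     (n, p) array-like of numeric variables
    categorical: (n, q) array-like of categorical labels
    Returns an (n, n) matrix with entries in [0, 1].
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)

    # Range of each numeric variable; constant columns get range 1
    # so they contribute zero instead of causing division by zero.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0

    # Range-normalised absolute differences, shape (n, n, p).
    num_diff = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Simple mismatch indicator for categorical variables, shape (n, n, q).
    cat_diff = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    # Equal-weight average over all p + q variables.
    n_vars = numeric.shape[1] + categorical.shape[1]
    return (num_diff.sum(axis=2) + cat_diff.sum(axis=2)) / n_vars


# Toy example (hypothetical data): two numeric and one categorical variable.
X_num = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_cat = [["a"], ["a"], ["b"]]
D = gower_distance(X_num, X_cat)
print(np.round(D, 3))
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any distance-based clustering or scaling routine. Other choices, such as unequal variable weights or a square-root transformation of the dissimilarity, are equally plausible and would change the values.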
