A Dimensionally Reduced Clustering Methodology for Heterogeneous Occupational Medicine Data Mining

Clustering is a set of statistical learning techniques that partition data into mutually heterogeneous groups of homogeneous observations, called clusters. It has been applied successfully in many fields, including medicine, biology, finance, and economics. In this paper, we introduce the notion of clustering in multifactorial data analysis problems. A case study is conducted on an occupational medicine problem, with the purpose of analyzing patterns in a population of 813 individuals. To reduce the dimensionality of the data set, we base our approach on Principal Component Analysis (PCA), the most commonly used statistical tool in factorial analysis. However, real-world problems, especially in medicine, often rely on heterogeneous measurements that mix qualitative and quantitative variables, whereas PCA only processes quantitative ones. Moreover, qualitative data can be regarded as originally unobservable quantitative responses that are usually binary-coded. Hence, we propose a new set of strategies that handle quantitative and qualitative data simultaneously. The principle of this approach is to project the qualitative variables onto the subspaces spanned by the quantitative ones. Subsequently, an optimal model is allocated to the resulting PCA-regressed subspaces.
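The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "projection" means an ordinary least-squares regression of the binary-coded qualitative variables on the retained PCA scores, and it uses synthetic data in place of the occupational medicine data set. All function and variable names (`pca_scores`, `one_hot`, `project_on_scores`, `exposed`/`unexposed`) are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Standardize the quantitative columns and return the leading PCA scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def one_hot(labels):
    """Binary-code one qualitative variable as an indicator matrix."""
    categories = np.unique(labels)
    indicators = (labels[:, None] == categories[None, :]).astype(float)
    return indicators, categories

def project_on_scores(G, F):
    """Least-squares projection of the centered indicator columns onto span(F).

    Assumption: this regression stands in for the paper's projection step.
    """
    Gc = G - G.mean(axis=0)
    coef, *_ = np.linalg.lstsq(F, Gc, rcond=None)
    return F @ coef

# Synthetic stand-in data (assumption: not the 813-individual data set).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))  # quantitative measurements
noisy = X[:, 0] + 0.5 * rng.normal(size=n)
labels = np.where(noisy > 0, "exposed", "unexposed")  # one qualitative variable

F = pca_scores(X, n_components=2)   # subspace spanned by quantitative variables
G, categories = one_hot(labels)     # binary coding of the qualitative variable
G_hat = project_on_scores(G, F)     # qualitative variable expressed in that subspace
mixed = np.hstack([F, G_hat])       # reduced representation passed on to clustering
```

Under these assumptions, `mixed` is a purely quantitative table that a clustering model could be fitted to; the model selection step mentioned in the abstract is left out of the sketch.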
