A three-way clustering approach for handling missing data using GTRS

Abstract: Clustering is an important data analysis task. It becomes challenging in the presence of uncertainty caused by incomplete, missing or corrupted data. A three-way approach has recently been introduced to deal with uncertainty in clustering due to missing values. The essential idea is to defer the decision whenever it is not clear whether an object should be included in a cluster. A key issue in the three-way approach is determining the thresholds that define the three types of decisions, namely, include an object in a cluster, exclude an object from a cluster, or defer the decision of inclusion or exclusion. Existing studies do not sufficiently address the determination of these thresholds and generally use fixed values. In this paper, we explore the use of the game-theoretic rough set (GTRS) model to handle this issue. In particular, a game is defined in which the thresholds are determined based on a tradeoff between the accuracy and the generality of clusters. The determined thresholds are then used to induce three-way decisions for clustering uncertain objects. Experimental results on four datasets from the UCI machine learning repository suggest that GTRS significantly improves generality while keeping similar levels of accuracy in comparison to other three-way and similar models.
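The three types of decisions described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how an object is evaluated against a cluster, so the code below assumes a hypothetical membership score in [0, 1] and a pair of thresholds, here called `alpha` and `beta` with `beta < alpha`. The function name, the score values and the threshold values are illustrative assumptions, not the paper's method; in the paper the thresholds would come from the GTRS game rather than being fixed by hand.

```python
from typing import Literal

Decision = Literal["include", "exclude", "defer"]


def three_way_decision(score: float, alpha: float, beta: float) -> Decision:
    """Map a membership score to one of three decisions.

    Assumption: `score` is a hypothetical evaluation of how strongly an
    object belongs to a cluster, scaled to [0, 1]. `alpha` and `beta` are
    the inclusion and exclusion thresholds, with beta < alpha.
    """
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if score >= alpha:
        return "include"   # enough evidence to put the object in the cluster
    if score <= beta:
        return "exclude"   # enough evidence to keep the object out
    return "defer"         # evidence is inconclusive, so delay the decision


# Illustrative scores for three objects with respect to one cluster.
scores = {"o1": 0.92, "o2": 0.55, "o3": 0.10}
decisions = {
    name: three_way_decision(s, alpha=0.7, beta=0.3)
    for name, s in scores.items()
}
print(decisions)
```

Moving `alpha` and `beta` closer together shrinks the deferment region, so more objects receive a definite decision at the risk of more mistakes, which is one way to read the accuracy versus generality tradeoff that the abstract mentions.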
