An effective discretization method for disposing high-dimensional data

Abstract: Feature discretization is an important preprocessing task for classification in data mining and machine learning, because many classification methods require every dimension of the training dataset to contain only discrete values. Most discretization methods concentrate on low-dimensional data. In this paper, we focus on discretizing high-dimensional data, which frequently exhibit nonlinear structure. First, we present a novel supervised dimension reduction algorithm that maps high-dimensional data into a low-dimensional space while preserving the intrinsic correlation structure of the original data. This algorithm overcomes a known deficiency: the geometric topology of the data is easily distorted when the data are unevenly distributed in the high-dimensional space. To the best of our knowledge, this is the first approach to address the discretization of high-dimensional nonlinear data with a dimension reduction technique. Second, we propose a supervised area-based chi-square discretization algorithm to discretize each continuous dimension in the low-dimensional space. This algorithm addresses a shortcoming of existing methods, which do not assess, from a probabilistic point of view, how likely each interval pair is to be merged. Finally, we conduct experiments to evaluate the proposed method. The results show that, compared with existing discretization methods, our method achieves higher classification accuracy and yields more concise knowledge of the data, especially for high-dimensional datasets. In addition, our discretization method has been successfully applied to computer vision and image classification.
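
To make the idea of chi-square interval merging concrete, the following is a minimal sketch of the classic bottom-up ChiMerge procedure (Kerber, 1992), which chi-square discretizers build on. It is not the area-based algorithm proposed in this paper, whose details the abstract does not give. The function names `chi2_adjacent` and `chimerge`, the `threshold` parameter, and the toy data are assumptions made for illustration only.

```python
from typing import Hashable, List, Sequence


def chi2_adjacent(a: Sequence[int], b: Sequence[int]) -> float:
    """Pearson chi-square statistic of the 2 x k table formed by the
    per-class counts of two neighbouring intervals."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        row_sum = sum(row)
        for j, observed in enumerate(row):
            col_sum = a[j] + b[j]
            expected = row_sum * col_sum / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (observed - expected) ** 2 / expected
    return stat


def chimerge(values: Sequence[float], labels: Sequence[Hashable],
             classes: Sequence[Hashable], threshold: float) -> List[float]:
    """Classic ChiMerge sketch (illustrative, not the paper's method).

    Starts with one interval per distinct value and repeatedly merges the
    neighbouring pair with the smallest chi-square statistic until every
    remaining pair exceeds `threshold`. Returns the lower bound of each
    final interval.
    """
    lowers = sorted(set(values))
    position = {v: i for i, v in enumerate(lowers)}
    class_index = {c: j for j, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in lowers]
    for v, y in zip(values, labels):
        counts[position[v]][class_index[y]] += 1

    while len(counts) > 1:
        stats = [chi2_adjacent(counts[i], counts[i + 1])
                 for i in range(len(counts) - 1)]
        best = min(range(len(stats)), key=stats.__getitem__)
        if stats[best] > threshold:
            break  # all neighbouring intervals differ significantly
        counts[best] = [x + y for x, y in zip(counts[best], counts[best + 1])]
        del counts[best + 1]
        del lowers[best + 1]
    return lowers


if __name__ == "__main__":
    # Toy data: the class changes between 3 and 4, so one cut point is expected.
    xs = [1, 2, 3, 4, 5, 6]
    ys = ["a", "a", "a", "b", "b", "b"]
    # 2.706 is the 0.90 chi-square critical value for 1 degree of freedom.
    print(chimerge(xs, ys, classes=["a", "b"], threshold=2.706))
```

In a pipeline of the kind the abstract describes, a routine like this would be applied to each continuous dimension after the data have been mapped into the low-dimensional space. The threshold controls how aggressively intervals are merged.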
