Correlation preserving discretization

Discretization is a crucial preprocessing primitive for a variety of data warehousing and mining tasks. In this article we present a novel PCA-based unsupervised algorithm for discretizing continuous attributes in multivariate datasets. The algorithm uses the underlying correlation structure of the dataset to derive the discrete intervals and ensures that the inherent correlations are preserved. The approach also extends easily to datasets containing missing values. We demonstrate its efficacy on real datasets and as a preprocessing step for both classification and frequent itemset mining tasks. We also show that the resulting intervals are meaningful and can uncover hidden patterns in the data.
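To make the idea of deriving intervals from the correlation structure concrete, the following is a minimal illustrative sketch, not the algorithm presented in the article. It assumes that cut points are chosen as equal-frequency quantiles along the leading principal components and then mapped back onto the original attribute axes; the function names `pca_cut_points` and `discretize`, the parameters `n_components` and `n_bins`, and the equal-frequency rule are all assumptions made for illustration.

```python
import numpy as np

def pca_cut_points(X, n_components=1, n_bins=3):
    """Hypothetical helper: per-attribute cut points derived from PCA.

    Assumption: cuts are equal-frequency quantiles along each retained
    principal component, mapped back to the original attribute axes.
    Expects a multivariate dataset (two or more columns).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Eigen-decomposition of the covariance matrix (columns are attributes).
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, top]        # shape (n_attributes, n_components)
    scores = centered @ components      # data projected onto the components
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    cuts = [[] for _ in range(X.shape[1])]
    for k in range(n_components):
        for c in np.quantile(scores[:, k], quantiles):
            # Point on the component axis, expressed in original coordinates.
            point = mean + c * components[:, k]
            for j, value in enumerate(point):
                cuts[j].append(value)
    return [np.unique(c) for c in cuts]

def discretize(X, cuts):
    """Replace each attribute value by the index of its interval."""
    X = np.asarray(X, dtype=float)
    return np.column_stack(
        [np.digitize(X[:, j], cuts[j]) for j in range(X.shape[1])]
    )
```

Because the cut points of all attributes come from the same points on a shared principal axis, strongly correlated attributes tend to fall into matching intervals, which is the intuition behind preserving correlations. An attribute with a near-zero loading on a component would receive cut points close to its mean under this sketch, so a practical version would need to handle that case.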
