Pattern discovery for large mixed-mode database

In business and industry today, large databases with mixed data types (continuous and categorical) are very common, and there is a great need to discover patterns in them for knowledge interpretation and understanding. In the past, for classification, this problem has been solved as a discrete-data problem by first discretizing the continuous data based on the class-attribute interdependence relationship. However, no proper solution exists so far when class information is unavailable. Hence, important pattern post-processing tasks such as pattern clustering and summarization cannot be applied to mixed-mode data. This paper presents a new method for solving the problem, based on two essential concepts. (1) Even when class information is absent, in a correlated dataset the attribute with the strongest interdependence with the others in the group can be used to drive the discretization of the continuous data. (2) For a large database, correlated attribute groups must first be obtained by attribute clustering before (1) can be applied. Based on (1) and (2), pattern discovery methods are developed for mixed-mode data. Extensive experiments using synthetic and real-world data were conducted to validate the usefulness and effectiveness of the proposed method.
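Concept (1) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: plain mutual information stands in for the interdependence measure, which the abstract does not specify; the driving attribute is chosen only among columns that are already categorical; and the continuous column is split by a single binary cut. The function names (`mutual_information`, `pick_driver`, `best_cut`) and the toy attribute group are hypothetical, and the group is assumed to have already been produced by the attribute clustering of concept (2).

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def pick_driver(columns):
    """Name of the categorical column with the largest total MI to the others.

    Assumption: summed MI is used as a stand-in for 'strongest interdependence'.
    """
    def total(name):
        return sum(mutual_information(columns[name], other)
                   for key, other in columns.items() if key != name)
    return max(columns, key=total)

def best_cut(values, driver):
    """Binary cut point on a continuous column that maximizes MI with the driver."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints,
               key=lambda t: mutual_information([v > t for v in values], driver))

# Hypothetical correlated attribute group (toy data, no class labels).
group = {
    "region":  ["n", "n", "s", "s", "e", "e", "w", "w"],
    "tier":    ["lo", "lo", "lo", "lo", "hi", "hi", "hi", "hi"],
    "channel": ["web", "web", "shop", "shop", "web", "web", "shop", "shop"],
}
income = [10, 12, 11, 13, 50, 52, 51, 53]  # continuous attribute in the same group

driver = pick_driver(group)              # attribute most interdependent with the rest
cut = best_cut(income, group[driver])    # discretize income relative to the driver
print(driver, cut)
```

In this toy group, `tier` and `channel` are each a function of `region` but independent of each other, so `region` takes the role a class label would play in class-dependent discretization.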
