Structural geography of the space of emerging patterns

Describing and capturing significant differences between two classes of data is an important topic in data mining and classification research. In this paper, we use emerging patterns to describe these differences. Such a pattern occurs with high frequency in one class of samples, its "home" class, but does not occur in the other class, so it can be regarded as a characteristic property of its home class. We call the collection of all such patterns a space. Outside the space lie the patterns that occur in both classes or in neither. Within the space, the most general and the most specific patterns bound all other patterns in a lossless, convex way. We decompose the space into a terrace of pattern plateaus according to pattern frequency. We use the most general patterns to construct accurate classifiers. We also use these patterns in the biomedical domain to suggest treatment plans that adjust the expression levels of certain genes so that patients can be cured.
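The notions in the abstract can be illustrated on a toy dataset. The following is a minimal brute-force sketch, not the mining algorithm of the paper: it assumes each sample is a set of items, treats "high frequency" as a hypothetical `min_count` threshold, and all function names and data are invented for illustration.

```python
from itertools import combinations

def count(pattern, samples):
    """Number of samples (sets of items) that contain every item of `pattern`."""
    return sum(pattern <= s for s in samples)

def pattern_space(home, other, min_count=1):
    """Enumerate patterns occurring at least `min_count` times in `home`
    and never in `other`. Exponential in the number of items: toy use only."""
    items = sorted(set().union(*home))
    space = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            if count(p, other) == 0 and count(p, home) >= min_count:
                space[p] = count(p, home)
    return space

def most_general(space):
    """Patterns that have no proper subset inside the space."""
    return {p for p in space if not any(q < p for q in space)}

def most_specific(space):
    """Patterns that have no proper superset inside the space."""
    return {p for p in space if not any(q > p for q in space)}

def plateaus(space):
    """Group the patterns of the space by their home-class count."""
    groups = {}
    for p, c in space.items():
        groups.setdefault(c, set()).add(p)
    return groups

# Hypothetical toy data: each sample is a set of items.
home = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
other = [{"a"}, {"b"}, {"c", "d"}]

space = pattern_space(home, other)
general = most_general(space)
specific = most_specific(space)
levels = plateaus(space)
```

On this toy data, every pattern in `space` contains some pattern of `general` and is contained in some pattern of `specific`, which is the sense in which the two boundaries describe the space; `levels` groups the same patterns by how often they occur in the home class.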
