Turning CARTwheels: an alternating algorithm for mining redescriptions

We present an unusual algorithm involving classification trees---CARTwheels---where two trees are grown in opposite directions so that they are joined at their leaves. This approach finds application in a new data mining task we formulate, called redescription mining. A redescription is a shift-of-vocabulary, or a different way of communicating information about a given subset of data; the goal of redescription mining is to find subsets of data that afford multiple descriptions. We highlight the importance of this problem in domains such as bioinformatics, which exhibit an underlying richness and diversity of data descriptors (e.g., genes can be studied in a variety of ways). CARTwheels exploits the duality between class partitions and path partitions in an induced classification tree to model and mine redescriptions. It helps integrate multiple forms of characterizing datasets, situates the knowledge gained from one dataset in the context of others, and harnesses high-level abstractions for uncovering cryptic and subtle features of data. Algorithm design decisions, implementation details, and experimental results are presented.
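The alternation described above can be sketched with off-the-shelf CART trees: one tree is grown over one descriptor vocabulary to match the current class partition, the partition its paths induce becomes the target for a tree grown over the other vocabulary, and the process repeats. This is a simplified illustration using scikit-learn's `DecisionTreeClassifier` as a stand-in for CART, with synthetic boolean descriptor matrices `X_a` and `X_b`; the actual CARTwheels algorithm joins the two trees at their leaves and uses purity-driven matching, which is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Two descriptor vocabularies over the same set of n objects
# (e.g., genes described by expression-based vs. annotation-based features).
X_a = rng.integers(0, 2, size=(n, 5))
X_b = rng.integers(0, 2, size=(n, 5))

# Start from an arbitrary partition of the objects.
labels = rng.integers(0, 2, size=n)
for _ in range(10):
    # Grow a shallow tree over vocabulary A to approximate the current partition;
    # the partition induced by its root-to-leaf paths becomes the new target.
    t_a = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_a, labels)
    labels = t_a.predict(X_a)
    # Grow a tree over vocabulary B that tries to re-express that partition.
    t_b = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_b, labels)
    labels = t_b.predict(X_b)

# Agreement between the two trees' positive regions gauges redescription
# quality; Jaccard similarity is one common choice of measure.
pos_a = t_a.predict(X_a) == 1
pos_b = t_b.predict(X_b) == 1
jaccard = (pos_a & pos_b).sum() / max(1, (pos_a | pos_b).sum())
```

On random data the alternation merely converges to some mutually expressible partition; on real data, pairs of paths with high Jaccard overlap are the candidate redescriptions.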
