Redescription Mining: Algorithms and Applications in Bioinformatics

Scientific data mining aims to extract useful knowledge from massive datasets curated through computational science efforts, e.g., in bioinformatics, cosmology, geographic sciences, and computational chemistry. In the recent past, we have witnessed major transformations of these applied sciences into data-driven endeavors. In particular, scientists are now faced with an overload of vocabularies for describing domain entities. These vocabularies offer alternative and mostly complementary (sometimes even contradictory) ways to organize information, and each provides a different perspective on the problem being studied. To further knowledge discovery, computational scientists need tools to help uniformly reason across vocabularies, integrate multiple forms of characterizing datasets, and situate knowledge gained from one study in terms of others. This dissertation defines a new pattern class called redescriptions that provides high-level capabilities for reasoning across domain vocabularies. A redescription is a shift of vocabulary, or a different way of communicating the same information; redescription mining finds concerted sets of objects that can be defined in (at least) two ways using given descriptors. We present the CARTwheels algorithm for mining redescriptions by exploiting equivalences of partitions induced by distinct descriptor classes, as well as applications of CARTwheels to several bioinformatics datasets. We then outline how we can build more complex data mining operations by cascading redescriptions to realize a story, leading to a new data mining capability called storytelling. Besides applications to characterizing gene sets, we showcase uses of storytelling on other datasets.
Finally, we extend the core CARTwheels algorithm in three ways: by introducing a theoretical framework, based on partitions, to systematically explore redescription space; by generalizing from mining redescriptions (and stories) within a single domain to relating descriptors across different domains, to support complex relational data mining scenarios; and by exploiting the structure of the underlying descriptor space to yield more effective algorithms for specific classes of datasets.
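
To make the notion of a redescription concrete, the following minimal sketch checks whether a set of objects defined by a descriptor in one vocabulary is also defined by a descriptor in another. It is not the CARTwheels algorithm, which works on partitions induced by descriptor classes; it is only a brute-force pairwise comparison. The use of the Jaccard coefficient as the agreement measure, the function names, the gene identifiers, and the descriptor names are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_redescriptions(descriptors_x, descriptors_y, threshold=1.0):
    """Return (name_x, name_y, score) for descriptor pairs whose object sets agree.

    Each argument maps a descriptor name to the set of objects it selects.
    A threshold of 1.0 keeps only exact redescriptions; a lower value
    admits approximate ones. This compares single descriptors pairwise
    and is an illustrative assumption, not the method of the dissertation.
    """
    results = []
    for name_x, set_x in descriptors_x.items():
        for name_y, set_y in descriptors_y.items():
            score = jaccard(set_x, set_y)
            if score >= threshold:
                results.append((name_x, name_y, score))
    # Strongest agreement first.
    return sorted(results, key=lambda r: -r[2])


# Hypothetical toy data: two vocabularies describing the same six genes.
expression = {
    "up_in_heat_shock": {"g1", "g2", "g3"},
    "down_in_desiccation": {"g4", "g5"},
}
annotation = {
    "stress_response": {"g1", "g2", "g3"},
    "membrane": {"g4", "g5", "g6"},
}

exact = find_redescriptions(expression, annotation)
approximate = find_redescriptions(expression, annotation, threshold=0.6)
```

In this toy setting, `exact` holds the single pair of descriptors that select identical gene sets, while `approximate` additionally admits the pair whose sets overlap in two of three genes. Cascading such pairs, where the second descriptor of one redescription is the first of the next, is one way to picture the storytelling idea described above.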
