Contrasting Subgroup Discovery

Subgroup discovery methods find interesting subsets of objects of a given class. Motivated by an application in bioinformatics, we first define a generalized subgroup discovery problem in which a subgroup is interesting if its members are characteristic for their class, even if the classes are not identical. We then refine this setting for the case where subsets of objects are contrasted, for example subsets that represent different time points or different phenotypes. We show that this allows finding subgroups of objects that could not be found with classical subgroup discovery. To find such subgroups, we propose an approach that consists of two subgroup discovery steps and an intermediate contrast set definition step. The approach is applicable in various application areas. One example is biology, where interesting subgroups of genes are sought using gene expression data. We address the problem of finding enriched gene sets that are specific to virus-infected samples at a specific time point or for a specific phenotype. We report experimental results on a time-series dataset for virus-infected Solanum tuberosum (potato) plants. The results on S. tuberosum's response to virus infection yielded new research hypotheses for plant biologists.
