Discriminative pattern mining and its applications in bioinformatics

Discriminative pattern mining is one of the most important techniques in data mining. This challenging task is concerned with finding a set of patterns that occur with disproportionate frequency across data sets with different class labels. Such patterns are of great value for group difference detection and classifier construction. Research on finding interesting discriminative patterns in class-labeled data is evolving rapidly, and many algorithms have been proposed specifically to address this problem. Discriminative pattern mining techniques have proven their value in biological data analysis. Typical applications in bioinformatics include phosphorylation motif discovery, differentially expressed gene identification, and discriminative genotype pattern detection. In this article, we present an overview of discriminative pattern mining and the corresponding methods, and then illustrate their application to these bioinformatics problems. Finally, we give a general discussion of potential challenges and future work for this task.
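
To make the notion of "disproportionate frequency" concrete, the following is a minimal sketch, not a method taken from the article. It assumes transactions are sets of items split into two classes, and it scores each candidate itemset by the absolute difference of its support in the two classes. The function names, the `min_diff` threshold, and the toy data are all hypothetical, and the brute-force enumeration is only practical for tiny inputs.

```python
from itertools import combinations


def support(pattern, transactions):
    """Fraction of transactions that contain every item in `pattern`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if pattern <= t) / len(transactions)


def discriminative_patterns(pos, neg, max_size=2, min_diff=0.5):
    """Enumerate itemsets of up to `max_size` items (brute force) and keep
    those whose support differs between the two classes by >= `min_diff`.
    Both parameters are illustrative assumptions, not values from the article."""
    items = sorted(set().union(*pos, *neg))
    results = []
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            s_pos = support(pattern, pos)
            s_neg = support(pattern, neg)
            if abs(s_pos - s_neg) >= min_diff:
                results.append((combo, s_pos, s_neg))
    return results


# Hypothetical toy data: each transaction is the set of features of one sample.
cases = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a", "b"], ["b", "c"])]
controls = [frozenset(t) for t in (["a"], ["c"], ["b", "c"], ["a", "c"])]

found = discriminative_patterns(cases, controls)
for combo, s_pos, s_neg in found:
    print(combo, s_pos, s_neg)
```

On this toy input the itemset `("a", "b")` is kept even though `a` alone is not, which illustrates why combinations of items, rather than single items, are the object of the search. Other discriminative measures, such as a support ratio, could be substituted for the absolute difference in the same loop.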
