A primer to frequent itemset mining for bioinformatics

Over the past two decades, pattern mining techniques have become an integral part of many bioinformatics solutions. Frequent itemset mining is a popular group of pattern mining techniques designed to identify elements that frequently co-occur. An archetypical example is the identification of products that often end up together in the same shopping basket in supermarket transactions. A number of algorithms have been developed to address variations of this computationally non-trivial problem. Frequent itemset mining techniques can efficiently capture the characteristics of (complex) data and summarize them succinctly. Owing to these and other properties, the techniques have proven their value in biological data analysis. Nevertheless, information about their bioinformatics applications remains scattered. In this primer, we introduce frequent itemset mining and its derived association rules for life scientists. We give an overview of various algorithms and illustrate how they can be used in several real-life bioinformatics application domains. We end with a discussion of the future potential and open challenges for frequent itemset mining in the life sciences.
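To make the shopping-basket example concrete, the following is a minimal sketch of a level-wise (Apriori-style) search for frequent itemsets, followed by the confidence of an association rule derived from them. The basket contents, the function names `frequent_itemsets` and `confidence`, and the support threshold are illustrative assumptions and are not taken from the primer or from any specific tool.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for each surviving candidate.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        result.update(current)
        k += 1
    return result


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: supp(A and C) / supp(A)."""
    a = frozenset(antecedent)
    return itemsets[a | frozenset(consequent)] / itemsets[a]


# Hypothetical supermarket transactions (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

found = frequent_itemsets(baskets, min_support=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)

print(confidence(found, {"beer"}, {"diapers"}))   # 1.0
print(confidence(found, {"diapers"}, {"beer"}))   # 0.75
```

In a biological setting the same sketch applies unchanged if each transaction is, for example, the set of annotations attached to one gene or the set of genes over-expressed in one sample. The pruning step relies on the fact that no itemset can be frequent unless all of its subsets are frequent, which is what keeps the search tractable on larger data.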
