Discovering Knowledge from Local Patterns in SAGE Data

ABSTRACT The discovery of biologically interpretable knowledge from gene expression data is a crucial issue. Current gene data analysis is often based on global approaches such as clustering. An alternative is to use local pattern mining techniques for global modeling and knowledge discovery. Nevertheless, moving from local patterns to models and knowledge is still a challenge because of the overwhelming number of local patterns, and their summarization remains an open issue. This chapter is an attempt to fulfill this need: thanks to recent progress in the constraint-based paradigm, it proposes three data mining methods that make local patterns usable, either by highlighting the most promising ones or by summarizing them. The ideas at the core of these processes are removing redundancy, integrating background knowledge, and recursive mining. This approach is effective and useful on large, real-world data: through a case study on SAGE gene expression data, we demonstrate that it allows new biological hypotheses with clinical applications to be generated.
