Incremental Fuzzy Mining of Gene Expression Data for Gene Function Prediction

Gene expression data obtained from DNA microarray technologies are typically noisy and very high-dimensional, owing to the complexity of the underlying biological processes, and this makes mining such data for gene function prediction very difficult. To tackle these difficulties, we propose an incremental fuzzy mining technique (IFM). By transforming quantitative expression values into linguistic terms, such as highly or lowly expressed, IFM can effectively capture heterogeneity in expression data for pattern discovery. It does so by using a fuzzy measure to determine whether interesting association patterns exist between the linguistic gene expression levels. Based on these patterns, IFM can make accurate gene function predictions in which each gene is allowed to belong to more than one functional class with different degrees of membership. The gene function prediction problem can be formulated as either a classification or a clustering problem, and IFM can be used either as a classification technique or together with existing clustering algorithms to improve the discovered cluster groupings for greater prediction accuracy. IFM is also incremental: the discovered patterns can be continually refined based only on newly collected data, without retraining on the whole dataset. For performance evaluation, IFM has been tested with real expression datasets on both classification and clustering tasks. Experimental results show that it can effectively uncover hidden patterns for accurate gene function prediction.
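
To make the idea of linguistic terms and incremental refinement concrete, the following is a minimal illustrative sketch, not the IFM procedure itself. The abstract does not specify the membership functions or the fuzzy measure, so the piecewise-linear memberships, the breakpoints `low` and `high`, the term names, and the `FuzzyCounts` helper are all assumptions introduced here for illustration. The sketch maps one expression value to degrees of membership in three terms and keeps running fuzzy co-occurrence sums that can be updated from new samples alone.

```python
from collections import defaultdict


def fuzzify(value, low=-1.0, high=1.0):
    """Map an expression value to membership degrees in three terms.

    Assumption: piecewise-linear memberships with hypothetical
    breakpoints `low` and `high`; the degrees always sum to 1.
    """
    mid = (low + high) / 2.0
    if value <= low:
        lo_deg, mid_deg, hi_deg = 1.0, 0.0, 0.0
    elif value >= high:
        lo_deg, mid_deg, hi_deg = 0.0, 0.0, 1.0
    elif value <= mid:
        t = (value - low) / (mid - low)
        lo_deg, mid_deg, hi_deg = 1.0 - t, t, 0.0
    else:
        t = (value - mid) / (high - mid)
        lo_deg, mid_deg, hi_deg = 0.0, 1.0 - t, t
    return {
        "lowly expressed": lo_deg,
        "moderately expressed": mid_deg,
        "highly expressed": hi_deg,
    }


class FuzzyCounts:
    """Hypothetical accumulator of fuzzy (term, class) co-occurrences.

    Each new sample adds its membership degrees to running sums, so
    earlier samples never need to be revisited.
    """

    def __init__(self):
        self.counts = defaultdict(float)
        self.n = 0

    def update(self, value, label):
        # Add this sample's degree for every term to the class's sums.
        for term, degree in fuzzify(value).items():
            self.counts[(term, label)] += degree
        self.n += 1


if __name__ == "__main__":
    fc = FuzzyCounts()
    fc.update(0.5, "class A")  # partly moderate, partly high
    fc.update(2.0, "class A")  # fully high
    print(dict(fc.counts), fc.n)
```

Because only sums are stored, a later batch of measurements can be folded in by calling `update` on the new samples; any association measure computed from these sums would then reflect both old and new data.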
