Finding Rule Groups to Classify High Dimensional Gene Expression Datasets

Microarray data provide quantitative information about the transcription profiles of cells. Machine learning methodology has increasingly attracted bioinformatics researchers as a way to analyze microarray datasets, and several machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets are extremely high-dimensional, so traditional machine learning methods cannot be applied to them effectively and efficiently. This paper proposes a robust algorithm that finds rule groups for classifying gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form the rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best-k dimensions (genes), those most discriminative for separating samples of different classes, and uses them to form rule groups for classifying expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.
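
The abstract does not say how a gene's discriminative power is scored or how the rule groups are searched, so the following is only a minimal sketch of what "the k most discriminative genes" could mean, not the paper's algorithm. It assumes each gene is scored on its own by the information gain of a median split; the function names (`entropy`, `info_gain`, `top_k_genes`) and the toy data are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain from splitting samples at the median expression value.

    Assumption: a single median threshold per gene; the paper may
    discretize expression values differently.
    """
    threshold = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    if not low or not high:
        return 0.0  # the split separates nothing
    n = len(labels)
    return (entropy(labels)
            - (len(low) / n) * entropy(low)
            - (len(high) / n) * entropy(high))


def top_k_genes(expression, labels, k):
    """Return the k gene names with the highest information gain.

    expression maps a gene name to its expression values, one per sample,
    in the same order as labels.
    """
    ranked = sorted(expression,
                    key=lambda g: info_gain(expression[g], labels),
                    reverse=True)
    return ranked[:k]


# Toy data: four samples, three genes (values are made up).
labels = ["normal", "normal", "tumor", "tumor"]
expression = {
    "g1": [1.0, 2.0, 9.0, 10.0],  # separates the two classes cleanly
    "g2": [5.0, 1.0, 6.0, 2.0],   # uninformative
    "g3": [3.0, 3.0, 3.0, 3.0],   # constant
}
best = top_k_genes(expression, labels, k=1)
```

Scoring every gene exhaustively avoids a heuristic choice for single genes, but it ignores interactions between genes, which a rule-group search would have to take into account.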
