Two Phase Integrated Rule based Model (TPC-IRBM) for Clustering of Gene Expression Data of CA1 Region of Rat Hippocampus

This paper proposes a semi-supervised clustering model, TPC-IRBM (Two Phase Clustering-Integrated Rule Based Model), for clustering large data sets such as gene expression data. TPC-IRBM clusters the gene expression data set in two phases and builds on four rule-based models: CRT, C5, CHAID and QUEST. In the first phase, 30% of the data (a proportion that may vary) is extracted and clustered with a suitable heuristic or neural-network-based clustering technique to prepare the training, testing and validation (TTV) data. The output of the first phase is used to build the four models and to generate, from each, a rule base fitted to the TTV data. The proposed model is then constructed by selecting the quality rules of the various models, using qualifying criteria for every cluster, and integrating them. The proposed model contains many more quality rules than CRT, C5, CHAID or QUEST individually, and its accuracy is better than that of these models. In some cases neural-network-based models perform slightly better, but at a very high cost in complexity for very large data sets.
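The rule selection and integration step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rule` structure, the confidence and support thresholds used as qualifying criteria, the example rules, and the highest-confidence tie-break in `predict` are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record; the fields are assumptions, not from the paper."""
    source: str                                   # model that produced the rule
    cluster: int                                  # cluster label the rule predicts
    condition: Callable[[Sequence[float]], bool]  # test applied to one record
    confidence: float                             # share of covered TTV records with this label
    support: int                                  # number of TTV records covered


def integrate_rules(
    rule_sets: Dict[str, List[Rule]],
    min_confidence: float = 0.8,   # assumed qualifying criterion
    min_support: int = 5,          # assumed qualifying criterion
) -> Dict[int, List[Rule]]:
    """Keep the rules of every model that pass the criteria, grouped per cluster."""
    integrated: Dict[int, List[Rule]] = {}
    for rules in rule_sets.values():
        for rule in rules:
            if rule.confidence >= min_confidence and rule.support >= min_support:
                integrated.setdefault(rule.cluster, []).append(rule)
    for rules in integrated.values():
        rules.sort(key=lambda r: r.confidence, reverse=True)
    return integrated


def predict(integrated: Dict[int, List[Rule]], record: Sequence[float]) -> Optional[int]:
    """Return the cluster of the most confident matching rule, or None if no rule fires."""
    best: Optional[Rule] = None
    for rules in integrated.values():
        for rule in rules:
            if rule.condition(record) and (best is None or rule.confidence > best.confidence):
                best = rule
    return None if best is None else best.cluster


# Made-up rules standing in for the rule bases fitted to the TTV data.
rule_sets = {
    "CRT": [
        Rule("CRT", 0, lambda x: x[0] <= 1.0, 0.95, 40),
        Rule("CRT", 1, lambda x: x[0] > 1.0, 0.70, 30),                  # fails confidence
    ],
    "C5": [Rule("C5", 1, lambda x: x[0] > 1.0 and x[1] > 0.5, 0.90, 12)],
    "CHAID": [Rule("CHAID", 2, lambda x: x[0] > 1.0 and x[1] <= 0.5, 0.85, 3)],  # fails support
    "QUEST": [Rule("QUEST", 2, lambda x: x[0] > 1.0 and x[1] <= 0.5, 0.88, 9)],
}

model = integrate_rules(rule_sets)
print({cluster: [r.source for r in rules] for cluster, rules in model.items()})
print(predict(model, [2.0, 0.9]))
```

In this toy setup the integrated rule base ends up with one qualifying rule per cluster, each contributed by a different model, which is the sense in which the integrated model can cover clusters better than any single rule-based model.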
