Fuzzy clustering with biological knowledge for gene selection

This paper presents an application of Fuzzy Clustering of Large Applications based on Randomized Search (FCLARANS) to attribute clustering and dimensionality reduction in gene expression data. Domain knowledge, in the form of gene ontology and differential gene expression, is employed in the process and helps automate the selection of biologically meaningful partitions. Gene ontology (GO) analysis helps detect biologically enriched and statistically significant clusters. Fold-change is measured to select differentially expressed genes as the representatives of these clusters. Tools such as the Eisen plot and cluster profiles help establish the coherence of these clusters. Important representative features (genes) are extracted from each enriched gene partition to form the reduced gene space. The reduced gene set forms a biologically meaningful attribute space and, at the same time, lowers the computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology on four publicly available microarray gene expression data sets.
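
The fold-change step described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes log2-scale expression values, two sample classes, and gene cluster assignments that are already available (for example, crisp labels derived from fuzzy memberships). The function names `log2_fold_change` and `select_representatives` and the parameter `top_k` are hypothetical and introduced only for illustration.

```python
import numpy as np

def log2_fold_change(expr, labels):
    """Per-gene difference of class means.

    expr   : (genes x samples) array, assumed to hold log2 expression values
    labels : length-samples sequence of 0/1 class labels (assumed two classes)
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    mean_class0 = expr[:, labels == 0].mean(axis=1)
    mean_class1 = expr[:, labels == 1].mean(axis=1)
    # On the log2 scale, a difference of means is the log2 fold-change.
    return mean_class1 - mean_class0

def select_representatives(expr, labels, gene_clusters, top_k=1):
    """Pick the top_k genes with the largest |log2 fold-change| in each cluster."""
    gene_clusters = np.asarray(gene_clusters)
    abs_lfc = np.abs(log2_fold_change(expr, labels))
    selected = []
    for cluster_id in np.unique(gene_clusters):
        members = np.flatnonzero(gene_clusters == cluster_id)
        # Sort the cluster's genes by decreasing absolute fold-change.
        ranked = members[np.argsort(-abs_lfc[members], kind="stable")]
        selected.extend(ranked[:top_k].tolist())
    return sorted(selected)

# Toy data: 4 genes x 4 samples, two samples per class, two gene clusters.
toy_expr = [
    [1.0, 1.0, 1.0, 1.0],   # gene 0: no change
    [1.0, 1.0, 3.0, 3.0],   # gene 1: up by 2 log2 units
    [2.0, 2.0, 2.5, 2.5],   # gene 2: up by 0.5 log2 units
    [5.0, 5.0, 1.0, 1.0],   # gene 3: down by 4 log2 units
]
toy_labels = [0, 0, 1, 1]
toy_clusters = [0, 0, 1, 1]
reduced_gene_set = select_representatives(toy_expr, toy_labels, toy_clusters)
```

Here one gene per cluster is kept, so the reduced gene space has as many attributes as there are clusters; raising `top_k` trades a larger attribute space for more within-cluster coverage.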
