Study On Clustering Techniques And Application To Microarray Gene Expression Bioinformatics Data

With the explosive growth in publicly available genomic data, a new field of computer science, bioinformatics, has emerged. It focuses on the use of computing systems to efficiently derive, store, and analyze the character strings of the genome in order to help solve problems in molecular biology. The flood of data from biology, mainly in the form of DNA, RNA, and protein sequences, places heavy demands on computers and computational scientists. At the same time, it demands a transformation of the basic ethos of the biological sciences. Data mining techniques can therefore be used to explore the hidden patterns underlying biological data. Unsupervised classification, also known as clustering, is one branch of data mining that can be applied to biological data, and it may help usher in an era of more rapid medical development and drug discovery. In the past decade, the advent of efficient genome sequencing tools has led to enormous progress in the life sciences. Among the most important innovations, microarray technology allows the expression of thousands of genes to be quantified simultaneously. These data differ from typical machine-learning and pattern-recognition data in several ways: they contain a fair amount of random noise and missing values, their dimension is in the range of thousands, and the sample size is only a few dozen. A particular application of microarray technology is in cancer research, where the goal is the precise and early detection of tumorous cells. The challenge for biologists and computer scientists is to provide solutions that deliver automation, quality, and efficiency.
