A machine learning approach for gene expression analysis and applications

High-throughput microarray technology is an important and revolutionary technique used in genomics and systems biology to analyze the expression of thousands of genes simultaneously. Its widespread use has produced enormous repositories of microarray data, for example the Gene Expression Omnibus (GEO), maintained by the National Center for Biotechnology Information (NCBI). However, an effective approach to exploiting these datasets in support of specific biological studies is still lacking. Specifically, an improved method is required to integrate data from multiple sources and to select only those datasets that match an investigator's interest. In addition, to exploit the full power of microarray data, an effective method is required to determine the relationships among genes in the selected datasets and to interpret the biological meaning behind these relationships. To address these requirements, we have developed a machine learning-based approach that includes:

• An effective meta-analysis method to integrate microarray data from multiple sources; the method exploits information about the biological context of interest provided by biologists.
• A novel and effective cluster analysis method to identify hidden patterns in the selected data that represent relationships between genes under the biological conditions of interest.
• A novel motif-finding method that discovers not only the common transcription factor binding sites of co-regulated genes but also the miRNA binding sites associated with the biological conditions.
• A machine learning-based framework for microarray data analysis, with a web application for running common analysis tasks online.
