Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering

A central problem in functional genomics is identifying the set of genes responsible for a particular cellular mechanism. This work explores a multi-objective optimization based genetic clustering technique that groups genes according to their functional similarity and biological relevance. Our contribution is two-fold. First, we develop a new quality measure for the goodness of gene clusters, the protein-protein interaction confidence score, which uses the confidence scores of protein-protein interaction networks to measure the similarity between the genes of a cluster with respect to their biochemical protein products. Second, we develop a multi-objective clustering approach that integrates microarray expression values with protein-protein interaction confidence scores to select genes that are both statistically and biologically relevant. To this end, two biological cluster validity indices (the biological homogeneity index and the protein-protein interaction confidence score) and two traditional internal cluster validity indices (the fuzzy partition coefficient and the Pakhira-Bandyopadhyay-Maulik index) are optimized simultaneously during clustering. Experimental results on three real-life gene expression datasets show that adding the new objective, which captures protein-protein interaction information, improves gene clustering compared with existing techniques. These observations are further supported by biological and statistical significance tests.
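To make the idea of simultaneously optimized objectives concrete, the following minimal sketch scores a candidate clustering on two of the four indices and compares candidates by Pareto dominance. The fuzzy partition coefficient follows Bezdek's standard definition. The abstract does not give the formula for the protein-protein interaction confidence score, so `ppi_cluster_score` is only an assumed stand-in (the mean pairwise confidence of gene pairs that share a cluster) and should not be read as the measure proposed in the paper. All function names, gene labels, and values are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence


def partition_coefficient(u: Sequence[Sequence[float]]) -> float:
    """Bezdek's fuzzy partition coefficient.

    u[i][k] is the membership of gene i in cluster k (each row sums to 1).
    The value is the mean over genes of the sum of squared memberships;
    it is larger for crisper partitions.
    """
    return sum(m * m for row in u for m in row) / len(u)


def ppi_cluster_score(
    clusters: Sequence[Sequence[str]],
    confidence: Dict[FrozenSet[str], float],
) -> float:
    """ASSUMED stand-in for a PPI-based cluster quality measure.

    Averages the interaction confidence over all gene pairs that share a
    cluster. Pairs with no recorded interaction contribute 0.
    """
    total, pairs = 0.0, 0
    for genes in clusters:
        for a, b in combinations(genes, 2):
            total += confidence.get(frozenset((a, b)), 0.0)
            pairs += 1
    return total / pairs if pairs else 0.0


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if objective vector a Pareto-dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Toy data: four genes, two clusters, made-up confidence scores.
membership = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]
confidence = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g3", "g4")): 0.6,
    frozenset(("g1", "g3")): 0.1,
}
candidate_a = [["g1", "g2"], ["g3", "g4"]]
candidate_b = [["g1", "g3"], ["g2", "g4"]]

pc = partition_coefficient(membership)
objectives_a = (pc, ppi_cluster_score(candidate_a, confidence))
objectives_b = (pc, ppi_cluster_score(candidate_b, confidence))
print(objectives_a, objectives_b, dominates(objectives_a, objectives_b))
```

In a multi-objective setting no single weighted sum is formed; candidates are instead compared through dominance as above, and the search retains the non-dominated set. The biological homogeneity index and the Pakhira-Bandyopadhyay-Maulik index are omitted here to keep the sketch short.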
