A Multiobjective Evolutionary Conceptual Clustering Methodology for Gene Annotation Within Structural Databases: A Case of Study on the Gene Ontology Database

Current tools and techniques for examining the content of large databases are often hampered by their inability to support searches based on criteria that are meaningful to their users. These shortcomings are particularly evident in data banks storing representations of structural data such as biological networks. Conceptual clustering techniques have proven appropriate for uncovering relationships between the features that characterize objects in structural data. However, typical conceptual clustering approaches recover the most obvious relations but fail to discover the less frequent, more informative associations underlying the data. The combination of evolutionary algorithms with multiobjective and multimodal optimization techniques constitutes a suitable tool for solving this problem. We propose a novel conceptual clustering methodology termed evolutionary multiobjective conceptual clustering (EMO-CC), which relies on the NSGA-II multiobjective (MO) genetic algorithm. We apply this methodology to identify conceptual models in structural databases generated from gene ontologies. These models can explain and predict phenotypes in the immunoinflammatory response problem, similar to those provided by gene expression or other genetic markers. The analysis of these results reveals that our approach uncovers cohesive clusters, even those comprising a small number of observations explained by several features, which allows objects and their interactions to be described from different perspectives and at different levels of detail.
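
As a rough illustration of why a multiobjective search can retain small but detailed clusters alongside large, obvious ones, the sketch below filters a set of candidate cluster descriptions down to its Pareto front. The two objectives used here (support, the number of observations covered, and specificity, the number of features in the description), the candidate names, and the values are all assumptions made for illustration; the abstract does not state the objectives EMO-CC optimizes. NSGA-II additionally ranks solutions into successive fronts and applies crowding-distance selection, which this sketch omits.

```python
from typing import List, Tuple

# Hypothetical candidate: (label, support, specificity).
# Both objectives are maximized in this illustration.
Candidate = Tuple[str, int, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one."""
    no_worse = a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up values, not results from the paper.
candidates: List[Candidate] = [
    ("broad", 120, 1),   # many observations, one feature
    ("medium", 40, 3),
    ("narrow", 8, 6),    # few observations, many features
    ("weak", 30, 2),     # dominated by "medium"
    ("tiny", 5, 4),      # dominated by "narrow"
]

front = pareto_front(candidates)
print([label for label, _, _ in front])
```

In this toy example the small, feature-rich "narrow" candidate survives next to the large "broad" one because neither dominates the other, whereas a single-objective ranking by support alone would discard it.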
