Aggregation of similarity measures for ortholog detection: validation with measures based on rough set theory

This paper presents a novel ortholog detection algorithm that aggregates several similarity measures characterizing the relationship between gene pairs of two genomes. The measures are based on the alignment score, the sequence lengths, membership in conserved regions, and the physicochemical profile of the proteins. The resulting bipartite similarity graph is clustered with the Markov clustering algorithm (MCL), and a new ortholog assignment policy is applied to the homology groups obtained from the clustering. The classification is validated on the Saccharomyces cerevisiae and Schizosaccharomyces pombe genomes against the ortholog list produced by the InParanoid 7.0 algorithm, using the Adjusted Rand Index (ARI) as an external measure. Additional validation measures based on rough set theory are applied to assess classification quality in the presence of class imbalance.
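As an illustration of the external validation step, the following is a minimal sketch of the Adjusted Rand Index in its standard pair-counting form (Hubert and Arabie). It is not the authors' implementation: the function name `adjusted_rand_index` and the toy labels are assumptions made for this example, with each position standing for one gene pair labelled as ortholog (1) or non-ortholog (0).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare, treat as perfect agreement

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Numbers of item pairs that are grouped together.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2        # largest attainable agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Toy data (invented): reference labels versus predicted labels for six gene pairs.
reference = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0]
ari = adjusted_rand_index(reference, predicted)
```

Because the index is corrected for chance, a classifier that labels every pair as the majority class (non-ortholog) scores close to zero rather than being rewarded for the imbalance.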
