Rough Sets in Ortholog Gene Detection - Selection of Feature Subsets and Case Reduction Considering Imbalance

Ortholog detection merits further improvement because orthologous genes are valuable for predicting protein function. The datasets of this binary classification problem (ortholog versus non-ortholog gene pairs) can be represented as information systems. We use an extended similarity relation over gene pairs, based on an extension of Rough Set Theory, with aggregated gene similarity measures as features, and we select feature subsets with the aid of quality measures that take class imbalance into account. The proposed procedure can be useful for datasets with few features and discrete parameters. The case reduction obtained from the approximations of the ortholog and non-ortholog concepts might be an effective way to cope with extremely high imbalance in supervised classification.
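To make the idea of approximating the ortholog concept more concrete, the following minimal sketch computes lower and upper approximations under a similarity relation and then reduces the cases. It is an illustration under stated assumptions, not the procedure of the paper: the closeness function (mean of `1 - |a - b|` over features scaled to [0, 1]), the `threshold` parameter, the function names, and the reduction rule (keep every ortholog pair plus the non-ortholog pairs in the boundary region) are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Set, Tuple


def similarity_classes(features: Sequence[Sequence[float]],
                       threshold: float) -> List[Set[int]]:
    """Return, for each object i, the set of objects similar to i.

    Assumption: features lie in [0, 1] and two objects are similar when the
    mean of (1 - |a - b|) over their features reaches the threshold.
    """
    n = len(features)
    classes = []
    for i in range(n):
        similar = set()
        for j in range(n):
            closeness = sum(
                1.0 - abs(a - b) for a, b in zip(features[i], features[j])
            ) / len(features[i])
            if closeness >= threshold:
                similar.add(j)
        classes.append(similar)
    return classes


def approximations(classes: List[Set[int]],
                   concept: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Lower and upper rough approximations of a concept.

    Lower: objects whose similarity class lies entirely inside the concept.
    Upper: objects whose similarity class intersects the concept.
    """
    lower = {i for i, cls in enumerate(classes) if cls <= concept}
    upper = {i for i, cls in enumerate(classes) if cls & concept}
    return lower, upper


def reduce_cases(features: Sequence[Sequence[float]],
                 labels: Sequence[int],
                 threshold: float) -> List[int]:
    """Hypothetical reduction rule: keep all minority (ortholog, label 1)
    cases and only those majority cases that fall in the boundary region."""
    classes = similarity_classes(features, threshold)
    orthologs = {i for i, y in enumerate(labels) if y == 1}
    lower, upper = approximations(classes, orthologs)
    boundary = upper - lower
    return sorted(orthologs | boundary)


# Toy data: each row is a gene pair described by two similarity measures.
features = [
    [0.90, 0.80], [0.85, 0.90], [0.60, 0.60],   # ortholog pairs
    [0.10, 0.20], [0.15, 0.10], [0.05, 0.10],   # clearly non-ortholog pairs
    [0.55, 0.65],                               # non-ortholog near an ortholog
]
labels = [1, 1, 1, 0, 0, 0, 0]

kept = reduce_cases(features, labels, threshold=0.9)
print(kept)
```

In this toy example the three non-ortholog pairs that are far from any ortholog are discarded, while the one non-ortholog pair that is indistinguishable from an ortholog under the similarity relation is retained, which lowers the imbalance ratio without removing the hard cases near the class boundary.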
