The OGCleaner: filtering false-positive homology clusters

Summary: Detecting homologous sequences across organisms is an essential step in protein structure and function prediction, gene annotation and phylogenetic tree construction. Heuristic methods are often employed for quality control of putative homology clusters. These heuristics, however, usually apply only to pairwise sequence comparisons and do not examine clusters as a whole. We present the Orthology Group Cleaner (the OGCleaner), a tool that classifies putative orthology groups as homology or non-homology clusters by considering all sequences in a cluster. The OGCleaner uses high-quality orthologous groups identified in OrthoDB to train machine learning algorithms that distinguish true-positive from false-positive homology groups. The package aims to improve the quality of phylogenetic tree construction, especially for lower-quality transcriptome assemblies.

Availability and Implementation: https://github.com/byucsl/ogcleaner

Contact: sfujimoto@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
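The idea of judging a cluster as a whole, rather than one sequence pair at a time, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three features (mean pairwise identity, minimum pairwise identity, gap fraction), the nearest-centroid classifier, the function names and the toy alignments are hypothetical and are not the OGCleaner's actual feature set, models or training data.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(a, b):
    """Fraction of aligned columns (gap-gap columns skipped) where a and b share a residue."""
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    return sum(x == y and x != "-" for x, y in cols) / len(cols)


def cluster_features(aligned):
    """Hypothetical cluster-level features computed over every sequence in the cluster."""
    identities = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    gap_fraction = sum(s.count("-") for s in aligned) / sum(len(s) for s in aligned)
    return (mean(identities), min(identities), gap_fraction)


def train(clusters, labels):
    """Nearest-centroid stand-in for a real learner: one mean feature vector per label."""
    feats = [cluster_features(c) for c in clusters]
    model = {}
    for label in set(labels):
        rows = [f for f, l in zip(feats, labels) if l == label]
        model[label] = tuple(mean(col) for col in zip(*rows))
    return model


def predict(model, cluster):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    f = cluster_features(cluster)
    return min(model, key=lambda lab: sum((x - y) ** 2 for x, y in zip(f, model[lab])))


# Toy, hand-made alignments standing in for curated positives and negatives.
train_clusters = [
    ["MKTAYIAK", "MKTAYIAR", "MKSAYIAK"],
    ["ACDEFGHI", "ACDEFGHL", "ACDQFGHI"],
    ["MKTAYIAK", "GW--PLCD", "QHN-EFVS"],
    ["ACDEFGHI", "WY-TRSNM", "LKP--QVC"],
]
train_labels = ["homology", "homology", "non-homology", "non-homology"]

model = train(train_clusters, train_labels)
conserved = ["MKVLAAGI", "MKVLAAGV", "MRVLAAGI"]
unrelated = ["MKVLAAGI", "PW-DERTS", "HNC--QFY"]
print(predict(model, conserved), predict(model, unrelated))
```

In practice the labelled positives would come from a curated source such as the OrthoDB groups mentioned in the summary, and the classifier would be a proper machine learning model rather than this centroid rule.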
