SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

Phylogenetic analyses commonly take multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy of alignment columns to identify outlier sequences in a manner responsive to the overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and SARS-CoV-2 (COVID-19) sequences.
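To make the central idea concrete, the following minimal Python sketch computes the Shannon entropy of each alignment column and uses it to weight a toy per-sequence disagreement score. It is an illustration only and is not SequenceBouncer's implementation: the function names, the `1 / (1 + H)` column weight, and the majority-character comparison are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(sequences):
    """Entropy of every column; all sequences must have the same aligned length."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy(col) for col in zip(*sequences)]


def disagreement_scores(sequences):
    """Toy outlier score (hypothetical, not the published method).

    For each column, a sequence that differs from the column's most common
    character is penalised by 1 / (1 + H), so that disagreement in conserved
    (low-entropy) columns counts more than in variable ones.
    """
    entropies = alignment_entropies(sequences)
    scores = [0.0] * len(sequences)
    for col, h in zip(zip(*sequences), entropies):
        majority = Counter(col).most_common(1)[0][0]
        weight = 1.0 / (1.0 + h)
        for i, char in enumerate(col):
            if char != majority:
                scores[i] += weight
    return scores


if __name__ == "__main__":
    toy_alignment = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TTTTTTTT"]
    scores = disagreement_scores(toy_alignment)
    worst = scores.index(max(scores))
    print("highest-scoring (most divergent) sequence index:", worst)
```

In this toy alignment the fourth sequence disagrees with the majority in six of eight columns and so receives the highest score. A real tool would also need to decide how to treat gaps and how to set a cutoff that adapts to the alignment, which this sketch does not attempt.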
