HiMAP2: Identifying phylogenetically informative genetic markers from diverse genomic resources

Multiplexed amplicon sequencing offers a cost‐effective and rapid solution for phylogenomic studies that include a large number of individuals. Selecting informative genetic markers is a critical first step in designing such multiplexed amplicon panels, but screening diverse genomic data sets and selecting markers that are informative for the question at hand can be laborious. Here, we present HiMAP2, a flexible and user‐friendly tool for identifying, visualizing and filtering phylogenetically informative loci from diverse genomic and transcriptomic resources. The bioinformatics pipeline comprises orthology prediction, exon extraction and filtering of aligned exon sequences according to user‐defined specifications. HiMAP2 also facilitates exploration of the final filtered exons by inferring individual exon trees with RAxML‐NG and estimating a species tree with ASTRAL. Finally, the results of marker selection can be visualized and refined in an interactive Bokeh application, which can also be used to generate publication‐quality figures. Source code and user instructions for HiMAP2 are available at https://github.com/popphylotools/HiMAP_v2.
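To make the filtering step more concrete, the following is a minimal sketch of how aligned exons might be screened against user-defined thresholds. It is not HiMAP2's implementation: the function names, the threshold parameters (`min_length`, `min_taxa`, `min_informative`) and the use of parsimony-informative sites as the measure of informativeness are all assumptions made for illustration, and the toy alignments are invented.

```python
from collections import Counter


def parsimony_informative_sites(seqs):
    """Count alignment columns in which at least two different bases
    each occur in at least two sequences (gaps and Ns are ignored)."""
    count = 0
    for column in zip(*seqs):
        tally = Counter(base.upper() for base in column)
        shared = [b for b, n in tally.items() if b in "ACGT" and n >= 2]
        if len(shared) >= 2:
            count += 1
    return count


def passes_filters(alignment, min_length, min_taxa, min_informative):
    """Return True if one aligned exon meets all hypothetical thresholds.

    `alignment` maps a taxon name to its aligned sequence.
    """
    seqs = list(alignment.values())
    if len(seqs) < min_taxa:
        return False
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences in an alignment must have equal length")
    if len(seqs[0]) < min_length:
        return False
    return parsimony_informative_sites(seqs) >= min_informative


def select_exons(exons, min_length=100, min_taxa=4, min_informative=1):
    """Return the names of exons that pass every filter, in input order."""
    return [
        name
        for name, alignment in exons.items()
        if passes_filters(alignment, min_length, min_taxa, min_informative)
    ]


# Toy input: two short, invented exon alignments for four taxa.
exons = {
    "exon_a": {"sp1": "ACGTAC", "sp2": "ACGTAC", "sp3": "ACTTAA", "sp4": "ACTTAA"},
    "exon_b": {"sp1": "ACGTAC", "sp2": "ACGTAC", "sp3": "ACGTAC", "sp4": "ACGTAT"},
}

kept = select_exons(exons, min_length=6, min_taxa=4, min_informative=1)
print(kept)
```

In this toy case only `exon_a` is retained, because `exon_b` varies at a single site where the alternative base occurs in just one taxon. A real panel design would likely combine several such criteria (for example length range, taxon coverage and sequence divergence), which is why the thresholds are exposed as parameters here.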
