Bioinformatics Protocols for Quickly Obtaining Large-Scale Data Sets for Phylogenetic Inferences

Analyzing all available genome datasets, rather than only the few that usually come from model species, can provide useful insight into the evolution of genes and gene families. Handling and transforming such datasets into the desired format for downstream analyses is, however, often a difficult and time-consuming task for researchers without a background in informatics. Therefore, we present two simple and fast protocols for data preparation, using an easy-to-install, open-source, cross-platform software application with a rich, user-friendly graphical user interface (SEDA; http://www.sing-group.org/seda/index.html). The first protocol is a substantial improvement over one recently published (López-Fernández et al. Practical applications of computational biology and bioinformatics, 12th International conference. Springer, Cham, pp 88–96 (2019)[1]), which was used to study the evolution of GULO, a gene that encodes the enzyme responsible for the last step of vitamin C synthesis. In this paper, we show how the sequence data file used for the phylogenetic analyses can now be obtained much faster by changing the way coding sequence isoforms are removed, using the newly implemented SEDA operation “Remove isoforms”. This protocol can be used to easily show that putative functional GULO genes are present in several Protostomian groups such as Molluscs, Priapulida and Arachnida. Such findings could have been easily missed if only a few Protostomian model species had been used. The second protocol allowed us to identify positively selected amino acid sites in a set of 19 primate HLA immunity genes. Interestingly, the proteins encoded by MHC class II genes can show just as many positively selected amino acid sites as those encoded by classical MHC class I genes. Although a significant percentage of codons (up to 14.8%) is evolving under positive selection, the main mode of evolution of HLA immunity genes is purifying selection.
When a large number of primate species is used, the probability of missing positively selected amino acid sites is lower. Both projects were completed in less than one week, and most of that time was spent running the analyses rather than preparing the files. These protocols can be easily adapted to answer many other questions using a phylogenetic approach.

[1] P. Hedrick. Pathogen Resistance and Genetic Variation at MHC Loci, 2002, Evolution; international journal of organic evolution.

[2] W. V. van Berkel, et al. Functional assignment of Glu386 and Arg388 in the active site of l‐galactono‐γ‐lactone dehydrogenase, 2009, FEBS letters.

[3] Silvia Maggini, et al. Immune-Enhancing Role of Vitamin C and Zinc and Effect on Clinical Conditions, 2006, Annals of Nutrition and Metabolism.

[4] J. Knight, et al. The human Major Histocompatibility Complex as a paradigm in genomics research, 2009, Briefings in functional genomics & proteomics.

[5] P. Roche, et al. The ins and outs of MHC class II-mediated antigen processing and presentation, 2015, Nature Reviews Immunology.

[6] H. Orr, et al. Human leukocyte antigen F (HLA-F). An expressed HLA gene composed of a class I coding sequence linked to a novel transcribed repetitive element, 1990, The Journal of experimental medicine.

[7] T. Lenz, et al. Divergent Allele Advantage at Human MHC Genes: Signatures of Past and Ongoing Selection, 2018, Molecular biology and evolution.

[8] Hugo López-Fernández, et al. Large Scale Analyses and Visualization of Adaptive Amino Acid Changes Projects, 2018, Interdisciplinary Sciences: Computational Life Sciences.

[9] Florentino Fernández Riverola, et al. A Bioinformatics Protocol for Quickly Creating Large-Scale Phylogenetic Trees, 2018, PACBB.

[10] Sudhir Kumar, et al. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets, 2016, Molecular biology and evolution.

[11] G. Drouin, et al. The Genetics of Vitamin C Loss in Vertebrates, 2011, Current genomics.

[12] S. Seifter, et al. The biochemical functions of ascorbic acid, 1986, Annual review of nutrition.

[13] L. Zhao, et al. HLA-E, HLA-F, and HLA-G polymorphism: genomic sequence defines haplotype structure and variation spanning the nonclassical class I genes, 2006, Immunogenetics.

[14] J. Klein, et al. MHC polymorphism and parasites, 1994, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[15] E. Hewitt. The MHC class I antigen presentation pathway: strategies for viral immune evasion, 2003, Immunology.

[16] Nuno A. Fonseca, et al. ADOPS - Automatic Detection Of Positively Selected Sites, 2012, Journal of integrative bioinformatics.

[17] J. Lykkesfeldt, et al. Does Vitamin C Deficiency Affect Cognitive Development and Function?, 2014, Nutrients.