Improved Core Genes Prediction for Constructing Well-Supported Phylogenetic Trees in Large Sets of Plant Species

Inferring well-supported phylogenetic trees that precisely reflect the evolutionary process is a challenging task that depends entirely on how the related core genes have been identified. In previous computational biology studies, many similarity-based algorithms, relying mainly on sequence alignment matrices, have been proposed to find them. In these approaches, a significantly high similarity score between two coding sequences extracted with a given annotation tool is taken to mean that they correspond to the same gene. In a previous article, we presented a quality test approach (QTA) that improves core gene quality by combining two annotation tools (namely NCBI, a partially human-curated database, and DOGMA, an efficient annotation algorithm for chloroplasts). This method takes advantage of both sequence similarity and gene features to guarantee that the core genome contains correct and well-clustered coding sequences (i.e., genes). In this article, we show how useful such well-defined core genes are for biomolecular phylogenetic reconstruction by investigating various subsets of core genes at the family or genus level, leading to subtrees with strong bootstrap support that are finally merged into a well-supported supertree.
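
As a rough illustration of what "core genes" means here, the sketch below computes the set of gene names shared by every genome in a collection. It is a deliberately simplified stand-in, not the QTA method: it matches genes by normalized name only, whereas the approach described above also uses sequence similarity and two annotation sources. The function name `core_genes`, the genome labels, and the toy gene lists are all assumptions made for this example.

```python
def core_genes(genomes):
    """Return the gene names present in every genome.

    `genomes` maps a genome identifier to an iterable of gene names.
    Names are compared case-insensitively after trimming whitespace,
    which is an assumption of this sketch, not a rule from the article.
    """
    gene_sets = [
        {name.strip().lower() for name in genes}
        for genes in genomes.values()
    ]
    if not gene_sets:
        return set()
    # The core genome is the intersection of all per-genome gene sets.
    return set.intersection(*gene_sets)


# Hypothetical toy annotations, for illustration only.
toy = {
    "genome_A": ["rbcL", "psbA", "atpB", "ndhF"],
    "genome_B": ["rbcL", "psbA", "atpB"],
    "genome_C": ["RBCL", "psbA", "matK", "atpB"],
}

print(sorted(core_genes(toy)))
```

Name-based matching like this breaks down as soon as two annotation tools label the same gene differently, which is one reason to combine similarity scores with gene features as the abstract describes.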
