GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics

We release GeneBase 1.1, a local tool with a graphical interface for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared with its predecessor, GeneBase 1.0, GeneBase 1.1 adds dynamic calculation and summarization (median, mean, standard deviation and total) of many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). These summaries can also be computed for any set of genes retrieved by a search for the desired characteristics, a functionality not provided by NCBI Gene itself. To show the potential of our tool for local parsing, structuring and dynamic summarization of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimates, which strongly underestimated the length of human genes, a ‘mean’ human protein-coding gene is 67 kbp long and has eleven exons of 309 bp each and ten introns of 6355 bp each. Median, mean and extreme values are provided for many other features, offering an updated reference source for human genome studies, data useful for setting the parameters of bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves. Database URL: http://apollo11.isto.unibo.it/software/
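
As an illustration only, the four summary statistics named above (median, mean, standard deviation, total) can be sketched in a few lines of Python. This is not GeneBase code: the `summarize` helper, the use of the sample standard deviation and the exon lengths are all assumptions made for the example. The last line checks that the quoted ‘mean’ gene figures are mutually consistent, since eleven exons of 309 bp plus ten introns of 6355 bp add up to roughly 67 kbp.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return median, mean, standard deviation and total for a list of numbers.

    Assumption: the sample standard deviation is used; the abstract does not
    say which variant GeneBase 1.1 reports.
    """
    return {
        "median": median(values),
        "mean": mean(values),
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "total": sum(values),
    }


# Hypothetical exon lengths in bp, made up for illustration (not real data).
exon_lengths = [120, 309, 95, 1500, 210]
summary = summarize(exon_lengths)

# Consistency check on the 'mean' gene described in the abstract:
# eleven 309 bp exons plus ten 6355 bp introns, in bp.
mean_gene_length = 11 * 309 + 10 * 6355

print(summary)
print(mean_gene_length)
```

The sum comes to 66949 bp, which agrees with the stated 67 kbp after rounding.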
