BIOINFORMATICS STRATEGIES FOR GENOMICS: EXAMPLES AND APPROACHES FOR TOMATO

My PhD is funded by the Solanaceae Pollen thermotolerance – Initial Training Network (SPOT-ITN) within the framework of the European Marie Curie Actions. The consortium aims to investigate fundamental and applied aspects of protecting pollen at increased environmental temperatures, deciphering the mechanisms underlying pollen development and its response to heat stress, starting from analyses on tomato. The findings are intended to serve as a guideline, and the procedures to be applicable to other plants in the future. In light of the SPOT-ITN project objectives, and to provide a comprehensive bioinformatics infrastructure supporting extensive genomics analyses in tomato, we collected, processed and integrated different resources, and organized them into dedicated databases with appropriate query interfaces. This effort required designing software able to reconcile the manifold resources from different levels of cellular information (genomics, transcriptomics, epigenomics), which is fundamental for data integration and analysis. Developing appropriate tools to mine the data from the “omics” approaches used to trace pollen development and the heat stress response was also necessary for the project. This thesis reports in detail the main efforts undertaken, and the analyses conducted on the basis of these resources with the strategies and approaches developed.
