What is a gene, post-ENCODE? History and updated definition.

While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century, from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: a gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition sidesteps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also shows how integral the concept of biological function is to defining genes.
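
The proposed definition can be read loosely as a grouping procedure, and the following minimal Python sketch illustrates that reading. It is not the authors' method. It assumes each final functional product is described by the genomic intervals that encode it (half-open coordinates on a single chromosome and strand), treats products that share at least one base as belonging to the same gene, and reports each gene as the union of its products' intervals. The function names, the product names, and the coordinates are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Half-open interval [start, end) on one chromosome and strand (an assumption).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Return the union of the intervals as a sorted, non-overlapping list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or directly adjacent: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps(a: List[Interval], b: List[Interval]) -> bool:
    """True if any interval in a shares at least one base with one in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def group_into_genes(
    products: Dict[str, List[Interval]]
) -> List[Tuple[Set[str], List[Interval]]]:
    """Group products whose encoding sequences overlap (transitively).

    Each result pairs a set of product names with the union of the
    genomic intervals encoding them. Regulatory regions are deliberately
    absent, mirroring their exclusion from the definition.
    """
    groups: List[Tuple[Set[str], List[Interval]]] = []
    for name, exons in products.items():
        names = {name}
        ivs = list(exons)
        kept = []
        for g_names, g_ivs in groups:
            if overlaps(g_ivs, exons):
                # The new product links to this group: absorb it.
                names |= g_names
                ivs += g_ivs
            else:
                kept.append((g_names, g_ivs))
        kept.append((names, merge_intervals(ivs)))
        groups = kept
    return groups


# Hypothetical products: two overlapping protein isoforms and one distant ncRNA.
example_products = {
    "protein_A": [(100, 200), (300, 400)],
    "protein_B": [(150, 250), (300, 350)],
    "ncRNA_C": [(1000, 1100)],
}
genes = group_into_genes(example_products)
```

Here the two overlapping isoforms collapse into one gene and the distant ncRNA forms its own. Sequence overlap is only a stand-in for the "coherent set" criterion, which in the definition also depends on biological function and cannot be decided from coordinates alone.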
