Gene Identification: Classical and Computational Intelligence Approaches

Automatic identification of genes has been an actively researched area of bioinformatics. Compared with earlier attempts at finding genes, recent techniques are significantly more accurate and reliable. Many current gene-finding methods employ computational intelligence techniques, which are known to be more robust when dealing with uncertainty and imprecision. In this paper, a detailed survey of the existing classical and computational-intelligence-based methods for gene identification is carried out. The survey briefly describes the classical and computational intelligence methods before discussing their applications to gene finding. In addition, a long list of available gene finders is compiled. For the convenience of readers, each entry in the list gives the corresponding web site and comments on the general approach adopted. An extensive bibliography is provided. Finally, some limitations of the current approaches and future directions are discussed.
