Enhanced homology searching through genome reading frame predetermination

MOTIVATION: Many bioinformatic approaches exist for finding novel genes within genomic sequence data. Homology search-based methods are often the first approach used to determine whether a novel gene exists that is similar to a known gene. Unfortunately, distantly related genes or motifs are often difficult to find using single query-based homology search algorithms against large sequence datasets such as the human genome. The motivation behind this work was therefore to develop an approach that enhances the sensitivity of traditional single query-based homology algorithms against genomic data without losing search selectivity.

RESULTS: We demonstrate that searching against a genome fragmented into all possible reading frames enhances the sensitivity of homology-based searches without degrading their selectivity. Using the ETS-domain, the bromodomain and the acetyl-CoA acetyltransferase gene as queries, we show that direct protein-protein searches using BLAST2P or FASTA3 against a human genome segmented among all possible reading frames and translated were substantially more sensitive than traditional protein-DNA searches against raw genomic sequence using an application such as TBLAST2N. Receiver operating characteristic analysis showed that the algorithms remained selective, while comparisons of the algorithms showed that the protein-protein searches were more sensitive in identifying hits. Therefore, through the overprediction of reading frames by this method and the increased sensitivity of protein-protein based homology search algorithms, a genome can be deeply mined, potentially finding hits overlooked by protein-DNA searches against raw genomic data.
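The idea of fragmenting a genome into all possible reading frames can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how the fragmentation and translation were implemented, so the function names (`six_frame_translations`, `protein_fragments`), the use of the standard genetic code, the splitting at stop codons and the `min_len` cut-off are all assumptions made here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard nuclear code is appropriate for the sequence being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate a DNA fragment in three forward and three reverse frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def protein_fragments(dna, min_len=1):
    """Yield (frame, peptide) pairs, splitting each frame at stop codons.

    Hypothetical helper: keeping every stop-free stretch deliberately
    overpredicts reading frames, which is the spirit of the approach.
    """
    for frame, protein in six_frame_translations(dna).items():
        for piece in protein.split("*"):
            if len(piece) >= min_len:
                yield frame, piece


if __name__ == "__main__":
    for frame, peptide in protein_fragments("ATGGCCTAA", min_len=2):
        print(frame, peptide)
```

The resulting peptides could be written out as a protein database and searched with a protein-protein tool, in place of a translated protein-DNA search against the raw sequence.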
