Data Mining for Bioinformatics

In this chapter, we discuss the analysis of biomedical, DNA, and protein data. We describe the major databases in detail, including nucleotide sequence, protein sequence, and gene expression databases. Making use of the data in these databases requires efficient software tools to retrieve data, compare biological sequences, discover patterns, and visualize the discovered knowledge. The most widely used of these tools are also covered.
