Automata Learning and Stochastic Modeling for Biosequence Analysis

[1]  Nathan Linial,et al.  ProtoMap: automatic classification of protein sequences and hierarchy of protein families , 2000, Nucleic Acids Res..

[2]  Hanah Margalit,et al.  PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites , 2001, Nucleic Acids Res..

[3]  Frans M. J. Willems,et al.  The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[4]  W. Pearson Comparison of methods for searching protein sequence databases , 1995, Protein science : a publication of the Protein Society.

[5]  Philip M. Lewis,et al.  The characteristic selection problem in recognition systems , 1962, IRE Trans. Inf. Theory.

[6]  Zukang Feng,et al.  The Protein Data Bank and structural genomics , 2003, Nucleic Acids Res..

[7]  W. Zander,et al.  The Hebrew University , 1998 .

[8]  Naftali Tishby,et al.  Markovian domain fingerprinting: statistical segmentation of protein sequences , 2001, Bioinform..

[9]  Alfred V. Aho,et al.  Efficient string matching , 1975, Commun. ACM.

[10]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[11]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[12]  C. Branden,et al.  Introduction to protein structure , 1991 .

[13]  K. Yoshida,et al.  Foldability of barnase mutants obtained by permutation of modules or secondary structure units. , 1999, Journal of molecular biology.

[14]  W. Taylor,et al.  The classification of amino acid conservation. , 1986, Journal of theoretical biology.

[15]  Gill Bejerano Efficient exact p-value computation and applications to biosequence analysis , 2003, RECOMB '03.

[16]  Yoshua Bengio,et al.  Markovian Models for Sequential Data , 2004 .

[17]  M. F.,et al.  Bibliography , 1985, Experimental Gerontology.

[18]  L Holm,et al.  Towards a covering set of protein family profiles. , 2000, Progress in biophysics and molecular biology.

[19]  R. Glockshuber,et al.  Random circular permutation of DsbA reveals segments that are essential for protein folding and stability. , 1999, Journal of molecular biology.

[20]  Alberto Apostolico,et al.  Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space , 2000, RECOMB '00.

[21]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[22]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[23]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[24]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[25]  Stefano Toppo,et al.  Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices , 2002, Bioinform..

[26]  P. Bork,et al.  Protein sequence motifs. , 1996, Current opinion in structural biology.

[27]  Naftali Tishby,et al.  Discriminative Feature Selection via Multiclass Variable Memory Markov Model , 2002, EURASIP J. Adv. Signal Process..

[28]  Shmuel Pietrokovski,et al.  Increased coverage of protein families with the Blocks Database servers , 2000, Nucleic Acids Res..

[29]  Ron Unger,et al.  Swaps in protein sequences , 2002, Proteins.

[30]  David R. Gilbert,et al.  Approaches to the Automatic Discovery of Patterns in Biosequences , 1998, J. Comput. Biol..

[31]  Michael Sipser,et al.  Inference and minimization of hidden Markov chains , 1994, COLT '94.

[32]  Naftali Tishby,et al.  Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources , 2001, ICML.

[33]  Arne Elofsson,et al.  A comparison of sequence and structure protein domain families as a basis for structural genomics , 1999, Bioinform..

[34]  Sung-Hou Kim,et al.  Electron transfer by domain movement in cytochrome bc1 , 1998, Nature.

[35]  Vineet Bafna,et al.  Pattern Matching Algorithms , 1997 .

[36]  E T Stuart,et al.  Mammalian Pax genes. , 1994, Annual review of genetics.

[37]  Pierre Dupont,et al.  Improved Smoothing for Probabilistic Suffix Trees Seen as Variable Order Markov Chains , 2002, ECML.

[38]  Anton J. Enright,et al.  GeneRAGE: a robust algorithm for sequence clustering and domain detection , 2000, Bioinform..

[39]  D. Eisenberg,et al.  Computational methods of analysis of protein-protein interactions. , 2003, Current opinion in structural biology.

[40]  Andrew McCallum,et al.  Reinforcement learning with selective perception and hidden state , 1996 .

[41]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[42]  Alex Bateman,et al.  The InterPro Database, 2003 brings increased coverage and new features , 2003, Nucleic Acids Res..

[43]  Raphail E. Krichevsky,et al.  The performance of universal encoding , 1981, IEEE Trans. Inf. Theory.

[44]  Jorma Rissanen,et al.  The Minimum Description Length Principle in Coding and Modeling , 1998, IEEE Trans. Inf. Theory.

[45]  Daniel Povey,et al.  Large scale discriminative training for speech recognition , 2000 .

[46]  N. Wicker,et al.  Secator: a program for inferring protein subfamilies from phylogenetic trees. , 2001, Molecular biology and evolution.

[47]  Tim J. P. Hubbard,et al.  SCOP database in 2002: refinements accommodate structural genomics , 2002, Nucleic Acids Res..

[48]  A. Valencia,et al.  Automatic methods for predicting functionally important residues. , 2003, Journal of molecular biology.

[49]  Thomas G. Dietterich,et al.  Learning with Many Irrelevant Features , 1991, AAAI.

[50]  Anton J. Enright,et al.  Protein interaction maps for complete genomes based on gene fusion events , 1999, Nature.

[51]  Burkhard Rost,et al.  Domains, motifs and clusters in the protein universe. , 2003, Current opinion in chemical biology.

[52]  Andreas Stolcke,et al.  Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[53]  S. Salzberg,et al.  Interpolated Markov models for eukaryotic gene finding. , 1999, Genomics.

[54]  Naoki Abe,et al.  On the computational complexity of approximating distributions by probabilistic automata , 1990, Machine Learning.

[55]  M. A. Basharov Cotranslational Folding of Proteins , 2004, Biochemistry (Moscow).

[56]  Jérôme Gouzy,et al.  Whole Genome Protein Domain Analysis using a New Method for Domain Clustering , 1999, Comput. Chem..

[57]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[58]  Golan Yona,et al.  Modeling protein families using probabilistic suffix trees , 1999, RECOMB.

[59]  G. F. Hughes,et al.  On the mean accuracy of statistical pattern recognizers , 1968, IEEE Trans. Inf. Theory.

[60]  Owen White,et al.  The TIGRFAMs database of protein families , 2003, Nucleic Acids Res..

[61]  A. Fedorov,et al.  Contribution of cotranslational folding to the rate of formation of native protein structure. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[62]  S. Altschul Amino acid substitution matrices from an information theoretic perspective , 1991, Journal of Molecular Biology.

[63]  A. Fersht,et al.  Folding of circular and permuted chymotrypsin inhibitor 2: retention of the folding nucleus. , 1998, Biochemistry.

[64]  Chris Sander,et al.  Touring protein fold space with Dali/FSSP , 1998, Nucleic Acids Res..

[65]  H. Margalit,et al.  Novel small RNA-encoding genes in the intergenic regions of Escherichia coli , 2001, Current Biology.

[66]  C Kulikowski,et al.  Automatic discovery of sub-molecular sequence domains in multi-aligned sequences: a dynamic programming algorithm for multiple alignment segmentation. , 2000, Journal of theoretical biology.

[67]  David L. Eaton,et al.  Glutathione S‐transferases: Amino acid sequence comparison, classification and phylogenetic relationship , 1992 .

[68]  J. Thompson,et al.  Multiple sequence alignment with Clustal X. , 1998, Trends in biochemical sciences.

[69]  Cathy H. Wu,et al.  iProClass: an integrated database of protein family, function and structure information , 2003, Nucleic Acids Res..

[70]  J. Thompson,et al.  Using CLUSTAL for multiple sequence alignments. , 1996, Methods in enzymology.

[71]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[72]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[73]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[74]  A. C. May,et al.  Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies , 2002, Protein science : a publication of the Protein Society.

[75]  R. A. George,et al.  Protein domain identification and improved sequence similarity searching using PSI‐BLAST , 2002, Proteins.

[76]  Yoram Singer,et al.  The Hierarchical Hidden Markov Model: Analysis and Applications , 1998, Machine Learning.

[77]  Jiye Shi,et al.  HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families , 2001, Bioinform..

[78]  P. Bork,et al.  Protein domain analysis in the era of complete genomes , 2002, FEBS letters.

[79]  C. Ponting,et al.  On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? , 2001, Journal of structural biology.

[80]  C Ouzounis,et al.  Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins , 1999, Proteins.

[81]  Jeffrey E. F. Friedl Mastering Regular Expressions , 1997 .

[82]  Burkhard Rost,et al.  Target space for structural genomics revisited , 2002, Bioinform..

[83]  Padhraic Smyth,et al.  Decision tree design from a communication theory standpoint , 1988, IEEE Trans. Inf. Theory.

[84]  Khar Heng Choo,et al.  Recent Applications of Hidden Markov Models in Computational Biology , 2004 .

[85]  Walter R. Gilks,et al.  Modeling the percolation of annotation errors in a database of protein sequences , 2002, Bioinform..

[86]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[87]  Gill Bejerano Algorithms for variable length Markov chain modeling , 2004, Bioinform..

[88]  William M. Campbell,et al.  Mutual Information in Learning Feature Transformations , 2000, ICML.

[89]  Jérôme Gouzy,et al.  ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons , 2000, Nucleic Acids Res..

[90]  Sean R. Eddy,et al.  HMMER User's Guide - Biological sequence analysis using profile hidden Markov models , 1998 .

[91]  P Argos,et al.  DOMO: a new database of aligned protein domains. , 1998, Trends in biochemical sciences.

[92]  J. Parker Amino Acid Substitution , 2001 .

[93]  Golan Yona,et al.  Variations on probabilistic suffix trees: statistical modeling and prediction of protein families , 2001, Bioinform..

[94]  Sam Griffiths-Jones,et al.  The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs , 2002, Bioinform..

[95]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymers , 1994, ISMB.

[96]  Stephen H. Bryant,et al.  Domain size distributions can predict domain boundaries , 2000, Bioinform..

[97]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[98]  Ran El-Yaniv,et al.  Agnostic Classification of Markovian Sequences , 1997, NIPS.

[99]  Ori Sasson,et al.  ProtoNet: hierarchical classification of the protein space , 2003, Nucleic Acids Res..

[100]  M. Grossmann,et al.  G Protein-coupled Receptors , 1998, The Journal of Biological Chemistry.

[101]  Amos Bairoch,et al.  PROSITE: A Documented Database Using Patterns and Profiles as Motif Descriptors , 2002, Briefings Bioinform..

[102]  R. A. George,et al.  Snapdragon: a Method to Delineate Protein Structural Domains from Sequence Data , 2002 .

[103]  A. Valencia,et al.  Correlated mutations contain information about protein-protein interaction. , 1997, Journal of molecular biology.

[104]  James E. Bray,et al.  The CATH database: an extended protein family resource for structural and functional genomics , 2003, Nucleic Acids Res..

[105]  C. Reynolds,et al.  Correlated mutations amongst the external residues of G-protein coupled receptors. , 1997, Biochemical Society transactions.

[106]  C. Mcwherter,et al.  Circular permutation of granulocyte colony-stimulating factor. , 1999, Biochemistry.

[107]  Eleazar Eskin,et al.  Protein Family Classification Using Sparse Markov Transducers , 2000, ISMB.

[108]  C. Chothia,et al.  The geometry of domain combination in proteins. , 2002, Journal of molecular biology.

[109]  H A Scheraga,et al.  Lattice neural network minimization. Application of neural network optimization for locating the global-minimum conformations of proteins. , 1993, Journal of molecular biology.

[110]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[111]  David Haussler,et al.  What Size Net Gives Valid Generalization? , 1989, Neural Computation.

[112]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[113]  David Haussler,et al.  The Smallest Automaton Recognizing the Subwords of a Text , 1985, Theor. Comput. Sci..

[114]  R. Russell,et al.  Analysis and prediction of functional sub-types from protein sequence alignments. , 2000, Journal of molecular biology.

[115]  J B Hurley,et al.  Two amino acid substitutions convert a guanylyl cyclase, RetGC-1, into an adenylyl cyclase. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[116]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[117]  J. Moody,et al.  Feature Selection Based on Joint Mutual Information , 1999 .

[118]  W R Pearson,et al.  Flexible sequence similarity searching with the FASTA3 program package. , 2000, Methods in molecular biology.

[119]  Shlomo Dubnov,et al.  Using Machine-Learning Methods for Musical Style Modeling , 2003, Computer.

[120]  Ronitt Rubinfeld,et al.  On the learnability of discrete distributions , 1994, STOC '94.

[121]  P. Bühlmann,et al.  Variable Length Markov Chains: Methodology, Computing, and Software , 2004 .

[122]  J A Epstein,et al.  Crystal structure of the human Pax6 paired domain-DNA complex reveals specific roles for the linker region and carboxy-terminal subdomain in DNA binding. , 1999, Genes & development.

[123]  Nir Friedman,et al.  A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites , 2001, WABI.

[124]  Naftali Tishby,et al.  Efficient Exact p-Value Computation for Small Sample, Sparse, and Surprising Categorical Data , 2004, J. Comput. Biol..

[125]  Susan T. Dumais,et al.  Inductive learning algorithms and representations for text categorization , 1998, CIKM '98.

[126]  Martin Vingron,et al.  The SYSTERS protein sequence cluster set , 2000, Nucleic Acids Res..

[127]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[128]  J Roca,et al.  The mechanisms of DNA topoisomerases. , 1995, Trends in biochemical sciences.

[129]  Peter Bühlmann,et al.  Model Selection for Variable Length Markov Chains and Tuning the Context Algorithm , 2000 .

[130]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[131]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[132]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[133]  J. Hayes,et al.  The glutathione S-transferase supergene family: regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance. , 1995, Critical reviews in biochemistry and molecular biology.

[134]  M. Gerstein,et al.  Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. , 2000, Journal of molecular biology.

[135]  Esko Ukkonen,et al.  On-line construction of suffix trees , 1995, Algorithmica.

[136]  James E. Johnson,et al.  MetaFam: a unified classification of protein families. II. Schema and query capabilities , 2001, Bioinform..

[137]  D T Jones,et al.  A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. , 1999, Structure.

[138]  Terri K. Attwood,et al.  PRINTS and its automatic supplement, prePRINTS , 2003, Nucleic Acids Res..

[139]  Stefano Lonardi,et al.  Efficient Detection of Unusual Words , 2000, J. Comput. Biol..

[140]  Peer Bork,et al.  Recent improvements to the SMART domain-based sequence annotation resource , 2002, Nucleic Acids Res..

[141]  Dana Angluin,et al.  Learning Markov chains with variable memory length from noisy output , 1997, COLT '97.

[142]  Liisa Holm,et al.  Picasso: generating a covering set of protein family profiles , 2001, Bioinform..

[143]  Roberto Battiti,et al.  Using mutual information for selecting features in supervised neural net learning , 1994, IEEE Trans. Neural Networks.

[144]  G. Barrows,et al.  A mutual information measure for feature selection with application to pulse classification , 1996, Proceedings of Third International Symposium on Time-Frequency and Time-Scale Analysis (TFTS-96).

[145]  Mikhail A. Roytberg,et al.  Segmentation of long genomic sequences into domains with homogeneous composition with BASIO software , 2001, Bioinform..

[146]  Rolf Apweiler,et al.  Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters , 2003, Nucleic Acids Res..

[147]  Jérôme Gracy,et al.  Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment , 1998, Bioinform..

[148]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[149]  Renato De Mori,et al.  High-performance connected digit recognition using maximum mutual information estimation , 1994, IEEE Trans. Speech Audio Process..

[150]  Jorja G. Henikoff,et al.  Using substitution probabilities to improve position-specific scoring matrices , 1996, Comput. Appl. Biosci..

[151]  Anders Krogh,et al.  SAM: Sequence Alignment and Modeling Software System , 1995 .

[152]  Jorma Rissanen,et al.  A universal data compression system , 1983, IEEE Trans. Inf. Theory.

[153]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[154]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[155]  Imre Csiszár,et al.  On the computation of rate-distortion functions (Corresp.) , 1974, IEEE Trans. Inf. Theory.

[156]  D. Eisenberg,et al.  Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.

[157]  A. Mees,et al.  Context-tree modeling of observed symbolic dynamics. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[158]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[159]  William Noble Grundy,et al.  Meta-MEME: motif-based hidden Markov models of protein families , 1997, Comput. Appl. Biosci..

[160]  Jorma Rissanen,et al.  Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[161]  Dana Ron,et al.  Learning to model sequences generated by switching distributions , 1995, COLT '95.

[162]  L. Wu,et al.  Autonomous protein folding units. , 2000, Advances in protein chemistry.

[163]  Golan Yona,et al.  Towards a Complete Map of the Protein Space Based on a Unified Sequence and Structure Analysis of All Known Proteins , 2000, ISMB.