Paper information - Automata Learning and Stochastic Modeling for Biosequence Analysis - 字舞流文

Automata Learning and Stochastic Modeling for Biosequence Analysis

Gill Bejerano

[1] Nathan Linial,et al. ProtoMap: automatic classification of protein sequences and hierarchy of protein families , 2000, Nucleic Acids Res..

[2] Hanah Margalit,et al. PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites , 2001, Nucleic Acids Res..

[3] Frans M. J. Willems,et al. The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[4] W. Pearson. Comparison of methods for searching protein sequence databases , 1995, Protein science : a publication of the Protein Society.

[5] Philip M. Lewis,et al. The characteristic selection problem in recognition systems , 1962, IRE Trans. Inf. Theory.

[6] Zukang Feng,et al. The Protein Data Bank and structural genomics , 2003, Nucleic Acids Res..

[7] W. Zander,et al. The Hebrew University , 1998 .

[8] Naftali Tishby,et al. Markovian domain fingerprinting: statistical segmentation of protein sequences , 2001, Bioinform..

[9] Alfred V. Aho,et al. Efficient string matching , 1975, Commun. ACM.

[10] Stanley F. Chen,et al. An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[11] Dana Ron,et al. The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[12] C. Branden,et al. Introduction to protein structure , 1991 .

[13] K. Yoshida,et al. Foldability of barnase mutants obtained by permutation of modules or secondary structure units. , 1999, Journal of molecular biology.

[14] W. Taylor,et al. The classification of amino acid conservation. , 1986, Journal of theoretical biology.

[15] Gill Bejerano. Efficient exact p-value computation and applications to biosequence analysis , 2003, RECOMB '03.

[16] Yoshua Bengio,et al. Markovian Models for Sequential Data , 2004 .

[17] M. F.,et al. Bibliography , 1985, Experimental Gerontology.

[18] L Holm,et al. Towards a covering set of protein family profiles. , 2000, Progress in biophysics and molecular biology.

[19] R. Glockshuber,et al. Random circular permutation of DsbA reveals segments that are essential for protein folding and stability. , 1999, Journal of molecular biology.

[20] Alberto Apostolico,et al. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space , 2000, RECOMB '00.

[21] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[22] Maria Jesus Martin,et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[23] Sean R. Eddy,et al. Profile hidden Markov models , 1998, Bioinform..

[24] Anton J. Enright,et al. An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[25] Stefano Toppo,et al. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices , 2002, Bioinform..

[26] P. Bork,et al. Protein sequence motifs. , 1996, Current opinion in structural biology.

[27] Naftali Tishby,et al. Discriminative Feature Selection via Multiclass Variable Memory Markov Model , 2002, EURASIP J. Adv. Signal Process..

[28] Shmuel Pietrokovski,et al. Increased coverage of protein families with the Blocks Database servers , 2000, Nucleic Acids Res..

[29] Ron Unger,et al. Swaps in protein sequences , 2002, Proteins.

[30] David R. Gilbert,et al. Approaches to the Automatic Discovery of Patterns in Biosequences , 1998, J. Comput. Biol..

[31] Michael Sipser,et al. Inference and minimization of hidden Markov chains , 1994, COLT '94.

[32] Naftali Tishby,et al. Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources , 2001, ICML.

[33] Arne Elofsson,et al. A comparison of sequence and structure protein domain families as a basis for structural genomics , 1999, Bioinform..

[34] Sung-Hou Kim,et al. Electron transfer by domain movement in cytochrome bc1 , 1998, Nature.

[35] Vineet Bafna,et al. Pattern Matching Algorithms , 1997 .

[36] E T Stuart,et al. Mammalian Pax genes. , 1994, Annual review of genetics.

[37] Pierre Dupont,et al. Improved Smoothing for Probabilistic Suffix Trees Seen as Variable Order Markov Chains , 2002, ECML.

[38] Anton J. Enright,et al. GeneRAGE: a robust algorithm for sequence clustering and domain detection , 2000, Bioinform..

[39] D. Eisenberg,et al. Computational methods of analysis of protein-protein interactions. , 2003, Current opinion in structural biology.

[40] Andrew McCallum,et al. Reinforcement learning with selective perception and hidden state , 1996 .

[41] M S Waterman,et al. Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[42] Alex Bateman,et al. The InterPro Database, 2003 brings increased coverage and new features , 2003, Nucleic Acids Res..

[43] Raphail E. Krichevsky,et al. The performance of universal encoding , 1981, IEEE Trans. Inf. Theory.

[44] Jorma Rissanen,et al. The Minimum Description Length Principle in Coding and Modeling , 1998, IEEE Trans. Inf. Theory.

[45] Daniel Povey,et al. Large scale discriminative training for speech recognition , 2000 .

[46] N. Wicker,et al. Secator: a program for inferring protein subfamilies from phylogenetic trees. , 2001, Molecular biology and evolution.

[47] Tim J. P. Hubbard,et al. SCOP database in 2002: refinements accommodate structural genomics , 2002, Nucleic Acids Res..

[48] A. Valencia,et al. Automatic methods for predicting functionally important residues. , 2003, Journal of molecular biology.

[49] Thomas G. Dietterich,et al. Learning with Many Irrelevant Features , 1991, AAAI.

[50] Anton J. Enright,et al. Protein interaction maps for complete genomes based on gene fusion events , 1999, Nature.

[51] Burkhard Rost,et al. Domains, motifs and clusters in the protein universe. , 2003, Current opinion in chemical biology.

[52] Andreas Stolcke,et al. Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[53] S. Salzberg,et al. Interpolated Markov models for eukaryotic gene finding. , 1999, Genomics.

[54] Naoki Abe,et al. On the computational complexity of approximating distributions by probabilistic automata , 1990, Machine Learning.

[55] M. A. Basharov. Cotranslational Folding of Proteins , 2004, Biochemistry (Moscow).

[56] Jérôme Gouzy,et al. Whole Genome Protein Domain Analysis using a New Method for Domain Clustering , 1999, Comput. Chem..

[57] Jun S. Liu,et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[58] Golan Yona,et al. Modeling protein families using probabilistic suffix trees , 1999, RECOMB.

[59] G. F. Hughes,et al. On the mean accuracy of statistical pattern recognizers , 1968, IEEE Trans. Inf. Theory.

[60] Owen White,et al. The TIGRFAMs database of protein families , 2003, Nucleic Acids Res..

[61] A. Fedorov,et al. Contribution of cotranslational folding to the rate of formation of native protein structure. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[62] S. Altschul. Amino acid substitution matrices from an information theoretic perspective , 1991, Journal of Molecular Biology.

[63] A. Fersht,et al. Folding of circular and permuted chymotrypsin inhibitor 2: retention of the folding nucleus. , 1998, Biochemistry.

[64] Chris Sander,et al. Touring protein fold space with Dali/FSSP , 1998, Nucleic Acids Res..

[65] H. Margalit,et al. Novel small RNA-encoding genes in the intergenic regions of Escherichia coli , 2001, Current Biology.

[66] C Kulikowski,et al. Automatic discovery of sub-molecular sequence domains in multi-aligned sequences: a dynamic programming algorithm for multiple alignment segmentation. , 2000, Journal of theoretical biology.

[67] David L. Eaton,et al. Glutathione S‐transferases: Amino acid sequence comparison, classification and phylogenetic relationship , 1992 .

[68] J. Thompson,et al. Multiple sequence alignment with Clustal X. , 1998, Trends in biochemical sciences.

[69] Cathy H. Wu,et al. iProClass: an integrated database of protein family, function and structure information , 2003, Nucleic Acids Res..

[70] J. Thompson,et al. Using CLUSTAL for multiple sequence alignments. , 1996, Methods in enzymology.

[71] S. Henikoff,et al. Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[72] Sean R. Eddy,et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[73] A. D. McLachlan,et al. Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[74] A. C. May,et al. Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies , 2002, Protein science : a publication of the Protein Society.

[75] R. A. George,et al. Protein domain identification and improved sequence similarity searching using PSI‐BLAST , 2002, Proteins.

[76] Yoram Singer,et al. The Hierarchical Hidden Markov Model: Analysis and Applications , 1998, Machine Learning.

[77] Jiye Shi,et al. HOMSTRAD: adding sequence information to structure-based alignments of homologous protein families , 2001, Bioinform..

[78] P. Bork,et al. Protein domain analysis in the era of complete genomes , 2002, FEBS letters.

[79] C. Ponting,et al. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? , 2001, Journal of structural biology.

[80] C Ouzounis,et al. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins , 1999, Proteins.

[81] Jeffrey E. F. Friedl. Mastering Regular Expressions , 1997 .

[82] Burkhard Rost,et al. Target space for structural genomics revisited , 2002, Bioinform..

[83] Padhraic Smyth,et al. Decision tree design from a communication theory standpoint , 1988, IEEE Trans. Inf. Theory.

[84] Khar Heng Choo,et al. Recent Applications of Hidden Markov Models in Computational Biology , 2004 .

[85] Walter R. Gilks,et al. Modeling the percolation of annotation errors in a database of protein sequences , 2002, Bioinform..

[86] Thomas M. Cover,et al. Elements of Information Theory , 2005 .

[87] Gill Bejerano. Algorithms for variable length Markov chain modeling , 2004, Bioinform..

[88] William M. Campbell,et al. Mutual Information in Learning Feature Transformations , 2000, ICML.

[89] Jérôme Gouzy,et al. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons , 2000, Nucleic Acids Res..

[90] Sean R. Eddy,et al. HMMER User's Guide - Biological sequence analysis using profile hidden Markov models , 1998 .

[91] P Argos,et al. DOMO: a new database of aligned protein domains. , 1998, Trends in biochemical sciences.

[92] J. Parker. Amino Acid Substitution , 2001 .

[93] Golan Yona,et al. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families , 2001, Bioinform..

[94] Sam Griffiths-Jones,et al. The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs , 2002, Bioinform..

[95] Charles Elkan,et al. Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymers , 1994, ISMB.

[96] Stephen H. Bryant,et al. Domain size distributions can predict domain boundaries , 2000, Bioinform..

[97] D. Haussler,et al. Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[98] Ran El-Yaniv,et al. Agnostic Classification of Markovian Sequences , 1997, NIPS.

[99] Ori Sasson,et al. ProtoNet: hierarchical classification of the protein space , 2003, Nucleic Acids Res..

[100] M. Grossmann,et al. G Protein-coupled Receptors , 1998, The Journal of Biological Chemistry.

[101] Amos Bairoch,et al. PROSITE: A Documented Database Using Patterns and Profiles as Motif Descriptors , 2002, Briefings Bioinform..

[102] R. A. George,et al. Snapdragon: a Method to Delineate Protein Structural Domains from Sequence Data , 2002 .

[103] A. Valencia,et al. Correlated mutations contain information about protein-protein interaction. , 1997, Journal of molecular biology.

[104] James E. Bray,et al. The CATH database: an extended protein family resource for structural and functional genomics , 2003, Nucleic Acids Res..

[105] C. Reynolds,et al. Correlated mutations amongst the external residues of G-protein coupled receptors. , 1997, Biochemical Society transactions.

[106] C. Mcwherter,et al. Circular permutation of granulocyte colony-stimulating factor. , 1999, Biochemistry.

[107] Eleazar Eskin,et al. Protein Family Classification Using Sparse Markov Transducers , 2000, ISMB.

[108] C. Chothia,et al. The geometry of domain combination in proteins. , 2002, Journal of molecular biology.

[109] H A Scheraga,et al. Lattice neural network minimization. Application of neural network optimization for locating the global-minimum conformations of proteins. , 1993, Journal of molecular biology.

[110] Edward M. McCreight,et al. A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[111] David Haussler,et al. What Size Net Gives Valid Generalization? , 1989, Neural Computation.

[112] D. Rubin,et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[113] David Haussler,et al. The Smallest Automaton Recognizing the Subwords of a Text , 1985, Theor. Comput. Sci..

[114] R. Russell,et al. Analysis and prediction of functional sub-types from protein sequence alignments. , 2000, Journal of molecular biology.

[115] J B Hurley,et al. Two amino acid substitutions convert a guanylyl cyclase, RetGC-1, into an adenylyl cyclase. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[116] John D. Lafferty,et al. Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[117] J. Moody,et al. Feature Selection Based on Joint Mutual Information , 1999 .

[118] W R Pearson,et al. Flexible sequence similarity searching with the FASTA3 program package. , 2000, Methods in molecular biology.

[119] Shlomo Dubnov,et al. Using Machine-Learning Methods for Musical Style Modeling , 2003, Computer.

[120] Ronitt Rubinfeld,et al. On the learnability of discrete distributions , 1994, STOC '94.

[121] P. Bühlmann,et al. Variable Length Markov Chains: Methodology, Computing, and Software , 2004 .

[122] J A Epstein,et al. Crystal structure of the human Pax6 paired domain-DNA complex reveals specific roles for the linker region and carboxy-terminal subdomain in DNA binding. , 1999, Genes & development.

[123] Nir Friedman,et al. A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites , 2001, WABI.

[124] Naftali Tishby,et al. Efficient Exact p-Value Computation for Small Sample, Sparse, and Surprising Categorical Data , 2004, J. Comput. Biol..

[125] Susan T. Dumais,et al. Inductive learning algorithms and representations for text categorization , 1998, CIKM '98.

[126] Martin Vingron,et al. The SYSTERS protein sequence cluster set , 2000, Nucleic Acids Res..

[127] T. N. Bhat,et al. The Protein Data Bank , 2000, Nucleic Acids Res..

[128] J Roca,et al. The mechanisms of DNA topoisomerases. , 1995, Trends in biochemical sciences.

[129] Peter Bühlmann,et al. Model Selection for Variable Length Markov Chains and Tuning the Context Algorithm , 2000 .

[130] Vladimir Vapnik,et al. Statistical learning theory , 1998 .

[131] Dan Gusfield,et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[132] Peter Weiner,et al. Linear Pattern Matching Algorithms , 1973, SWAT.

[133] J. Hayes,et al. The glutathione S-transferase supergene family: regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance. , 1995, Critical reviews in biochemistry and molecular biology.

[134] M. Gerstein,et al. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. , 2000, Journal of molecular biology.

[135] Esko Ukkonen,et al. On-line construction of suffix trees , 1995, Algorithmica.

[136] James E. Johnson,et al. MetaFam: a unified classification of protein families. II. Schema and query capabilities , 2001, Bioinform..

[137] D T Jones,et al. A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. , 1999, Structure.

[138] Terri K. Attwood,et al. PRINTS and its automatic supplement, prePRINTS , 2003, Nucleic Acids Res..

[139] Stefano Lonardi,et al. Efficient Detection of Unusual Words , 2000, J. Comput. Biol..

[140] Peer Bork,et al. Recent improvements to the SMART domain-based sequence annotation resource , 2002, Nucleic Acids Res..

[141] Dana Angluin,et al. Learning Markov chains with variable memory length from noisy output , 1997, COLT '97.

[142] Liisa Holm,et al. Picasso: generating a covering set of protein family profiles , 2001, Bioinform..

[143] Roberto Battiti,et al. Using mutual information for selecting features in supervised neural net learning , 1994, IEEE Trans. Neural Networks.

[144] G. Barrows,et al. A mutual information measure for feature selection with application to pulse classification , 1996, Proceedings of Third International Symposium on Time-Frequency and Time-Scale Analysis (TFTS-96).

[145] Mikhail A. Roytberg,et al. Segmentation of long genomic sequences into domains with homogeneous composition with BASIO software , 2001, Bioinform..

[146] Rolf Apweiler,et al. Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters , 2003, Nucleic Acids Res..

[147] Jérôme Gracy,et al. Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment , 1998, Bioinform..

[148] S. Henikoff,et al. Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[149] Renato De Mori,et al. High-performance connected digit recognition using maximum mutual information estimation , 1994, IEEE Trans. Speech Audio Process..

[150] Jorja G. Henikoff,et al. Using substitution probabilities to improve position-specific scoring matrices , 1996, Comput. Appl. Biosci..

[151] Anders Krogh,et al. SAM: Sequence Alignment and Modeling Software System , 1995 .

[152] Jorma Rissanen,et al. A universal data compression system , 1983, IEEE Trans. Inf. Theory.

[153] David Haussler,et al. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[154] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[155] Imre Csiszár,et al. On the computation of rate-distortion functions (Corresp.) , 1974, IEEE Trans. Inf. Theory.

[156] D. Eisenberg,et al. Detecting protein function and protein-protein interactions from genome sequences. , 1999, Science.

[157] A. Mees,et al. Context-tree modeling of observed symbolic dynamics. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[158] Lalit R. Bahl,et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[159] William Noble Grundy,et al. Meta-MEME: motif-based hidden Markov models of protein families , 1997, Comput. Appl. Biosci..

[160] Jorma Rissanen,et al. Stochastic Complexity in Statistical Inquiry , 1989, World Scientific Series in Computer Science.

[161] Dana Ron,et al. Learning to model sequences generated by switching distributions , 1995, COLT '95.

[162] L. Wu,et al. Autonomous protein folding units. , 2000, Advances in protein chemistry.

[163] Golan Yona,et al. Towards a Complete Map of the Protein Space Based on a Unified Sequence and Structure Analysis of All Known Proteins , 2000, ISMB.