Novel Combinatorial and Information‐Theoretic Alignment‐Free Distances for Biological Data Mining

[1]  Jun Wang,et al.  WSE, a new sequence distance measure based on word frequencies , 2008, Mathematical Biosciences.

[2]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[3]  David R. Gilbert,et al.  Motif-based searching in TOPS protein topology databases , 1999, Bioinform..

[4]  Matteo Comin,et al.  Mining, compressing and classifying with extensible motifs , 2006, Algorithms for Molecular Biology.

[5]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[6]  Natalio Krasnogor,et al.  Measuring the similarity of protein structures by means of the universal similarity metric , 2004, Bioinform..

[7]  Antonio Restivo,et al.  A New Combinatorial Approach to Sequence Comparison , 2007, Theory of Computing Systems.

[8]  Khalid Sayood,et al.  A new sequence distance measure for phylogenetic tree construction , 2003, Bioinform..

[9]  M. Waterman,et al.  Distributional regimes for the number of k-word matches between two random sequences , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[10]  B. Steipe,et al.  Nh3D: A reference dataset of non-homologous protein structures , 2005, BMC Structural Biology.

[11]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[12]  Gonzalo Navarro,et al.  Compressed full-text indexes , 2007, CSUR.

[13]  Long Li,et al.  REDfly: a Regulatory Element Database for Drosophila , 2006, Bioinform..

[14]  Dong Xu,et al.  Phylogenetic analysis using complete signature information of whole genomes and clustered Neighbour-Joining method , 2006, Int. J. Bioinform. Res. Appl..

[15]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[16]  Paul M. B. Vitányi,et al.  Clustering by compression , 2003, IEEE Transactions on Information Theory.

[17]  O. Gotoh.  An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[18]  Yanchun Yang,et al.  Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison , 2008, Bioinform..

[19]  T. P. Flores,et al.  Protein structural topology: Automated analysis and diagrammatic representation , 2008, Protein science : a publication of the Protein Society.

[20]  H. Wilf,et al.  Uniqueness theorems for periodic functions , 1965.

[21]  W. Pearson,et al.  Sensitivity and selectivity in protein structure comparison , 2004, Protein science : a publication of the Protein Society.

[22]  Steve Baker,et al.  Integrated gene and species phylogenies from unaligned whole genome protein sequences , 2002, Bioinform..

[23]  Jacques van Helden,et al.  Metrics for comparing regulatory sequences on the basis of pattern counts , 2004, Bioinform..

[24]  Shengrui Wang,et al.  CLUSS: Clustering of protein sequences based on a new similarity measure , 2007, BMC Bioinformatics.

[25]  Xiang Fang,et al.  An improved string composition method for sequence comparison , 2008, BMC Bioinformatics.

[26]  T. P. Flores,et al.  An algorithm for automatically generating protein topology cartoons. , 1994, Protein engineering.

[27]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[28]  Tuan D. Pham,et al.  A probabilistic measure for alignment-free sequence comparison , 2004, Bioinform..

[29]  J M Thornton,et al.  An atlas of protein topology cartoons available on the World-Wide Web. , 1998, Trends in biochemical sciences.

[30]  W. Pearson.  Rapid and sensitive sequence comparison with FASTP and FASTA. , 1990, Methods in enzymology.

[31]  Sylvain Forêt,et al.  Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences , 2006, BMC Bioinformatics.

[32]  M. Kimura.  A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences , 1980, Journal of Molecular Evolution.

[33]  B. Blaisdell.  A measure of the similarity of sets of sequences not requiring sequence alignment. , 1986, Proceedings of the National Academy of Sciences of the United States of America.

[34]  James R. Cole,et al.  The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy , 2003, Nucleic Acids Res..

[35]  S. Pääbo,et al.  Conflict Among Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders , 1998, Journal of Molecular Evolution.

[36]  Z. Xuan,et al.  Phylogeny Based on Whole Genome as inferred from Complete Information Set Analysis , 2002, Journal of biological physics.

[37]  Raffaele Giancarlo,et al.  Textual data compression in computational biology: a synopsis , 2009, Bioinform..

[38]  Frances M. G. Pearl,et al.  The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis , 2004, Nucleic Acids Res..

[39]  L Alexander Lyznik,et al.  ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced. , 2004, Gene.

[40]  Antonio Restivo,et al.  Words and forbidden factors , 2002, Theor. Comput. Sci..

[41]  Tiee-Jian Wu,et al.  Statistical Measures of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition , 2001, Biometrics.

[42]  Susan R. Wilson,et al.  Approximate word matches between two random sequences , 2008.

[43]  Huey-Wen Yien,et al.  Linguistic analysis of the human heartbeat using frequency and rank order statistics. , 2003, Physical review letters.

[44]  David Burstein,et al.  The Average Common Substring Approach to Phylogenomic Reconstruction , 2006, J. Comput. Biol..

[45]  Klara Kedem,et al.  Finding the Consensus Shape for a Protein Family , 2003, Algorithmica.

[46]  Michael S. Waterman,et al.  Introduction to computational biology , 1995.

[47]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[48]  Jonas S. Almeida,et al.  Comparative evaluation of word composition distances for the recognition of SCOP relationships , 2004, Bioinform..

[49]  Alberto Apostolico,et al.  Fast algorithms for computing sequence distances by exhaustive substring composition , 2008, Algorithms for Molecular Biology.

[50]  C. J. Burden,et al.  Asymptotic Behavior of k-Word Matches Between two Uniformly Distributed Sequences , 2007, Journal of Applied Probability.

[51]  Gilles Didier,et al.  Local Decoding of Sequences and Alignment-Free Comparison , 2006, J. Comput. Biol..

[52]  Bin Ma,et al.  The similarity metric , 2001, IEEE Transactions on Information Theory.

[53]  J. Qi,et al.  Whole Proteome Prokaryote Phylogeny Without Sequence Alignment: A K-String Composition Approach , 2003, Journal of Molecular Evolution.

[54]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[55]  Saurabh Sinha,et al.  A statistical method for alignment-free comparison of regulatory sequences , 2007, ISMB/ECCB.

[56]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[57]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[58]  Hong Luo,et al.  CVTree: a phylogenetic tree reconstruction tool based on whole genomes , 2004, Nucleic Acids Res..

[59]  Edmund K. Burke,et al.  ProCKSI: a decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information , 2007, BMC Bioinformatics.

[60]  M. Zalis,et al.  Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms. , 1999, Journal of molecular biology.

[61]  Raffaele Giancarlo,et al.  Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment , 2007, BMC Bioinformatics.