Análise e compressão de sequências genómicas

[1]  Ian H. Witten,et al.  Arithmetic coding for data compression , 1987, CACM.

[2]  Sudhir Kumar,et al.  Multiple sequence alignment: in pursuit of homologous DNA positions. , 2007, Genome research.

[3]  Ioan Tabus,et al.  An efficient normalized maximum likelihood algorithm for DNA sequence compression , 2005, TOIS.

[4]  Kimmo Fredriksson,et al.  Shift-or string matching with super-alphabets , 2003, Inf. Process. Lett..

[5]  Norman Abramson,et al.  Information theory and coding , 1963 .

[6]  John Shawe-Taylor,et al.  Fast string matching using an n‐gram algorithm , 1994, Softw. Pract. Exp..

[7]  Jean-Paul Delahaye,et al.  Fast Discerning Repeats in DNA Sequences with a Compression Algorithm , 1997 .

[8]  Simon Cawley,et al.  Applications of generalized pair hidden Markov models to alignment and gene finding problems. , 2002 .

[9]  Udi Manber,et al.  A FAST ALGORITHM FOR MULTI-PATTERN SEARCHING , 1999 .

[10]  Marc-Thorsten Hütt,et al.  Genome Phylogeny Based on Short-Range Correlations in DNA Sequences , 2005, J. Comput. Biol..

[11]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[12]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Tanya Z. Berardini,et al.  PatMatch: a program for finding patterns in peptide and nucleotide sequences , 2005, Nucleic Acids Res..

[14]  N. Goodman Biological data becomes computer literate: new advances in bioinformatics. , 2002, Current opinion in biotechnology.

[15]  G. F. Joyce The antiquity of RNA-based evolution , 2002, Nature.

[16]  Ian H. Witten,et al.  Data mining in bioinformatics using Weka , 2004, Bioinform..

[17]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[18]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[19]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[20]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[21]  Kimmo Fredriksson,et al.  Faster String Matching with Super-Alphabets , 2002, SPIRE.

[22]  T. Oikonomou,et al.  Power law exponents characterizing human DNA. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[23]  Deborah Joseph,et al.  Beyond tandem repeats: complex pattern structures and distant regions of similarity , 2002, ISMB.

[24]  S. Naranan,et al.  Information Theory and Algorithmic Complexity: Applications to Language Discourses and DNA Sequences as Complex Systems Part II: Complexity of DNa Sequences, Analogy with Linguistic Discourses , 2000, J. Quant. Linguistics.

[25]  Vladimir D. Gusev,et al.  On the complexity measures of genetic sequences , 1999, Bioinform..

[26]  D. Mount Bioinformatics: Sequence and Genome Analysis , 2001 .

[27]  Trevor I. Dix,et al.  Sequence Complexity for Biological Sequence Analysis , 2000, Comput. Chem..

[28]  L. Patthy Modular Assembly of Genes and the Evolution of New Functions , 2003, Genetica.

[29]  R. Voss,et al.  Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. , 1992, Physical review letters.

[30]  Gregory Kucherov,et al.  mreps: efficient and flexible detection of tandem repeats in DNA , 2003, Nucleic Acids Res..

[31]  Arnaud Lefebvre,et al.  FORRepeats: detects repeats on entire chromosomes and between genomes , 2003, Bioinform..

[32]  Jacques Cohen,et al.  Computer science and bioinformatics , 2005, CACM.

[33]  O. White,et al.  A quality control algorithm for DNA sequencing projects. , 1993, Nucleic acids research.

[34]  H. Herzel Complexity of symbol sequences , 1988 .

[35]  Richard Clark Pasco,et al.  Source coding algorithms for fast data compression , 1976 .

[36]  En-Hui Yang,et al.  Grammar-based codes: A new class of universal lossless source codes , 2000, IEEE Trans. Inf. Theory.

[37]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[38]  D. Kugiumtzis,et al.  Statistical analysis of gene and intergenic DNA sequences , 2004, q-bio/0404024.

[39]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[40]  Noam Chomsky,et al.  The Logical Structure of Linguistic Theory , 1975 .

[41]  Stéphane Grumbach,et al.  Compression of DNA sequences , 1993, [Proceedings] DCC `93: Data Compression Conference.

[42]  Thierry Lecroq,et al.  Fast exact string matching algorithms , 2007, Inf. Process. Lett..

[43]  Eric V. Denardo,et al.  Dynamic Programming: Models and Applications , 2003 .

[44]  J. S. Heslop-Harrison,et al.  Genomes, genes and junk: the large-scale organization of plant chromosomes , 1998 .

[45]  Yong Zhang,et al.  DNA sequence compression using the Burrows-Wheeler Transform , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[46]  Armando J. Pinho,et al.  Exploring Three-Base Periodicity for DNA Compression and Modeling , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[47]  Sam Kwong,et al.  A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison. , 1999 .

[48]  Frantisek Franek,et al.  A simple fast hybrid pattern-matching algorithm , 2007, J. Discrete Algorithms.

[49]  D. Lipman,et al.  Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[50]  Arun Krishnan,et al.  Exhaustive whole-genome tandem repeats search , 2004, Bioinform..

[51]  Trevor I. Dix,et al.  Comparative analysis of long DNA sequences by per element information content using different contexts , 2007, BMC Bioinformatics.

[52]  Nikolay V. Dokholyan,et al.  Similarity and dissimilarity in correlations of genomic DNA , 2007 .

[53]  Ming Li,et al.  Superiority and complexity of the spaced seeds , 2006, SODA 2006.

[54]  Donald E. Knuth,et al.  Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[55]  P. Sellers On the Theory and Computation of Evolutionary Distances , 1974 .

[56]  D. Huffman A Method for the Construction of Minimum-Redundancy Codes , 1952 .

[57]  John C. Wootton,et al.  Discovering Simple Regions in Biological Sequences Associated with Scoring Schemes , 2003, J. Comput. Biol..

[58]  Behshad Behzadi,et al.  DNA Compression Challenge Revisited: A Dynamic Programming Approach , 2005, CPM.

[59]  Neri Merhav,et al.  Hidden Markov processes , 2002, IEEE Trans. Inf. Theory.

[60]  Serap A. Savari,et al.  On the entropy of DNA: algorithms and measurements based on memory and rapid convergence , 1995, SODA '95.

[61]  Dina Sokol,et al.  Filtering Tandem Repeats in DNA Sequences , 2006, BIOCOMP.

[62]  Ka-Lok Ng,et al.  Quantitative linguistic study of DNA sequences , 2003 .

[63]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[64]  Giovanni Manzini,et al.  A simple and fast DNA compressor , 2004, Softw. Pract. Exp..

[65]  W. Pearson Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. , 1991, Genomics.

[66]  Trevor I. Dix,et al.  A Simple Statistical Algorithm for Biological Sequence Compression , 2007, 2007 Data Compression Conference (DCC'07).

[67]  En-Hui Yang,et al.  Estimating DNA sequence entropy , 2000, SODA '00.

[68]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[69]  Ian H. Witten,et al.  The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[70]  Pietro Lio,et al.  Statistical analysis of simple repeats in the human genome , 2005, q-bio/0502009.

[71]  H. Müller,et al.  Statistical methods for DNA sequence segmentation , 1998 .

[72]  C. Peng,et al.  Long-range correlations in nucleotide sequences , 1992, Nature.

[73]  John Case,et al.  Computing Entropy for Ortholog Detection , 2004, International Conference on Computational Intelligence.

[74]  R. Rosenfeld,et al.  Two decades of statistical language modeling: where do we go from here? , 2000, Proceedings of the IEEE.

[75]  Trevor I. Dix,et al.  Compression and Approximate Matching , 1999, Comput. J..

[76]  Gaston H. Gonnet,et al.  A new approach to text searching , 1992, CACM.

[77]  M Dauchet,et al.  Compression and genetic sequence analysis. , 1996, Biochimie.

[78]  R. Mantegna,et al.  Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. , 1995, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics.

[79]  Mateo Valero,et al.  Performance Analysis of Sequence Alignment Applications , 2006, 2006 IEEE International Symposium on Workload Characterization.

[80]  Ian Witten,et al.  Data Mining , 2000 .

[81]  John G. Cleary,et al.  Unbounded length contexts for PPM , 1995, Proceedings DCC '95 Data Compression Conference.

[82]  Richard R. Sinden,et al.  Triplet repeat DNA structures and human genetic disease: dynamic mutations from dynamic DNA , 2002, Journal of Biosciences.

[83]  A Hariri,et al.  On the validity of Shannon-information calculations for molecular biological sequences. , 1990, Journal of theoretical biology.

[84]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[85]  William B. Langdon,et al.  Repeated Sequences in Linear GP Genomes , 2004 .

[86]  Paulo Carvalho,et al.  GRASPm: an efficient algorithm for exact pattern-matching in genomic sequences , 2009, Int. J. Bioinform. Res. Appl..

[87]  John M. Hancock Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects , 2002, Genetica.

[88]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[89]  Bin Ma,et al.  DNACompress: fast and effective DNA sequence compression , 2002, Bioinform..

[90]  Ioan Tabus,et al.  DNA sequence compression using the normalized maximum likelihood model for discrete regression , 2003, Data Compression Conference, 2003. Proceedings. DCC 2003.

[91]  Stephen Benz,et al.  A DNA Motif Lexicon: cataloguing and annotating sequences. , 2004, In silico biology.

[92]  Maria de Sousa Vieira,et al.  Statistics of DNA sequences: a low-frequency analysis. , 1999, cond-mat/9905074.

[93]  P Bork,et al.  Automated extraction of information in molecular biology , 2000, FEBS letters.

[94]  Yuriy L. Orlov,et al.  Complexity: an internet resource for analysis of DNA sequence complexity , 2004, Nucleic Acids Res..

[95]  Gonzalo Navarro,et al.  Fast and flexible string matching by combining bit-parallelism and suffix automata , 2000, JEAL.

[96]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[97]  Jeremy Buhler,et al.  Designing Multiple Simultaneous Seeds for DNA Similarity Search , 2005, J. Comput. Biol..

[98]  Costas S. Iliopoulos,et al.  Finding Approximate Occurrences of a Pattern That Contains Gaps , 2003 .

[99]  S Karlin,et al.  Patchiness and correlations in DNA sequences , 1993, Science.

[100]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[101]  Khalid Sayood Lossless Compression Handbook , 2003 .

[102]  Ivo Grosse,et al.  Repeats and correlations in human DNA sequences. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[103]  Jorma Tarhio,et al.  Alternative Algorithms for Bit-Parallel String Matching , 2003, SPIRE.

[104]  Huiru Zheng,et al.  An assessment of machine and statistical learning approaches to inferring networks of protein-protein interactions , 2006, J. Integr. Bioinform..

[105]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[106]  Gonzalo Navarro,et al.  A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching , 1998, CPM.

[107]  S. Acharya Some Aspects of Physicochemical Properties of DNA and RNA , 2006 .

[108]  H. Herzel,et al.  Estimating the entropy of DNA sequences. , 1997, Journal of theoretical biology.

[109]  D. Krane,et al.  Fundamental Concepts of Bioinformatics , 2002 .

[110]  Daniel Sunday,et al.  A very fast substring search algorithm , 1990, CACM.

[111]  J. Shapiro A 21st century view of evolution: genome system architecture, repetitive DNA, and natural genetic engineering. , 2005, Gene.

[112]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[113]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[114]  W. Ebeling,et al.  Finite sample effects in sequence analysis , 1994 .

[115]  Joshua Goodman,et al.  A bit of progress in language modeling , 2001, Comput. Speech Lang..

[116]  Daniel G. Brown Optimizing Multiple Seeds for Protein Homology Search , 2005, TCBB.

[117]  M. Turker,et al.  Tandem B1 Elements Located in a Mouse Methylation Center Provide a Target for de Novo DNA Methylation* , 1999, The Journal of Biological Chemistry.

[118]  Indranil Mukhopadhyay,et al.  Word organization in coding DNA: A mathematical model , 2006, Theory in Biosciences.

[119]  Jeremy Buhler,et al.  Designing seeds for similarity search in genomic DNA , 2005, J. Comput. Syst. Sci..

[120]  Valeria De Fonzo,et al.  Hidden Markov Models in Bioinformatics , 2007 .

[121]  Xin Chen,et al.  An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[122]  J. Jurka,et al.  Microsatellites in different eukaryotic genomes: survey and analysis. , 2000, Genome research.

[123]  Gary Benson,et al.  Tandem repeats over the edit distance , 2007, Bioinform..

[124]  P Bernaola-Galván,et al.  Study of statistical correlations in DNA sequences. , 2002, Gene.

[125]  Frans M. J. Willems,et al.  The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[126]  John C. Kieffer,et al.  Ergodic behavior of graph entropy , 1997 .

[127]  V. R. Chechetkin,et al.  LEVELS OF ORDERING IN CODING AND NONCODING REGIONS OF DNA SEQUENCES , 1996 .

[128]  Limsoon Wong,et al.  Accomplishments and challenges in literature data mining for biology , 2002, Bioinform..

[129]  Werner Ebeling,et al.  Entropy and complexity of finite sequences as fluctuating quantities. , 2002, Bio Systems.

[130]  Antoine Danchin,et al.  Genome structures, operating systems and the image of the machine , 2004 .

[131]  Anthony Jf Griffiths,et al.  Modern Genetic Analysis , 1998 .

[132]  Ian H. Witten,et al.  Arithmetic coding revisited , 1998, TOIS.

[133]  R. Nigel Horspool,et al.  Practical fast searching in strings , 1980, Softw. Pract. Exp..

[134]  Thierry Lecroq,et al.  Experimental results on string matching algorithms , 1995, Softw. Pract. Exp..

[135]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[136]  W R Pearson,et al.  Flexible sequence similarity searching with the FASTA3 program package. , 2000, Methods in molecular biology.

[137]  W. Stemmer,et al.  Directed evolution of proteins by exon shuffling , 2001, Nature Biotechnology.

[138]  Szymon Grabowski,et al.  Revisiting dictionary‐based compression , 2005, Softw. Pract. Exp..

[139]  J. Goodman,et al.  The long (LINEs) and the short (SINEs) of it: altered methylation as a precursor to toxicity. , 2003, Toxicological sciences : an official journal of the Society of Toxicology.

[140]  Stefano Lonardi,et al.  Compression of biological sequences by greedy off-line textual substitution , 2000, Proceedings DCC 2000. Data Compression Conference.

[141]  Gregory Kucherov,et al.  Improved hit criteria for DNA local alignment , 2004, BMC Bioinformatics.

[142]  D. Leach Long DNA palindromes, cruciform structures, genetic instability and secondary structure repair , 1994, BioEssays : news and reviews in molecular, cellular and developmental biology.

[143]  Alfonso Valencia,et al.  Information extraction in molecular biology , 2002, Briefings Bioinform..

[144]  Hanspeter Herzel,et al.  Correlations in DNA sequences: The role of protein coding segments , 1997 .

[145]  D R Powell,et al.  Discovering simple DNA sequences by compression. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[146]  Shigehiko Kanaya,et al.  Statistical Analysis of Genomic Information: Various Periodicities in DNA Sequence , 2001 .

[147]  Ian H. Witten,et al.  Text Compression , 1990, 125 Problems in Text Algorithms.

[148]  M. Ridley,et al.  Genome: The Autobiography of a Species In 23 Chapters , 1999 .

[149]  J. Lobry THE BLACK HOLE OF SYMMETRIC MOLECULAR EVOLUTION , 2000 .

[150]  Zaher Dawy,et al.  Genomic analysis using methods from information theory , 2004, Information Theory Workshop.

[151]  Toshiko Matsumoto,et al.  Biological sequence compression algorithms. , 2000, Genome informatics. Workshop on Genome Informatics.

[152]  Dachao Li,et al.  Conditional LZ Complexity of DNA Sequences Analysis and its Application in Phylogenetic Tree Reconstruction , 2008, 2008 International Conference on BioMedical Engineering and Informatics.

[153]  Bin Ma,et al.  Patternhunter Ii: Highly Sensitive and Fast Homology Search , 2004, J. Bioinform. Comput. Biol..

[154]  David Loewenstern,et al.  Significantly Lower Entropy Estimates for Natural DNA Sequences , 1999, J. Comput. Biol..

[155]  Jorma Rissanen,et al.  Generalized Kraft Inequality and Arithmetic Coding , 1976, IBM J. Res. Dev..

[156]  Lukas Wagner,et al.  A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[157]  Chiara Romualdi,et al.  Differential expression of genes coding for ribosomal proteins in different human tissues , 2001, Bioinform..

[158]  R. Bellman Dynamic programming. , 1957, Science.

[159]  Eric Coissac,et al.  Origin and fate of repeats in bacteria , 2002, Nucleic Acids Res..

[160]  H Herzel,et al.  Information content of protein sequences. , 2000, Journal of theoretical biology.

[161]  M. Singer SINEs and LINEs: Highly repeated short and long interspersed sequences in mammalian genomes , 1982, Cell.

[162]  C Patience,et al.  Our retroviral heritage. , 1997, Trends in genetics : TIG.

[163]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[164]  T. Govezensky,et al.  Statistical properties of DNA sequences revisited: the role of inverse bilateral symmetry in bacterial chromosomes , 2004, q-bio/0408014.

[165]  Michael D. Hendy,et al.  Compressing DNA sequence databases with coil , 2007, BMC Bioinformatics.

[166]  Lila L. Gatlin,et al.  Information theory and the living system , 1972 .

[167]  Louxin Zhang,et al.  Good spaced seeds for homology search , 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[168]  Max Dauchet,et al.  A first step toward chromosome analysis by compression algorithms , 1995, Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS'95.

[169]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[170]  Samuel Karlin,et al.  Comparative statistics for DNA and protein sequences: multiple sequence analysis , 1985 .

[171]  Timothy B. Stockwell,et al.  The Sequence of the Human Genome , 2001, Science.

[172]  Richard W. Hamming,et al.  Error detecting and error correcting codes , 1950 .

[173]  Claude E. Shannon,et al.  The Mathematical Theory of Communication , 1950 .

[174]  Bin Ma,et al.  Optimizing Multiple Spaced Seeds for Homology Search , 2004, CPM.

[175]  Mike Alder,et al.  Natural Language Grammatical Inference , 1994 .