Análise e compressão de sequências genómicas (Analysis and compression of genomic sequences)

Sérgio Deusdado

[1] Ian H. Witten,et al. Arithmetic coding for data compression , 1987, CACM.

[2] Sudhir Kumar,et al. Multiple sequence alignment: in pursuit of homologous DNA positions. , 2007, Genome research.

[3] Ioan Tabus,et al. An efficient normalized maximum likelihood algorithm for DNA sequence compression , 2005, TOIS.

[4] Kimmo Fredriksson,et al. Shift-or string matching with super-alphabets , 2003, Inf. Process. Lett..

[5] Norman Abramson,et al. Information theory and coding , 1963 .

[6] John Shawe-Taylor,et al. Fast string matching using an n‐gram algorithm , 1994, Softw. Pract. Exp..

[7] Jean-Paul Delahaye,et al. Fast Discerning Repeats in DNA Sequences with a Compression Algorithm , 1997 .

[8] Simon Cawley,et al. Applications of generalized pair hidden Markov models to alignment and gene finding problems. , 2002 .

[9] Udi Manber,et al. A FAST ALGORITHM FOR MULTI-PATTERN SEARCHING , 1999 .

[10] Marc-Thorsten Hütt,et al. Genome Phylogeny Based on Short-Range Correlations in DNA Sequences , 2005, J. Comput. Biol..

[11] Slava M. Katz,et al. Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[12] D. Lipman,et al. Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[13] Tanya Z. Berardini,et al. PatMatch: a program for finding patterns in peptide and nucleotide sequences , 2005, Nucleic Acids Res..

[14] N. Goodman. Biological data becomes computer literate: new advances in bioinformatics. , 2002, Current opinion in biotechnology.

[15] G. F. Joyce. The antiquity of RNA-based evolution , 2002, Nature.

[16] Ian H. Witten,et al. Data mining in bioinformatics using Weka , 2004, Bioinform..

[17] M S Waterman,et al. Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[18] Ming Li,et al. An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[19] Chuong B. Do,et al. doi: 10.1101/gr.926603 , 2003 .

[20] Durbin,et al. Biological Sequence Analysis , 1998 .

[21] Kimmo Fredriksson,et al. Faster String Matching with Super-Alphabets , 2002, SPIRE.

[22] T. Oikonomou,et al. Power law exponents characterizing human DNA. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[23] Deborah Joseph,et al. Beyond tandem repeats: complex pattern structures and distant regions of similarity , 2002, ISMB.

[24] S. Naranan,et al. Information Theory and Algorithmic Complexity: Applications to Language Discourses and DNA Sequences as Complex Systems Part II: Complexity of DNA Sequences, Analogy with Linguistic Discourses , 2000, J. Quant. Linguistics.

[25] Vladimir D. Gusev,et al. On the complexity measures of genetic sequences , 1999, Bioinform..

[26] D. Mount. Bioinformatics: Sequence and Genome Analysis , 2001 .

[27] Trevor I. Dix,et al. Sequence Complexity for Biological Sequence Analysis , 2000, Comput. Chem..

[28] L. Patthy. Modular Assembly of Genes and the Evolution of New Functions , 2003, Genetica.

[29] R. Voss,et al. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. , 1992, Physical review letters.

[30] Gregory Kucherov,et al. mreps: efficient and flexible detection of tandem repeats in DNA , 2003, Nucleic Acids Res..

[31] Arnaud Lefebvre,et al. FORRepeats: detects repeats on entire chromosomes and between genomes , 2003, Bioinform..

[32] Jacques Cohen,et al. Computer science and bioinformatics , 2005, CACM.

[33] O. White,et al. A quality control algorithm for DNA sequencing projects. , 1993, Nucleic acids research.

[34] H. Herzel. Complexity of symbol sequences , 1988 .

[35] Richard Clark Pasco,et al. Source coding algorithms for fast data compression , 1976 .

[36] En-Hui Yang,et al. Grammar-based codes: A new class of universal lossless source codes , 2000, IEEE Trans. Inf. Theory.

[37] Sean R. Eddy,et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[38] D. Kugiumtzis,et al. Statistical analysis of gene and intergenic DNA sequences , 2004, q-bio/0404024.

[39] E. Myers,et al. Basic local alignment search tool. , 1990, Journal of molecular biology.

[40] Noam Chomsky,et al. The Logical Structure of Linguistic Theory , 1975 .

[41] Stéphane Grumbach,et al. Compression of DNA sequences , 1993, [Proceedings] DCC `93: Data Compression Conference.

[42] Thierry Lecroq,et al. Fast exact string matching algorithms , 2007, Inf. Process. Lett..

[43] Eric V. Denardo,et al. Dynamic Programming: Models and Applications , 2003 .

[44] J. S. Heslop-Harrison,et al. Genomes, genes and junk: the large-scale organization of plant chromosomes , 1998 .

[45] Yong Zhang,et al. DNA sequence compression using the Burrows-Wheeler Transform , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[46] Armando J. Pinho,et al. Exploring Three-Base Periodicity for DNA Compression and Modeling , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[47] Sam Kwong,et al. A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison. , 1999 .

[48] Frantisek Franek,et al. A simple fast hybrid pattern-matching algorithm , 2007, J. Discrete Algorithms.

[49] D. Lipman,et al. Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[50] Arun Krishnan,et al. Exhaustive whole-genome tandem repeats search , 2004, Bioinform..

[51] Trevor I. Dix,et al. Comparative analysis of long DNA sequences by per element information content using different contexts , 2007, BMC Bioinformatics.

[52] Nikolay V. Dokholyan,et al. Similarity and dissimilarity in correlations of genomic DNA , 2007 .

[53] Ming Li,et al. Superiority and complexity of the spaced seeds , 2006, SODA 2006.

[54] Donald E. Knuth,et al. Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[55] P. Sellers. On the Theory and Computation of Evolutionary Distances , 1974 .

[56] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes , 1952 .

[57] John C. Wootton,et al. Discovering Simple Regions in Biological Sequences Associated with Scoring Schemes , 2003, J. Comput. Biol..

[58] Behshad Behzadi,et al. DNA Compression Challenge Revisited: A Dynamic Programming Approach , 2005, CPM.

[59] Neri Merhav,et al. Hidden Markov processes , 2002, IEEE Trans. Inf. Theory.

[60] Serap A. Savari,et al. On the entropy of DNA: algorithms and measurements based on memory and rapid convergence , 1995, SODA '95.

[61] Dina Sokol,et al. Filtering Tandem Repeats in DNA Sequences , 2006, BIOCOMP.

[62] Ka-Lok Ng,et al. Quantitative linguistic study of DNA sequences , 2003 .

[63] Abraham Lempel,et al. A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[64] Giovanni Manzini,et al. A simple and fast DNA compressor , 2004, Softw. Pract. Exp..

[65] W. Pearson. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. , 1991, Genomics.

[66] Trevor I. Dix,et al. A Simple Statistical Algorithm for Biological Sequence Compression , 2007, 2007 Data Compression Conference (DCC'07).

[67] En-Hui Yang,et al. Estimating DNA sequence entropy , 2000, SODA '00.

[68] Robert S. Boyer,et al. A fast string searching algorithm , 1977, CACM.

[69] Ian H. Witten,et al. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression , 1991, IEEE Trans. Inf. Theory.

[70] Pietro Lio,et al. Statistical analysis of simple repeats in the human genome , 2005, q-bio/0502009.

[71] H. Müller,et al. Statistical methods for DNA sequence segmentation , 1998 .

[72] C. Peng,et al. Long-range correlations in nucleotide sequences , 1992, Nature.

[73] John Case,et al. Computing Entropy for Ortholog Detection , 2004, International Conference on Computational Intelligence.

[74] R. Rosenfeld,et al. Two decades of statistical language modeling: where do we go from here? , 2000, Proceedings of the IEEE.

[75] Trevor I. Dix,et al. Compression and Approximate Matching , 1999, Comput. J..

[76] Gaston H. Gonnet,et al. A new approach to text searching , 1992, CACM.

[77] M Dauchet,et al. Compression and genetic sequence analysis. , 1996, Biochimie.

[78] R. Mantegna,et al. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. , 1995, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics.

[79] Mateo Valero,et al. Performance Analysis of Sequence Alignment Applications , 2006, 2006 IEEE International Symposium on Workload Characterization.

[80] Ian Witten,et al. Data Mining , 2000 .

[81] John G. Cleary,et al. Unbounded length contexts for PPM , 1995, Proceedings DCC '95 Data Compression Conference.

[82] Richard R. Sinden,et al. Triplet repeat DNA structures and human genetic disease: dynamic mutations from dynamic DNA , 2002, Journal of Biosciences.

[83] A Hariri,et al. On the validity of Shannon-information calculations for molecular biological sequences. , 1990, Journal of theoretical biology.

[84] Bin Ma,et al. PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[85] William B. Langdon,et al. Repeated Sequences in Linear GP Genomes , 2004 .

[86] Paulo Carvalho,et al. GRASPm: an efficient algorithm for exact pattern-matching in genomic sequences , 2009, Int. J. Bioinform. Res. Appl..

[87] John M. Hancock. Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects , 2002, Genetica.

[88] Robert L. Mercer,et al. Class-Based n-gram Models of Natural Language , 1992, CL.

[89] Bin Ma,et al. DNACompress: fast and effective DNA sequence compression , 2002, Bioinform..

[90] Ioan Tabus,et al. DNA sequence compression using the normalized maximum likelihood model for discrete regression , 2003, Data Compression Conference, 2003. Proceedings. DCC 2003.

[91] Stephen Benz,et al. A DNA Motif Lexicon: cataloguing and annotating sequences. , 2004, In silico biology.

[92] Maria de Sousa Vieira,et al. Statistics of DNA sequences: a low-frequency analysis. , 1999, cond-mat/9905074.

[93] P Bork,et al. Automated extraction of information in molecular biology , 2000, FEBS letters.

[94] Yuriy L. Orlov,et al. Complexity: an internet resource for analysis of DNA sequence complexity , 2004, Nucleic Acids Res..

[95] Gonzalo Navarro,et al. Fast and flexible string matching by combining bit-parallelism and suffix automata , 2000, JEAL.

[96] Abraham Lempel,et al. Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[97] Jeremy Buhler,et al. Designing Multiple Simultaneous Seeds for DNA Similarity Search , 2005, J. Comput. Biol..

[98] Costas S. Iliopoulos,et al. Finding Approximate Occurrences of a Pattern That Contains Gaps , 2003 .

[99] S Karlin,et al. Patchiness and correlations in DNA sequences , 1993, Science.

[100] Vladimir I. Levenshtein,et al. Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[101] Khalid Sayood. Lossless Compression Handbook , 2003 .

[102] Ivo Grosse,et al. Repeats and correlations in human DNA sequences. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[103] Jorma Tarhio,et al. Alternative Algorithms for Bit-Parallel String Matching , 2003, SPIRE.

[104] Huiru Zheng,et al. An assessment of machine and statistical learning approaches to inferring networks of protein-protein interactions , 2006, J. Integr. Bioinform..

[105] Thomas L. Madden,et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[106] Gonzalo Navarro,et al. A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching , 1998, CPM.

[107] S. Acharya. Some Aspects of Physicochemical Properties of DNA and RNA , 2006 .

[108] H. Herzel,et al. Estimating the entropy of DNA sequences. , 1997, Journal of theoretical biology.

[109] D. Krane,et al. Fundamental Concepts of Bioinformatics , 2002 .

[110] Daniel Sunday,et al. A very fast substring search algorithm , 1990, CACM.

[111] J. Shapiro. A 21st century view of evolution: genome system architecture, repetitive DNA, and natural genetic engineering. , 2005, Gene.

[112] Gonzalo Navarro,et al. A guided tour to approximate string matching , 2001, CSUR.

[113] David Sankoff,et al. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[114] W. Ebeling,et al. Finite sample effects in sequence analysis , 1994 .

[115] Joshua Goodman,et al. A bit of progress in language modeling , 2001, Comput. Speech Lang..

[116] Daniel G. Brown. Optimizing Multiple Seeds for Protein Homology Search , 2005, TCBB.

[117] M. Turker,et al. Tandem B1 Elements Located in a Mouse Methylation Center Provide a Target for de Novo DNA Methylation* , 1999, The Journal of Biological Chemistry.

[118] Indranil Mukhopadhyay,et al. Word organization in coding DNA: A mathematical model , 2006, Theory in Biosciences.

[119] Jeremy Buhler,et al. Designing seeds for similarity search in genomic DNA , 2005, J. Comput. Syst. Sci..

[120] Valeria De Fonzo,et al. Hidden Markov Models in Bioinformatics , 2007 .

[121] Xin Chen,et al. An information-based sequence distance and its application to whole mitochondrial genome phylogeny , 2001, Bioinform..

[122] J. Jurka,et al. Microsatellites in different eukaryotic genomes: survey and analysis. , 2000, Genome research.

[123] Gary Benson,et al. Tandem repeats over the edit distance , 2007, Bioinform..

[124] P Bernaola-Galván,et al. Study of statistical correlations in DNA sequences. , 2002, Gene.

[125] Frans M. J. Willems,et al. The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[126] John C. Kieffer,et al. Ergodic behavior of graph entropy , 1997 .

[127] V. R. Chechetkin,et al. LEVELS OF ORDERING IN CODING AND NONCODING REGIONS OF DNA SEQUENCES , 1996 .

[128] Limsoon Wong,et al. Accomplishments and challenges in literature data mining for biology , 2002, Bioinform..

[129] Werner Ebeling,et al. Entropy and complexity of finite sequences as fluctuating quantities. , 2002, Bio Systems.

[130] Antoine Danchin,et al. Genome structures, operating systems and the image of the machine , 2004 .

[131] Anthony Jf Griffiths,et al. Modern Genetic Analysis , 1998 .

[132] Ian H. Witten,et al. Arithmetic coding revisited , 1998, TOIS.

[133] R. Nigel Horspool,et al. Practical fast searching in strings , 1980, Softw. Pract. Exp..

[134] Thierry Lecroq,et al. Experimental results on string matching algorithms , 1995, Softw. Pract. Exp..

[135] I. Good. THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[136] W R Pearson,et al. Flexible sequence similarity searching with the FASTA3 program package. , 2000, Methods in molecular biology.

[137] W. Stemmer,et al. Directed evolution of proteins by exon shuffling , 2001, Nature Biotechnology.

[138] Szymon Grabowski,et al. Revisiting dictionary‐based compression , 2005, Softw. Pract. Exp..

[139] J. Goodman,et al. The long (LINEs) and the short (SINEs) of it: altered methylation as a precursor to toxicity. , 2003, Toxicological sciences : an official journal of the Society of Toxicology.

[140] Stefano Lonardi,et al. Compression of biological sequences by greedy off-line textual substitution , 2000, Proceedings DCC 2000. Data Compression Conference.

[141] Gregory Kucherov,et al. Improved hit criteria for DNA local alignment , 2004, BMC Bioinformatics.

[142] D. Leach. Long DNA palindromes, cruciform structures, genetic instability and secondary structure repair , 1994, BioEssays : news and reviews in molecular, cellular and developmental biology.

[143] Alfonso Valencia,et al. Information extraction in molecular biology , 2002, Briefings Bioinform..

[144] Hanspeter Herzel,et al. Correlations in DNA sequences: The role of protein coding segments , 1997 .

[145] D R Powell,et al. Discovering simple DNA sequences by compression. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[146] Shigehiko Kanaya,et al. Statistical Analysis of Genomic Information: Various Periodicities in DNA Sequence , 2001 .

[147] Ian H. Witten,et al. Text Compression , 1990 .

[148] M. Ridley,et al. Genome: The Autobiography of a Species In 23 Chapters , 1999 .

[149] J. Lobry. THE BLACK HOLE OF SYMMETRIC MOLECULAR EVOLUTION , 2000 .

[150] Zaher Dawy,et al. Genomic analysis using methods from information theory , 2004, Information Theory Workshop.

[151] Toshiko Matsumoto,et al. Biological sequence compression algorithms. , 2000, Genome informatics. Workshop on Genome Informatics.

[152] Dachao Li,et al. Conditional LZ Complexity of DNA Sequences Analysis and its Application in Phylogenetic Tree Reconstruction , 2008, 2008 International Conference on BioMedical Engineering and Informatics.

[153] Bin Ma,et al. PatternHunter II: Highly Sensitive and Fast Homology Search , 2004, J. Bioinform. Comput. Biol..

[154] David Loewenstern,et al. Significantly Lower Entropy Estimates for Natural DNA Sequences , 1999, J. Comput. Biol..

[155] Jorma Rissanen,et al. Generalized Kraft Inequality and Arithmetic Coding , 1976, IBM J. Res. Dev..

[156] Lukas Wagner,et al. A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[157] Chiara Romualdi,et al. Differential expression of genes coding for ribosomal proteins in different human tissues , 2001, Bioinform..

[158] R. Bellman. Dynamic programming. , 1957, Science.

[159] Eric Coissac,et al. Origin and fate of repeats in bacteria , 2002, Nucleic Acids Res..

[160] H Herzel,et al. Information content of protein sequences. , 2000, Journal of theoretical biology.

[161] M. Singer. SINEs and LINEs: Highly repeated short and long interspersed sequences in mammalian genomes , 1982, Cell.

[162] C Patience,et al. Our retroviral heritage. , 1997, Trends in genetics : TIG.

[163] Ian H. Witten,et al. Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..

[164] T. Govezensky,et al. Statistical properties of DNA sequences revisited: the role of inverse bilateral symmetry in bacterial chromosomes , 2004, q-bio/0408014.

[165] Michael D. Hendy,et al. Compressing DNA sequence databases with coil , 2007, BMC Bioinformatics.

[166] Lila L. Gatlin,et al. Information theory and the living system , 1972 .

[167] Louxin Zhang,et al. Good spaced seeds for homology search , 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[168] Max Dauchet,et al. A first step toward chromosome analysis by compression algorithms , 1995, Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS'95.

[169] S. B. Needleman,et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[170] Samuel Karlin,et al. Comparative statistics for DNA and protein sequences: multiple sequence analysis , 1985 .

[171] Timothy B. Stockwell,et al. The Sequence of the Human Genome , 2001, Science.

[172] Richard W. Hamming,et al. Error detecting and error correcting codes , 1950 .

[173] Claude E. Shannon,et al. The Mathematical Theory of Communication , 1950 .

[174] Bin Ma,et al. Optimizing Multiple Spaced Seeds for Homology Search , 2004, CPM.

[175] Mike Alder,et al. Natural Language Grammatical Inference , 1994 .