Paper Information - A Compression-Based Method for Stemmatic Analysis

A Compression-Based Method for Stemmatic Analysis

Stemmatology studies relations among different variants of a text that has been gradually altered as a result of imperfectly copying the text over and over again. Applications are mainly in the humanities, especially textual criticism, but the methods can be used to study the evolution of any symbolic objects, including chain letters and computer viruses. We propose an algorithm for stemmatic analysis based on a minimum-information criterion and stochastic tree optimization. Our approach is related to phylogenetic reconstruction criteria such as maximum parsimony and maximum likelihood, and builds upon algorithmic techniques developed for bioinformatics. Unlike many earlier methods, the proposed method does not require significant preprocessing of the data but rather operates directly on aligned text files. We demonstrate our method on a real-world experiment involving all 52 known variants of the legend of St. Henry of Finland, and provide the first computer-generated family tree of the legend. The obtained tree of the variants is supported to a large extent by results obtained with more traditional methods, and identifies a number of previously unrecognized relations.
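To make the minimum-information idea concrete, here is a minimal sketch of how a compression-based cost for a candidate family tree might be computed. It is an illustration under stated assumptions, not the algorithm of the paper: `zlib` is used as a stand-in compressor, the information in a copy given its exemplar is approximated as C(parent + child) - C(parent), and the helper names `compressed_size`, `conditional_cost`, and `tree_cost` are hypothetical. The alignment of the text files and the stochastic search over tree shapes mentioned in the abstract are not shown.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def conditional_cost(child: str, parent: str) -> int:
    """Approximate C(child | parent) as C(parent + child) - C(parent).

    Assumption: this concatenation trick is a common approximation and
    is not claimed to be the exact criterion used in the paper.
    """
    p = parent.encode("utf-8")
    c = child.encode("utf-8")
    return compressed_size(p + c) - compressed_size(p)


def tree_cost(texts: dict, parent_of: dict, root: str) -> int:
    """Total cost of a rooted tree: the root on its own, plus every
    other node described relative to its parent."""
    total = compressed_size(texts[root].encode("utf-8"))
    for node, parent in parent_of.items():
        total += conditional_cost(texts[node], texts[parent])
    return total


# Toy data (invented for illustration, not taken from the legend).
original = (
    "Henry was a bishop who travelled to Finland and preached to the "
    "people there. He was slain on the ice of a frozen lake in winter."
)
variant = original.replace("slain", "killed")
unrelated = (
    "The quick brown fox jumps over the lazy dog while seven wizards "
    "quietly mix a jug of bad liquor for the queen of the zephyrs."
)

texts = {"a": original, "b": variant}
cost_of_copy = conditional_cost(variant, original)
cost_of_unrelated = conditional_cost(unrelated, original)
cost_of_tree = tree_cost(texts, {"b": "a"}, root="a")
```

Under this sketch, a near-copy is cheap to describe given its exemplar, while an unrelated text is not. A search procedure would then prefer tree shapes whose total cost is low.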
