On the Reconstruction of Text Phylogeny Trees: Evaluation and Analysis of Textual Relationships

Throughout history, textual records have changed, sometimes through transcription mistakes and sometimes on purpose, as a way to rewrite facts and reinterpret history. Classical cases include logarithmic tables and the transmission of antique and medieval scholarship. Today, text documents are widely edited and redistributed on the Web. Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications and literary works are examples in which textual content can change through an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and to recover the history of changes that produced the whole set. Such functionality would have immediate applications in news tracking services, plagiarism detection, textual criticism, and copyright enforcement, for instance. However, this is not an easy task, as textual features pointing to the direction of the documents' evolution may not be evident and are often dataset dependent. Moreover, side information, such as timestamps, is not always available or reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees and for seamlessly exploring new approaches across a wide range of text reuse scenarios. We employ distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework and evaluate each approach with extensive experiments, including a set of artificial near-duplicate documents with known phylogeny and documents collected from Wikipedia, whose modifications were made by Internet users. We also present results from qualitative experiments in two applications: text plagiarism and the reconstruction of evolutionary trees for manuscripts (stemmatology).
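As a rough illustration of what pairing a dissimilarity measure with a reconstruction strategy can look like, the sketch below computes pairwise dissimilarities between documents and joins them into a tree. It is an assumption made for illustration, not the framework proposed in the paper: the measure is a normalized compression distance using zlib, the strategy is Kruskal's minimum spanning tree, and the helper names `ncd` and `minimum_spanning_tree` are hypothetical. The result is an undirected tree, so it does not identify the original document or the direction of the edits.

```python
import zlib
from itertools import combinations


def ncd(a: str, b: str) -> float:
    """Normalized compression distance, with zlib as the compressor (an assumption)."""
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def minimum_spanning_tree(docs):
    """Kruskal's algorithm over all document pairs.

    Returns a list of (i, j, weight) edges forming an undirected spanning tree.
    """
    parent = list(range(len(docs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sort every pair of documents by increasing dissimilarity.
    edges = sorted(
        (ncd(docs[i], docs[j]), i, j)
        for i, j in combinations(range(len(docs)), 2)
    )

    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # keep the edge only if it joins two components
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree


if __name__ == "__main__":
    # Toy near-duplicates, invented for this example.
    documents = [
        "The council approved the new budget on Monday after a long debate.",
        "The council approved the new budget on Monday after a lengthy debate.",
        "The city council approved the new budget Monday after a lengthy debate.",
    ]
    for i, j, weight in minimum_spanning_tree(documents):
        print(f"{i} -- {j}  (dissimilarity {weight:.3f})")
```

Recovering a rooted, directed phylogeny would additionally require an asymmetric dissimilarity or some other cue about the direction of change, which this sketch leaves out.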
