Reconstructing Textual Documents from n-grams

We analyze the problem of reconstructing documents when we only have access to the n-grams (for fixed n) and their counts from the original documents. Formally, we are interested in recovering the longest contiguous substrings whose presence in the original documents is certain. We map this problem onto a de Bruijn graph, where the n-grams form the edges and where every Eulerian cycle gives a plausible reconstruction. We define two rules that reduce this graph, preserving all possible reconstructions while increasing the length of the edge labels. From a theoretical perspective, we prove that the iterative application of these rules yields an irreducible graph equivalent to the original one. We then apply this method to data from Project Gutenberg to measure the number and size of the longest substrings obtained. Moreover, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low-frequency n-grams has little impact. Instead, we propose another method that strategically adds fictitious n-grams, and we show that a corpus noised in this way is much harder to reconstruct, while the perplexity of a language model trained on it increases only slightly.
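The mapping onto a de Bruijn graph can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`ngram_counts`, `de_bruijn_graph`, `eulerian_path`, `spell`) are hypothetical, the graph reduction rules are not shown, and for simplicity the sketch walks an Eulerian path from a known starting (n-1)-gram instead of closing the document into a cycle. Each n-gram becomes an edge from its (n-1)-token prefix to its (n-1)-token suffix, repeated once per count, and any walk that uses every edge exactly once spells a document with the same n-gram counts.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def de_bruijn_graph(counts):
    """Turn each n-gram into an edge from its (n-1)-prefix to its (n-1)-suffix.

    An n-gram that occurs c times contributes c parallel edges.
    """
    graph = defaultdict(list)
    for gram, count in counts.items():
        graph[gram[:-1]].extend([gram[1:]] * count)
    return graph


def eulerian_path(graph, start):
    """Hierholzer's algorithm: return a node sequence using every edge once.

    Assumes such a path starting at `start` exists (this is not checked).
    """
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused outgoing edge.
            stack.append(remaining[node].pop())
        else:
            # No unused edges left at this node: emit it.
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Read the token sequence off a path of overlapping (n-1)-grams."""
    return list(path[0]) + [node[-1] for node in path[1:]]


# Only one walk uses every bigram of this sentence, so it is fully recovered.
tokens = "to be or not to be".split()
counts = ngram_counts(tokens, 2)
rebuilt = spell(eulerian_path(de_bruijn_graph(counts), ("to",)))
print(" ".join(rebuilt))  # to be or not to be

# Here several walks exist: the result has the same bigram counts as the
# original, but it is not guaranteed to be the original document.
ambiguous = "a b a c a".split()
counts2 = ngram_counts(ambiguous, 2)
rebuilt2 = spell(eulerian_path(de_bruijn_graph(counts2), ("a",)))
```

The second example shows why only some substrings can be recovered with certainty: when several Eulerian walks exist, each is a plausible reconstruction, and only the parts common to all of them are guaranteed to appear in the original.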
