The bag-of-repeats representation of documents

n-gram representations of documents can improve on a simple bag-of-words representation by relaxing the word-independence assumption and introducing context. However, this comes at the cost of adding non-descriptive features and of increasing the dimension of the vector space model exponentially. We present new representations that avoid both pitfalls. They are based on sound theoretical notions from stringology and can be computed in optimal asymptotic time with algorithms that use data structures from the suffix family. While maximal repeats have been used in the past for similar tasks, we show how another equivalence class of repeats, largest-maximal repeats, obtains similar or better results with only a fraction of the features. This class acts as a minimal generative basis of all repeated substrings. We also report their use for topic modeling, where they yield models that are easier to interpret.
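To make the two classes of repeats concrete, the following is a minimal brute-force sketch in Python. It is not the optimal-time, suffix-structure algorithm mentioned above. It assumes the definitions commonly used in the stringology literature, which the abstract itself does not spell out: a repeat is a substring occurring at least twice; a maximal repeat is a repeat whose occurrences have at least two distinct left contexts and at least two distinct right contexts; a largest-maximal repeat is a repeat with at least one occurrence whose left and right contexts are both unique among all its occurrences. The function name `repeat_classes` and the toy string are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def repeat_classes(s):
    """Brute-force enumeration (cubic in len(s)); for illustration only.

    Returns three sets: all repeats, maximal repeats, largest-maximal repeats,
    under the definitions assumed in the text above.
    """
    n = len(s)
    # Map every substring to the list of its starting positions.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    repeats, maximal, largest_maximal = set(), set(), set()
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        repeats.add(w)
        # Context characters; None marks a string boundary (assumed to be
        # a context of its own, distinct from every real character).
        lefts = [s[i - 1] if i > 0 else None for i in starts]
        rights = [s[i + len(w)] if i + len(w) < n else None for i in starts]
        # Maximal: cannot be extended left or right without losing occurrences.
        if len(set(lefts)) > 1 and len(set(rights)) > 1:
            maximal.add(w)
        # Largest-maximal: some occurrence is context-unique on both sides.
        if any(lefts.count(l) == 1 and rights.count(r) == 1
               for l, r in zip(lefts, rights)):
            largest_maximal.add(w)
    return repeats, maximal, largest_maximal


repeats, maximal, largest_maximal = repeat_classes("xayxazqay")
print(sorted(repeats))          # ['a', 'ay', 'x', 'xa', 'y']
print(sorted(maximal))          # ['a', 'ay', 'xa']
print(sorted(largest_maximal))  # ['ay', 'xa']
```

On this toy string, "a" is a maximal repeat but not a largest-maximal one, because each of its occurrences shares a left or right context with another occurrence. The two largest-maximal repeats still cover every repeat as a substring, which is the sense in which the smaller class can stand in for the larger one as a feature set.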
