The bag-of-repeats representation of documents

n-gram representations of documents may improve over a simple bag-of-words representation by relaxing the word-independence assumption and introducing context. However, this comes at the cost of adding features that are non-descriptive and of increasing the dimension of the vector space model exponentially. We present new representations that avoid both pitfalls. They are based on sound theoretical notions from stringology and can be computed in optimal asymptotic time with algorithms that use data structures from the suffix family. While maximal repeats have been used in the past for similar tasks, we show that another equivalence class of repeats, largest-maximal repeats, obtains similar or better results with only a fraction of the features. This class acts as a minimal generative basis of all repeated substrings. We also report their use for topic modeling, where they yield models that are easier to interpret.

Matthias Gallé
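
To make the two classes of repeats named in the abstract concrete, the following is a minimal sketch, not the paper's method. It assumes the usual stringology definitions, which the abstract itself does not spell out: a repeat is maximal if extending it to the left or to the right loses occurrences, and it is largest-maximal if at least one of its occurrences has both a left and a right neighbouring character that no other occurrence shares. The function names are hypothetical, and the enumeration is brute force over all substrings rather than the suffix-structure algorithms the abstract refers to.

```python
from collections import defaultdict


def repeats(text):
    """Map every substring occurring at least twice to its start positions."""
    occ = defaultdict(list)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[text[i:j]].append(i)
    return {w: pos for w, pos in occ.items() if len(pos) >= 2}


def contexts(text, w, positions):
    """Left and right neighbouring characters of each occurrence of w.

    Text boundaries get a position-tagged sentinel so that a boundary
    never equals a real character or another boundary.
    """
    left = [text[p - 1] if p > 0 else ("^", p) for p in positions]
    right = [
        text[p + len(w)] if p + len(w) < len(text) else ("$", p)
        for p in positions
    ]
    return left, right


def maximal_repeats(text):
    """Repeats with at least two distinct left and two distinct right contexts."""
    result = set()
    for w, pos in repeats(text).items():
        left, right = contexts(text, w, pos)
        if len(set(left)) > 1 and len(set(right)) > 1:
            result.add(w)
    return result


def largest_maximal_repeats(text):
    """Repeats with an occurrence whose left and right contexts are both unique."""
    result = set()
    for w, pos in repeats(text).items():
        left, right = contexts(text, w, pos)
        if any(
            left.count(l) == 1 and right.count(r) == 1
            for l, r in zip(left, right)
        ):
            result.add(w)
    return result


if __name__ == "__main__":
    sample = "xabcxabdyabcyabd"
    print(sorted(maximal_repeats(sample)))
    print(sorted(largest_maximal_repeats(sample)))
```

In the sample string, "ab" is a maximal repeat, but each of its occurrences sits inside an occurrence of a longer repeat ("xab", "abc", "abd" or "yab"), so under the assumed definition it is not largest-maximal. This illustrates, on a toy scale, how the second class can yield fewer features than the first.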
