The Locally Weighted Bag of Words Framework for Document Representation

The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation cannot retain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex, thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation robustly captures medium- and long-range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability to various text processing tasks.
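The idea of a locally weighted histogram can be illustrated with a minimal sketch. The function below is not the construction used in the paper; it is an illustration under stated assumptions: a Gaussian weighting over normalized word positions (renormalized over the discrete positions), a small additive smoothing constant, and a fixed grid of locations. The names `lowbow_curve`, `sigma`, `num_points`, and `smooth` are hypothetical and chosen only for this example. Each returned vector is a point in the multinomial simplex, and the sequence of points traces a discretized curve through it.

```python
import math

def lowbow_curve(tokens, vocab, sigma=0.1, num_points=5, smooth=0.01):
    """Return num_points locally weighted word histograms.

    Each histogram is nonnegative and sums to 1, so it is a point in the
    multinomial simplex. All parameter choices here are illustrative.
    """
    index = {w: j for j, w in enumerate(vocab)}
    n = len(tokens)
    # Normalized position of each token in [0, 1].
    positions = [(i + 0.5) / n for i in range(n)]
    curve = []
    for k in range(num_points):
        # Location along the document at which the histogram is centered.
        mu = k / (num_points - 1) if num_points > 1 else 0.5
        # Gaussian weights centered at mu (assumed kernel), renormalized below.
        weights = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in positions]
        total = sum(weights)
        # Start from a small uniform mass so no word has zero probability.
        hist = [smooth] * len(vocab)
        for w, tok in zip(weights, tokens):
            hist[index[tok]] += w / total
        norm = sum(hist)
        curve.append([h / norm for h in hist])
    return curve

# Toy document: the first half uses only "a", the second half only "b".
tokens = "a a a a b b b b".split()
curve = lowbow_curve(tokens, vocab=["a", "b"], sigma=0.1, num_points=3)
# A very wide kernel weights all positions almost equally, so every point
# approaches the ordinary (smoothed) bag of words histogram.
flat = lowbow_curve(tokens, vocab=["a", "b"], sigma=1000.0, num_points=3)
```

In this toy case the first point of `curve` is dominated by "a" and the last by "b", a sequential trend that a single bag of words histogram (equal mass on both words) would not show. The kernel width `sigma` controls the trade-off: small values track local word usage, while large values collapse the curve toward the bag of words representation.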
