Mixed Graph of Terms: Beyond the Bags of Words Representation of a Text

The main purpose of text mining techniques is to identify common patterns by observing vectors of features and then to use those patterns to make predictions. Vectors of features are usually made up of weighted words, like those used in the text retrieval field, and rest on the assumption that a document can be treated as a "bag of words". In this paper we show that features more complex than simple weighted words can give greater accuracy in the analysis and discovery of common patterns. The proposed vector of features is a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be constructed automatically from a small set of documents using the probabilistic Topic Model. The graph has demonstrated its efficiency in a classic "ad-hoc" text retrieval problem, in which we expand the initial query with this new structured vector of features.
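To make the idea of a structured feature vector concrete, the following is a minimal Python sketch of a graph with an undirected and a directed sub-graph of words, used to expand a query. It is an illustration under stated assumptions, not the method of the paper: the abstract does not say how the two sub-graphs are connected or weighted, so the sketch assumes that undirected edges link "root" terms and directed edges point from a root term to subordinate words. The names `MixedGraphOfTerms` and `expand_query`, the edge weights, and the rule that multiplies weights along a path are all hypothetical. In the paper the graph is learned from documents with a topic model; here it is filled in by hand.

```python
from dataclasses import dataclass, field


@dataclass
class MixedGraphOfTerms:
    """Hypothetical container for a graph with two kinds of word-word edges."""

    # Undirected edges between root terms: frozenset({a, b}) -> weight (assumed layout).
    undirected: dict = field(default_factory=dict)
    # Directed edges from a root term to a subordinate word: (root, word) -> weight.
    directed: dict = field(default_factory=dict)

    def add_undirected(self, a, b, weight):
        self.undirected[frozenset((a, b))] = weight

    def add_directed(self, root, word, weight):
        self.directed[(root, word)] = weight


def expand_query(query_terms, graph, base_weight=1.0):
    """Return {term: weight} for the query terms plus their graph neighbours.

    The weighting rule (product of weights along the path, keeping the
    maximum per term) is an assumption made for this sketch.
    """
    query = set(query_terms)
    expanded = {term: base_weight for term in query}

    # Step 1: follow undirected edges from the query terms to other root terms.
    for pair, weight in graph.undirected.items():
        if pair & query:
            for term in pair - query:
                expanded[term] = max(expanded.get(term, 0.0), base_weight * weight)

    # Step 2: follow directed edges from every root reached so far.
    roots = dict(expanded)
    for (root, word), weight in graph.directed.items():
        if root in roots and word not in query:
            expanded[word] = max(expanded.get(word, 0.0), roots[root] * weight)

    return expanded


if __name__ == "__main__":
    # Hand-made toy graph; the words and weights are invented for illustration.
    g = MixedGraphOfTerms()
    g.add_undirected("retrieval", "query", 0.5)
    g.add_directed("retrieval", "ranking", 0.5)
    g.add_directed("query", "expansion", 0.25)
    print(expand_query(["retrieval"], g))
```

Unlike a flat bag of words, the expanded query keeps track of how each added term was reached, so terms found through two hops receive a lower weight than direct neighbours.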
