Intrinsic Dimensionality in Term and Document Representation Spaces

Examining the properties of the representation spaces used for documents or words in IR (typically R^n with n very large) provides valuable guidance for the retrieval process. Recently, several studies have shown that the real dimensionality of the data, called the intrinsic dimensionality, can be estimated at specific points of these spaces (Houle et al., 2012a). In this article, we revisit this notion of intrinsic dimension, in the form of an index denoted α, in the specific case of IR and study its practical use. More precisely, we show how its estimation from IR-style similarities can be exploited both in document representation spaces and in word representation spaces (Mikolov et al., 2013; Claveau et al., 2014). We show, on the one hand, that the α index helps to characterize difficult queries; on the other hand, in a query expansion task, we show how this notion of intrinsic dimensionality, applied to words, makes it possible to choose which terms to expand and which expansions to retain.

KEYWORDS: intrinsic dimensionality, RSV functions, distributional thesauri, query expansion.
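To make the notion concrete, here is a minimal Python sketch of how a local intrinsic dimensionality estimate can be computed at a query point from its nearest-neighbour distances, following the maximum-likelihood estimator of Levina and Bickel [9], which also underlies the dimensional testing framework of Houle et al. [18]. The α index defined in the article is derived from IR-type similarity (RSV) scores and may differ in its exact formulation; the function name, the cosine similarity, and the random data below are illustrative assumptions, not the authors' implementation.

import numpy as np

def local_intrinsic_dimensionality(distances, k=20):
    """Maximum-likelihood estimate of the local intrinsic dimensionality
    at a query point, from the distances to its k nearest neighbours
    (Levina & Bickel [9]); smaller values indicate a locally simpler space."""
    r = np.sort(np.asarray(distances, dtype=float))[:k]
    r = r[r > 0]                 # ignore exact duplicates of the query point
    w = r[-1]                    # distance to the k-th retained neighbour
    return -1.0 / np.mean(np.log(r / w))

# Hypothetical usage in a word-embedding space: `vectors` stands in for word
# representations (word2vec or distributional vectors) and `query` for the
# vector of a query term; both are random here, for illustration only.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 300))    # 5000 word vectors in R^300
query = rng.normal(size=300)

# IR-style similarity (cosine), turned into a dissimilarity for the estimator.
sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
alpha_hat = local_intrinsic_dimensionality(1.0 - sims, k=100)
print(f"local intrinsic dimensionality around the query term: {alpha_hat:.1f}")

The estimate is low when the neighbourhood of the query point behaves like a low-dimensional manifold and high when the neighbours are spread almost uniformly in all directions; it is this kind of local signal that the article relates to query difficulty and to the selection of expansion terms.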

[1] W. Bruce Croft et al. Combining the language model and inference network approaches to retrieval, 2004, Inf. Process. Manag.

[2] Bernhard Schölkopf et al. Nonlinear Component Analysis as a Kernel Eigenvalue Problem, 1998, Neural Computation.

[3] Tao Tao et al. Diagnostic Evaluation of Information Retrieval Models, 2011, TOIS.

[4] Ellen M. Voorhees et al. Query expansion using lexical-semantic relations, 1994, SIGIR '94.

[5] W. Bruce Croft et al. Evaluation of an inference network-based retrieval model, 1991, TOIS.

[6] Sanjay Chawla et al. Density-preserving projections for large-scale local anomaly detection, 2012, Knowledge and Information Systems.

[7] Vincent Claveau et al. Thésaurus distributionnels pour la recherche d'information et vice-versa, 2015, CORIA.

[8] Mark E. J. Newman et al. Power-Law Distributions in Empirical Data, 2007, SIAM Rev.

[9] Peter J. Bickel et al. Maximum Likelihood Estimation of Intrinsic Dimension, 2004, NIPS.

[10] Francis R. Bach et al. Online Learning for Latent Dirichlet Allocation, 2010, NIPS.

[11] W. Bruce Croft et al. Indri: A language-model based search engine for complex queries (extended version), 2005.

[12] Jarkko Venna et al. Local multidimensional scaling, 2006, Neural Networks.

[13] Denyse Baillargeon et al. Bibliographie, 1929.

[14] W. Bruce Croft et al. A Language Modeling Approach to Information Retrieval, 1998, SIGIR Forum.

[15] Olivier Ferret. Identifying Bad Semantic Neighbors for Improving Distributional Thesauri, 2013, ACL.

[16] ChengXiang Zhai et al. Lower-bounding term frequency normalization, 2011, CIKM '11.

[17] George A. Miller et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[18] Michael E. Houle et al. Dimensional Testing for Multi-step Similarity Search, 2012, IEEE 12th International Conference on Data Mining.

[19] Evgeniy Gabrilovich et al. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis, 2007, IJCAI.

[20] John D. Lafferty et al. A study of smoothing methods for language models applied to Ad Hoc information retrieval, 2001, SIGIR '01.

[21] Richard A. Harshman et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[22] Hisashi Kashima et al. Generalized Expansion Dimension, 2012, IEEE 12th International Conference on Data Mining Workshops.

[23] Geoffrey Zweig et al. Linguistic Regularities in Continuous Space Word Representations, 2013, NAACL.

[24] Michael E. Houle et al. Rank Cover Trees for Nearest Neighbor Search, 2013, SISAP.

[25] Stephen E. Robertson et al. Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive, 1998, TREC.

[26] C. J. van Rijsbergen et al. Probabilistic models of information retrieval based on measuring the divergence from randomness, 2002, TOIS.

[27] Vincent Claveau et al. Improving distributional thesauri by exploring the graph of neighbors, 2014, COLING.

[28] S. T. Roweis et al. Nonlinear dimensionality reduction by locally linear embedding, 2000, Science.