Graph-based term weighting for information retrieval

A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph whose vertices represent words and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph-theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach to (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank; Page et al. in The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998) to derive a weight for each vertex, i.e. a term weight, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline such as BM25 (Robertson et al. in NIST Special Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties such as the average path length or the clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC (Voorhees and Harman in TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.
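
To make the idea of graph-based term weights concrete, the following is a minimal sketch, not the authors' implementation. It assumes an undirected co-occurrence graph built with a sliding window and a plain PageRank power iteration; grammatical modification edges are left out. The function names, the window size, the damping factor of 0.85 and the iteration count are illustrative assumptions that do not come from the abstract. The sketch also computes one graph property mentioned above, the average clustering coefficient.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(tokens, window=3):
    """Undirected text graph: vertices are terms; an edge links two distinct
    terms that occur within `window` consecutive tokens (assumed definition)."""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        graph[term]  # ensure every term becomes a vertex, even if isolated
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph.
    Returns one weight per vertex, i.e. one weight per term."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Rank held by vertices without neighbours is spread uniformly.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for v in graph:
            incoming = sum(rank[u] / len(graph[u]) for u in graph[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


def average_clustering(graph):
    """Mean local clustering coefficient; vertices with fewer than two
    neighbours contribute 0."""
    if not graph:
        return 0.0
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


if __name__ == "__main__":
    text = "graph based term weights rank terms in a text graph"
    g = build_cooccurrence_graph(text.split())
    weights = pagerank(g)
    for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{w:.4f}")
    print("average clustering:", round(average_clustering(g), 4))
```

Because the PageRank weights of a document's terms sum to one, they are already on a common scale, which matches the intuition that no separate document length normalisation is strictly required. How these weights and graph properties are combined into a final ranking function is not specified in the abstract and is not attempted here.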
