Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval

Document coherence describes how much sense a text makes in terms of its logical organisation and discourse flow. Although coherence is difficult to quantify precisely, it can be approximated automatically. Such coherence modelling is interesting in itself and also useful for other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness [37]. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this, we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both standard keyword-based ranking and spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
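The information loss caused by a one-mode projection can be illustrated with a minimal sketch. It assumes the simplest, unweighted projection, in which two sentences are linked whenever they share at least one entity; weighted projections retain more information, and this sketch is not one of the three metrics proposed in this work. The toy documents, the entity names, and the helper functions `bipartite_edges` and `one_mode_projection` are hypothetical and serve illustration only.

```python
from itertools import combinations


def bipartite_edges(sentence_entities):
    """Return the sentence-entity edges of the bipartite graph."""
    return {(s, e) for s, ents in sentence_entities.items() for e in ents}


def one_mode_projection(sentence_entities):
    """Unweighted projection onto sentences (an assumed, simplest variant).

    Two sentences are linked if they mention at least one common entity.
    """
    edges = set()
    for s1, s2 in combinations(sorted(sentence_entities), 2):
        if sentence_entities[s1] & sentence_entities[s2]:
            edges.add((s1, s2))
    return edges


# Hypothetical toy documents: sentence id -> entities mentioned in it.
doc_a = {1: {"court"}, 2: {"court"}, 3: {"court"}}
doc_b = {1: {"court", "judge"}, 2: {"judge", "trial"}, 3: {"trial", "court"}}

proj_a = one_mode_projection(doc_a)
proj_b = one_mode_projection(doc_b)

# Both documents project to the same triangle of sentences ...
print(proj_a == proj_b)
# ... although their bipartite graphs differ in size and structure.
print(len(bipartite_edges(doc_a)), len(bipartite_edges(doc_b)))
```

In the first document a single entity links every sentence, whereas in the second each pair of sentences is linked by a different entity. The unweighted projection cannot tell the two apart, which is the kind of distinction that metrics computed directly on the bipartite graph can still observe.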

[1] W. Bruce Croft, et al. Quality-biased ranking of web documents, 2011, WSDM '11.

[2] Tapas Kanungo, et al. Predicting the readability of short web summaries, 2009, WSDM '09.

[3] Dilek Z. Hakkani-Tür, et al. Discovery of Topically Coherent Sentences for Extractive Summarization, 2011, ACL.

[4] Susan Gauch, et al. Incorporating quality metrics in centralized/distributed information retrieval on the World Wide Web, 2000, SIGIR '00.

[5] Graeme Hirst, et al. Encoding World Knowledge in the Evaluation of Local Coherence, 2015, HLT-NAACL.

[6] T. Snijders. The statistical evaluation of social network dynamics, 2001.

[7] Jakob Grue Simonsen, et al. Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application to IR, 2015, ICTIR.

[8] Matthieu Latapy, et al. Basic notions for the analysis of large two-mode networks, 2008, Soc. Networks.

[9] Scott Weinstein, et al. Centering: A Framework for Modeling the Local Coherence of Discourse, 1995, CL.

[10] W. Bruce Croft, et al. Document quality models for web ad hoc retrieval, 2005, CIKM '05.

[11] Regina Barzilay, et al. Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization, 2004, NAACL.

[12] Eduard H. Hovy, et al. A Model of Coherence Based on Distributed Sentence Representation, 2014, EMNLP.

[13] Christina Lioma, et al. Graph-based term weighting for information retrieval, 2011, Information Retrieval.

[14] Fabien Tarissan, et al. Analysing the first case of the International Criminal Court from a network-science perspective, 2016, J. Complex Networks.

[15] G. Harry McLaughlin, et al. SMOG Grading - A New Readability Formula, 1969.

[16] Hwee Tou Ng, et al. Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation, 2012, ACL.

[17] Camille Guinaudeau, et al. Graph-based Local Coherence Modeling, 2013, ACL.

[18] Enver Kayaaslan. On Enumerating All Maximal Bicliques of Bipartite Graphs, 2010, CTW.

[19] Martin G. Everett, et al. Network analysis of 2-mode data, 1997.

[20] R. Gunning. The Technique of Clear Writing, 1968.

[21] Iadh Ounis, et al. Extending Weighting Models with a Term Quality Measure, 2007, SPIRE.

[22] Matthieu Latapy, et al. Towards a bipartite graph modeling of the internet topology, 2013, Comput. Networks.

[23] Mark Newman, et al. Networks: An Introduction, 2010.

[24] Jean-Loup Guillaume, et al. Revealing intricate properties of communities in the bipartite structure of online social networks, 2015, 2015 IEEE 9th International Conference on Research Challenges in Information Science (RCIS).

[25] Ophir Frieder, et al. Disproving the fusion hypothesis: an analysis of data fusion via effective information retrieval strategies, 2003, SAC '03.

[26] Christina Lioma, et al. Part of speech n-grams and Information Retrieval, 2008.

[27] Renxian Zhang, et al. Sentence Ordering Driven by Local and Global Coherence for Summary Generation, 2011, ACL.

[28] Charles L. A. Clarke, et al. Efficient and effective spam filtering and re-ranking for large web datasets, 2010, Information Retrieval.

[29] Tore Opsahl. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients, 2013, Soc. Networks.

[30] Stephen E. Robertson, et al. Relevance weighting for query independent evidence, 2005, SIGIR '05.

[31] Mirella Lapata, et al. Modeling Local Coherence: An Entity-Based Approach, 2005, ACL.

[32] Michael Halliday, et al. Cohesion in English, 1976.

[33] R. P. Fishburne, et al. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, 1975.

[34] Min Zhang, et al. Topic-Based Coherence Modeling for Statistical Machine Translation, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[35] Christina Lioma, et al. Preliminary study of technical terminology for the retrieval of scientific book metadata records, 2012, SIGIR '12.

[36] Marc Najork, et al. Detecting spam web pages through content analysis, 2006, WWW '06.

[37] M. de Rijke, et al. Learning to Explain Entity Relationships in Knowledge Graphs, 2015, ACL.

[38] Evgeniy Gabrilovich, et al. To each his own: personalized content selection based on text comprehensibility, 2012, WSDM '12.

[39] R. Beaugrande, et al. Introduction to text linguistics, 1981.

[40] Yang Ding, et al. Lexical Chain Based Cohesion Models for Document-Level Statistical Machine Translation, 2013, EMNLP.

[41] Daraksha Parveen, et al. Integrating Importance, Non-Redundancy and Coherence in Graph-Based Extractive Summarization, 2015, IJCAI.

[42] Z. Di, et al. Clustering coefficient and community structure of bipartite networks, 2007, 0710.0117.

[43] Christina Lioma, et al. Rhetorical relations for information retrieval, 2012, SIGIR '12.

[44] M. Coleman, et al. A computer readability formula designed for machine scoring, 1975.

[45] Christina Lioma, et al. A Cascaded Classification Approach to Semantic Head Recognition, 2011, EMNLP.

[46] Santo Fortunato, et al. Community detection in graphs, 2009, ArXiv.

[47] G. M. McClure. Readability formulas: Useful or useless?, 1987, IEEE Transactions on Professional Communication.