Text-based measures of document diversity

Quantitative notions of diversity have been explored in disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents based on their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distances are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The approach offers several advantages over existing methods: it is fully data-driven, requiring only the text of a corpus of documents as input; it produces human-readable explanations; and it can be generalized to score the diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets which suggest that the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
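
The pipeline described above (per-document topic distributions, a topic-by-topic distance matrix, and a score that combines the two) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not give the exact formulas, so the sketch assumes a Jaccard-style co-occurrence distance between topics and a Rao-style quadratic form, sum over i, j of p_i * p_j * d(i, j), as the combining step. The function names, the `threshold` parameter, and the toy topic proportions are all hypothetical, and the topic distributions are taken as given rather than learned by a topic model.

```python
from itertools import combinations


def cooccurrence_distance(doc_topics, num_topics, threshold=0.1):
    """Assumed topic distance: 1 minus the Jaccard overlap of the sets of
    documents in which each topic appears with weight >= threshold."""
    present = [set() for _ in range(num_topics)]
    for doc_id, theta in enumerate(doc_topics):
        for topic, weight in enumerate(theta):
            if weight >= threshold:
                present[topic].add(doc_id)

    dist = [[0.0] * num_topics for _ in range(num_topics)]
    for i, j in combinations(range(num_topics), 2):
        union = present[i] | present[j]
        similarity = len(present[i] & present[j]) / len(union) if union else 0.0
        dist[i][j] = dist[j][i] = 1.0 - similarity
    return dist


def diversity(theta, dist):
    """Assumed Rao-style score: expected distance between two topics drawn
    independently from the document's topic distribution theta."""
    n = len(theta)
    return sum(theta[i] * theta[j] * dist[i][j] for i in range(n) for j in range(n))


# Hypothetical topic proportions for four documents over three topics.
doc_topics = [
    [0.5, 0.5, 0.0],  # mixes two topics that often co-occur
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],  # single-topic document
    [0.5, 0.0, 0.5],  # mixes two topics that rarely co-occur
]
dist = cooccurrence_distance(doc_topics, num_topics=3)
scores = [diversity(theta, dist) for theta in doc_topics]
print(scores)
```

Under these assumptions, a document concentrated on a single topic scores zero, and a document that mixes rarely co-occurring topics scores higher than one that mixes topics commonly found together. The per-pair terms p_i * p_j * d(i, j) also indicate which topic pairs drive a document's score, which is one way a human-readable explanation could be derived.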
