CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central

Citation-based similarity measures such as Bibliographic Coupling and Co-Citation are an integral component of many information retrieval systems. However, comparing the strengths and weaknesses of these measures is challenging because suitable test collections are lacking. This paper presents CITREC, an open evaluation framework for citation-based and text-based similarity measures. CITREC prepares the data of the PubMed Central Open Access Subset and the TREC Genomics collection for citation-based analysis and provides the tools necessary to evaluate similarity measures. To accommodate different evaluation purposes, CITREC implements 35 citation-based and text-based similarity measures and features two gold standards for gauging similarity: the first uses the Medical Subject Headings (MeSH) thesaurus, the second the expert relevance feedback that is part of the TREC Genomics collection. CITREC additionally offers a system for creating user-defined gold standards, so that the evaluation framework can be adapted to individual information needs and evaluation purposes.
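To illustrate the two citation-based measures named above, the following minimal sketch shows how Bibliographic Coupling and Co-Citation strength are commonly computed from a citation graph. This is an illustrative example, not CITREC's actual implementation; the function names, the `cites` data structure, and the toy document IDs are assumptions chosen for the example.

# Illustrative sketch (not CITREC's code): Bibliographic Coupling counts the
# references two documents share; Co-Citation counts the documents that cite
# both. `cites` maps each document ID to the set of document IDs it references.

def bibliographic_coupling(cites, a, b):
    """Number of references shared by documents a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(cites, a, b):
    """Number of documents whose reference lists contain both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

if __name__ == "__main__":
    # Toy citation data; IDs are hypothetical, not taken from PubMed Central.
    cites = {
        "d1": {"d3", "d4", "d5"},
        "d2": {"d4", "d5", "d6"},
        "d7": {"d1", "d2"},
        "d8": {"d1", "d2", "d3"},
    }
    print(bibliographic_coupling(cites, "d1", "d2"))  # 2 (shared references d4, d5)
    print(co_citation(cites, "d1", "d2"))             # 2 (cited together by d7, d8)

In practice such raw counts are often normalized, for example by taking the cosine of the documents' citation vectors, before they are compared against a gold standard; the sketch keeps the unnormalized counts for clarity.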
