Quality Evaluation for Document Relation Discovery Using Citation Information

Assessing discovered patterns is a central issue in knowledge discovery. This paper presents an evaluation method that uses citation (reference) information to assess the quality of discovered document relations. Citation is treated as transitive, so that both direct and indirect citations count, and a series of evaluation criteria is introduced on this basis to define the validity of discovered relations. Two kinds of validity, called soft validity and hard validity, are proposed to express the quality of the discovered relations. For impartial comparison, the expected validity is estimated statistically from the generative probability of each relation pattern. The proposed evaluation is investigated using more than 10,000 documents obtained from a research publication database. With frequent itemset mining used to discover the document relations, the proposed method proved effective for evaluating the relations in four respects: soft/hard scoring, direct/indirect citation, relative quality over the expected value, and comparison to human judgment.
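To make the idea of scoring a discovered document set against direct and indirect citations concrete, the following is a minimal sketch. The abstract does not give the formal definitions, so everything here is an assumption made for illustration only: citations are treated as undirected links, two documents count as related "within order v" if they are at most v citation hops apart, soft validity is taken to be the fraction of document pairs in the set that are related, and hard validity is taken to be 1 only when every pair is related. The function names and the toy data are hypothetical and may differ from the paper's actual criteria.

```python
from collections import deque
from itertools import combinations


def undirected(citations):
    """Build an undirected link graph from {citing_doc: [cited_doc, ...]}."""
    graph = {}
    for citing, cited_list in citations.items():
        for cited in cited_list:
            graph.setdefault(citing, set()).add(cited)
            graph.setdefault(cited, set()).add(citing)
    return graph


def within_order(graph, src, dst, order):
    """Return True if dst is at most `order` citation hops from src (BFS)."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == order:
            continue  # do not expand beyond the allowed number of hops
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False


def soft_validity(graph, docs, order):
    """Assumed definition: fraction of document pairs linked within `order` hops."""
    pairs = list(combinations(sorted(docs), 2))
    if not pairs:
        raise ValueError("a document relation needs at least two documents")
    linked = sum(within_order(graph, a, b, order) for a, b in pairs)
    return linked / len(pairs)


def hard_validity(graph, docs, order):
    """Assumed definition: 1.0 only if every pair is linked, otherwise 0.0."""
    return 1.0 if soft_validity(graph, docs, order) == 1.0 else 0.0


# Toy citation data (hypothetical): A cites B, B cites C, D cites nothing.
citations = {"A": ["B"], "B": ["C"], "D": []}
graph = undirected(citations)

relation = {"A", "B", "C"}
soft_direct = soft_validity(graph, relation, order=1)    # direct citations only
soft_indirect = soft_validity(graph, relation, order=2)  # allows one intermediate document
hard_direct = hard_validity(graph, relation, order=1)
hard_indirect = hard_validity(graph, relation, order=2)
```

In this toy case, A and C are linked only through B, so the set scores lower when only direct citations are counted than when indirect citations of order 2 are allowed. The all-or-nothing hard score shows the same contrast more sharply than the fractional soft score.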
