Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation

We created and characterized two similarity metrics for relating any two Medical Subject Headings (MeSH terms) to each other. The article-based metric measures the tendency of two MeSH terms to appear in the MEDLINE record of the same article. The author-based metric measures the tendency of two MeSH terms to appear in the body of articles written by the same individual (using the 2009 Author-ity author name disambiguation dataset as a gold standard). The two metrics are only modestly correlated with each other (r = 0.50), indicating that they capture different aspects of term usage. The article-based metric provides a measure of semantic relatedness, and MeSH term pairs that co-occur more often than expected by chance may reflect relations between the two terms. In contrast, the author-based metric is indicative of how individuals practice science, and may have value for author name disambiguation and studies of scientific discovery. We calculated article-based metrics for all MeSH terms appearing in at least 25 articles in MEDLINE (as of 2014) and author-based metrics for MeSH terms in articles published as of 2009. The dataset is freely available for download and can be queried at http://arrowsmith.psych.uic.edu/arrowsmith_uic/mesh_pair_metrics.html.
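The abstract does not state the formulas behind either metric, so the following is only a minimal sketch of the general idea: it scores each MeSH term pair by how often the pair co-occurs relative to what independent usage would predict. The observed-over-expected ratio, the function name `cooccurrence_ratios`, and the toy articles and author mapping are all assumptions made for illustration, not the published method or data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(term_sets):
    """Return observed/expected co-occurrence ratios for every term pair.

    term_sets: list of sets, each holding the MeSH terms of one unit
    (one article, or the pooled articles of one author).
    Assumed scoring: observed pair count divided by the count expected
    if the two terms were used independently of each other.
    """
    n = len(term_sets)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in term_sets:
        term_counts.update(terms)
        # Sort so each unordered pair always maps to the same key.
        pair_counts.update(combinations(sorted(terms), 2))
    ratios = {}
    for (a, b), observed in pair_counts.items():
        expected = term_counts[a] * term_counts[b] / n
        ratios[(a, b)] = observed / expected
    return ratios


# Toy data, invented for illustration (not drawn from MEDLINE).
articles = [
    {"Neurons", "Synapses"},
    {"Neurons", "Synapses"},
    {"Neurons", "Mice"},
    {"Mice", "Diet"},
]
# Hypothetical mapping from disambiguated author to article indices.
authors = {"author_1": [0, 1, 2], "author_2": [3]}

# Article-level view: each article is one unit.
article_ratios = cooccurrence_ratios(articles)

# Author-level view: pool the terms of all articles by the same author.
author_units = [
    set().union(*(articles[i] for i in indices))
    for indices in authors.values()
]
author_ratios = cooccurrence_ratios(author_units)
```

In this toy data, "Mice" and "Synapses" never share an article, so the pair has no article-level score, yet it does receive an author-level score because one author used both terms in different articles. That is the kind of divergence that would leave the two metrics only modestly correlated.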
