Tracing the Paths between Concepts in Large Bio-Medical Corpora

Language undergoes a continuous process of change, both at the semantic level, where existing words acquire new meanings, and at the lexical level, where new concepts appear and old ones disappear or fall out of frequent use. New words (terms or concepts) may be added as a result of scientific discoveries or socio-cultural influences, while other words are "forgotten" or assigned alternative meanings. Such changes in a vocabulary usually mark important shifts in the environment or domain in which it is used. For experts the connection between a new concept and some of the existing ones is evident, but for non-experts these relations remain hidden and need to be identified. In the medical domain in particular, new terms appear as a result of new discoveries, and establishing the connections between different concepts becomes an important challenge. It is equally important to detect whether such a relation exists at all. In this paper, we present a graph-based approach to identify the semantic path (a chain of semantically related words) between concepts that appeared in the biomedical publications available in the PubMed corpus over a period of 20 years.
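
To make the notion of a semantic path concrete, the following is a minimal sketch and not the method of the paper. It assumes a graph whose nodes are terms and whose edges link terms that co-occur in the same document, with an edge cost of 1/count so that frequently co-occurring terms are "closer"; the path is then found with Dijkstra's shortest-path algorithm. The function names, the cost function, and the toy documents are all hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_graph(documents):
    """Build an undirected term graph from lists of terms (one list per document).

    Assumption: the edge cost is 1 / (number of documents in which both terms
    occur), so terms that co-occur more often are cheaper to traverse.
    """
    counts = defaultdict(int)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), n in counts.items():
        graph[a][b] = graph[b][a] = 1.0 / n
    return graph


def semantic_path(graph, source, target):
    """Return the cheapest chain of terms from source to target, or None."""
    best = {source: 0.0}   # lowest known cost to reach each term
    prev = {}              # predecessor of each term on its cheapest path
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, cost in graph.get(node, {}).items():
            new_dist = dist + cost
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                prev[neighbour] = node
                heapq.heappush(heap, (new_dist, neighbour))
    return None  # no relation found between the two concepts


# Hypothetical toy corpus: each inner list holds the terms found in one document.
docs = [
    ["aspirin", "inflammation"],
    ["aspirin", "inflammation"],
    ["inflammation", "cytokine"],
    ["cytokine", "interleukin"],
    ["aspirin", "headache"],
]
print(semantic_path(build_graph(docs), "aspirin", "interleukin"))
```

A return value of `None` corresponds to the case mentioned above in which no relation between two concepts can be established. Restricting `documents` to a given publication year would be one way, again an assumption, to compare paths across the 20-year period.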
