Unsupervised mining of frequent tags for clinical eligibility text indexing

Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines because it is difficult to parse accurately. This paper proposes a novel method for deriving a semantic index for clinical eligibility documents from a controlled vocabulary of frequent tags that are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically in number as the repository grows. We propose applying the semantic index to filter clinical trial search results, and we conclude that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text yields an effective semantic index for clinical eligibility documents and promotes their computational reuse.
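
The general idea of indexing documents by frequent tags and then filtering search results with them can be illustrated with a minimal sketch. This is not the pipeline used in the paper: the word n-gram candidates, the document-frequency threshold `min_fraction`, the toy criteria texts, and all function names are assumptions made for illustration only, and no UMLS mapping is attempted.

```python
from collections import Counter, defaultdict


def candidate_tags(text, max_n=2):
    """Yield lowercase word n-grams (n = 1..max_n) as naive candidate tags."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def mine_frequent_tags(docs, min_fraction=0.6):
    """Keep candidates that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in docs.values():
        # Count each candidate once per document (document frequency).
        doc_freq.update(set(candidate_tags(text)))
    threshold = min_fraction * len(docs)
    return {tag for tag, count in doc_freq.items() if count >= threshold}


def build_index(docs, tags):
    """Map each frequent tag to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tag in set(candidate_tags(text)) & tags:
            index[tag].add(doc_id)
    return index


def filter_results(index, result_ids, tag):
    """Restrict a list of search results to documents indexed under tag."""
    return set(result_ids) & index.get(tag, set())


# Hypothetical eligibility criteria texts, invented for this example.
docs = {
    "T1": "Diagnosis of type 2 diabetes; no prior insulin therapy",
    "T2": "Type 2 diabetes with hypertension",
    "T3": "History of hypertension; no insulin therapy",
}

tags = mine_frequent_tags(docs)
index = build_index(docs, tags)
filtered = filter_results(index, ["T1", "T2", "T3"], "hypertension")
```

In this toy setting, function words such as "of" also pass the frequency threshold, so a realistic version would need stop-word handling or concept normalization before the tags could serve as a controlled vocabulary.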
