Similarity-based scoring method for classification of Health Informatics content

Objective: Digital repositories in Health Informatics (HI) have grown considerably in architecture and complexity. Effective information retrieval therefore requires different kinds of information treatment and representation, such as automatic content classification. This study presents the results of a procedure for automatically classifying scientific articles in HI using a specialized thesaurus. Design: Statistical, vector-based, and artificial intelligence methods were applied to classify HI-related content. Articles extracted from HI and Health journals, together with a specialized HI thesaurus, were used to apply the methods and evaluate the results. Measurements: Statistical procedures and measures of accuracy, precision, recall, area under the ROC curve, and the F1 measure (the combination of precision and recall) were used to assess the degree of similarity between the terms of the specialized HI thesaurus and the selected articles. Results: The accuracy achieved was 0.87, the F1 measure was 0.87, and the area under the ROC curve was 0.94. Conclusion: The results were positive, showing that a specialized Health Informatics thesaurus, used in conjunction with these methods, allows articles to be classified into the areas of Health Informatics and Health.
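The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of one way a thesaurus-based similarity score and the reported evaluation measures could be computed. It assumes term-frequency vectors compared by cosine similarity and a fixed decision threshold; the function names, the threshold value, and the sample thesaurus terms are all hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_article(article_text, thesaurus_terms):
    """Similarity between an article and the pooled thesaurus terms."""
    article_vec = Counter(tokenize(article_text))
    thesaurus_vec = Counter(tokenize(" ".join(thesaurus_terms)))
    return cosine_similarity(article_vec, thesaurus_vec)


def classify(article_text, thesaurus_terms, threshold=0.2):
    """Label an article as HI when its score reaches the threshold.

    The threshold of 0.2 is an arbitrary illustrative value.
    """
    return score_article(article_text, thesaurus_terms) >= threshold


def evaluate(predicted, actual):
    """Accuracy, precision, recall, and F1 for boolean label lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical thesaurus terms and article snippets, for illustration only.
    terms = ["electronic health record", "telemedicine", "clinical decision support"]
    articles = [
        "Telemedicine and electronic health record adoption in clinics",
        "Surgical outcomes after knee replacement",
    ]
    labels = [True, False]
    predictions = [classify(a, terms) for a in articles]
    print(evaluate(predictions, labels))
```

In this sketch the F1 measure is the harmonic mean of precision and recall, which is the usual reading of "combination of precision and recall". The area under the ROC curve would additionally require ranking articles by their raw scores rather than thresholded labels, and is left out here.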
