Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey

In this survey paper, we discuss biomedical ontologies and the major text mining techniques applied to biomedicine and healthcare. Biomedical ontologies such as UMLS are being adopted in text mining because they supply domain knowledge and help resolve many of the linguistic problems that arise when processing biomedical literature. As the first example of text mining, we survey document clustering. Because a document set normally covers multiple topics, text mining approaches use document clustering as a preprocessing step to group similar documents. Document clustering can also inform the biomedical literature searches required for the practice of evidence-based medicine. We then introduce Swanson's UnDiscovered Public Knowledge (UDPK) model, which generates biomedical hypotheses from biomedical literature such as MEDLINE by discovering novel connections among logically related biomedical concepts. Another important area of text mining is document classification, a valuable tool for biomedical tasks that involve large amounts of text; we survey well-known classification techniques in biomedicine. As the last example of text mining in biomedicine and healthcare, we survey information extraction, the process of scanning text for information relevant to some interest, including the extraction of entities, relations, and events. We also address techniques and issues in evaluating text mining applications in biomedicine and healthcare.
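The idea behind the UDPK model can be illustrated with a minimal sketch. It assumes a simple co-occurrence reading of "logically related": a start concept A is linked to intermediate concepts B that appear in the same documents, and a target concept C is proposed as a hypothesis when it co-occurs with some B but never directly with A. The function names (`cooccurrence`, `open_discovery`) and the four toy "documents" are hypothetical, made up for illustration; they echo the well-known fish oil and Raynaud's disease example and are not taken from MEDLINE or from the survey itself.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Map each concept to the set of concepts it shares a document with."""
    links = defaultdict(set)
    for concepts in docs:
        for a, b in combinations(set(concepts), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def open_discovery(docs, start):
    """Return {C: set of bridging B concepts} for concepts C that are
    reachable from `start` through some B but never co-occur with it."""
    links = cooccurrence(docs)
    direct = links[start]  # B concepts: directly connected to the start
    hypotheses = defaultdict(set)
    for b in direct:
        for c in links[b]:
            # Keep only C concepts with no direct link to the start concept.
            if c != start and c not in direct:
                hypotheses[c].add(b)
    return dict(hypotheses)


# Toy corpus: each document is reduced to the set of concepts it mentions.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "Raynaud's disease"},
    {"platelet aggregation", "Raynaud's disease"},
]

result = open_discovery(docs, "fish oil")
print(result)
```

On this toy corpus the only candidate is "Raynaud's disease", bridged by both intermediate concepts. A practical system would likely map text to ontology concepts (for example UMLS) first and rank candidates rather than return every indirect link.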
