Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges

Text is a very important type of data within the biomedical domain. For example, patient records contain large amounts of text entered in a non-standardized format, which poses many challenges for processing such data. For the clinician, the written text of the medical findings, rather than images or multimedia data, is still the basis for decision making. However, the steadily increasing volumes of unstructured information call for machine learning approaches to data mining, in this case text mining. This paper provides a concise overview of selected text mining methods, focusing on statistical methods, namely Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation, Hierarchical Latent Dirichlet Allocation, Principal Component Analysis, and Support Vector Machines, along with some examples from the biomedical domain. Finally, we present some open problems and future challenges, particularly from the clinical domain, that we expect to stimulate future research.
