Named Entity Recognition: A Survey of Machine-Learning Tools

The rapid growth of the Internet has led to huge amounts of information being generated and shared in the form of text, images, video and sound. The same surge is seen in specific areas such as biomedicine, where the number of published documents, such as articles, books and technical reports, is increasing exponentially. For instance, the MEDLINE literature database contains over 20 million references to journal papers, covering a wide range of biomedical fields. To organize and manage these data, several manual curation efforts have been set up to identify, in texts, information about entities (e.g. genes and proteins) and their relations (e.g. protein-protein interactions). The extracted information is stored in structured knowledge resources, such as Swiss-Prot [1] and GenBank [2]. However, continually updating these databases is a very demanding and expensive task, which has led to increasing interest in applying Text Mining (TM) systems to help perform it.

[1] Sérgio Matos, et al. Current Methodologies for Biomedical Named Entity Recognition, 2013.

[2] Dietrich Rebholz-Schuhmann, et al. Harmonization of gene/protein annotations: towards a gold standard MEDLINE, 2012, Bioinform.

[3] Ulf Leser, et al. ChemSpot: a hybrid system for chemical named entity recognition, 2012, Bioinform.

[4] René Witte, et al. OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents, 2011, Bioinform.

[5] Alberto Lavelli, et al. Assessing the practical usability of an automatically annotated corpus, 2011, Linguistic Annotation Workshop.

[6] Xiaohua Hu, et al. A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining, 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[7] Alberto Lavelli, et al. Disease Mention Recognition with Specific Features, 2010, BioNLP@ACL.

[8] Goran Nenadic, et al. LINNAEUS: A species name identification system for biomedical literature, 2010, BMC Bioinformatics.

[9] Dietrich Rebholz-Schuhmann, et al. Calbc Silver Standard Corpus, 2010, J. Bioinform. Comput. Biol.

[10] Richard Tzong-Han Tsai, et al. Overview of BioCreative II gene mention recognition, 2008, Genome Biology.

[11] Chun-Nan Hsu, et al. Integrating high dimensional bi-directional parsing models for gene mention tagging, 2008, ISMB.

[12] Martin Hofmann-Apitius, et al. Detection of IUPAC and IUPAC-like chemical names, 2008, ISMB.

[13] Gideon S. Mann, et al. Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields, 2008, ACL.

[14] Elena Beisswanger, et al. Semantic Annotations for Biology: a Corpus Development Initiative at the Jena University Language & Information Engineering (JULIE) Lab, 2008, LREC.

[15] Dietrich Rebholz-Schuhmann, et al. Assessment of disease named entity recognition on a corpus of annotated sentences, 2008, BMC Bioinformatics.

[16] Graciela Gonzalez, et al. BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition, 2007, Pacific Symposium on Biocomputing.

[17] Gideon S. Mann, et al. Efficient Computation of Entropy Gradient for Semi-Supervised Conditional Random Fields, 2007, NAACL.

[18] Tong Zhang, et al. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, 2005, J. Mach. Learn. Res.

[19] Sophia Ananiadou, et al. Developing a Robust Part-of-Speech Tagger for Biomedical Text, 2005, Panhellenic Conference on Informatics.

[20] Lorraine K. Tanabe, et al. GENETAG: a tagged corpus for gene/protein named entity recognition, 2005, BMC Bioinformatics.

[21] Nigel Collier, et al. Introduction to the Bio-entity Recognition Task at JNLPBA, 2004, NLPBA/BioNLP.

[22] Barbara Rosario, et al. Classifying Semantic Relations in Bioscience Texts, 2004, ACL.

[23] Seth Kulick, et al. Integrated Annotation for Biomedical Information Extraction, 2004, HLT-NAACL 2004.

[24] Jun'ichi Tsujii, et al. GENIA corpus - a semantically annotated corpus for bio-textmining, 2003, ISMB.

[25] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[26] Andrew McCallum, et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[27] Ayhan Demiriz, et al. Semi-Supervised Support Vector Machines, 1998, NIPS.

[28] A. Bairoch, et al. The SWISS-PROT protein sequence data bank, 1991, Nucleic Acids Research.

[29] L. Baum, et al. Statistical Inference for Probabilistic Functions of Finite State Markov Chains, 1966.

[30] Martin Hofmann-Apitius, et al. An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature, 2010, LREC 2010.

[31] Martin Hofmann-Apitius, et al. Chemical Names: Terminological Resources and Corpora Annotation, 2008, LREC 2008.

[32] Rie Kubota Ando, et al. BioCreative II Gene Mention Tagging System at IBM Watson, 2007.

[33] B. Carpenter, et al. LingPipe for 99.99% Recall of Gene Mentions, 2007.

[34] Udo Hahn, et al. A Reappraisal of Sentence and Token Splitting for Life Sciences Documents, 2007, MedInfo.

[35] D. Ph., et al. A Comparison of 13 Tokenizers on MEDLINE, 2007.

[36] Hong Yu, et al. Towards Answering Biological Questions with Experimental Evidence: Automatically Identifying Text that Summarize Image Content in Full-Text Articles, 2006, AMIA.

[37] Hanna M. Wallach, et al. Conditional Random Fields: An Introduction, 2004.

[38] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[39] Vladimir Naumovich Vapnik. The Nature of Statistical Learning Theory, 1995.

[40] Burr Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text, 2005, Bioinform.