Boosting performance of gene mention tagging system by hybrid methods

Named entity recognition (NER) in biomedical literature is an actively studied problem in natural language processing (NLP). To achieve higher performance, we present a hybrid experimental framework for the gene mention tagging task. Six classifiers are first constructed with four toolkits (CRF++, YamCha, Maximum Entropy (ME) and MALLET) using different training methods and feature sets, and are then combined by three hybrid methods: a simple set operation method, a voting method and a two-layer stacking method. Experiments on the BioCreative II GM task corpus show that the three hybrid methods achieve F-measures of 87.40%, 87.31% and 87.70%, respectively, without any post-processing, all higher than that of any single classifier. Our best hybrid method (two-layer stacking) achieves an F-measure of 88.42% after post-processing, which outperforms most state-of-the-art systems. We also discuss how the number, performance and divergence of the single classifiers affect the ensemble system in each hybrid method, and analyze why our hybrid models improve performance.
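
To make the set operation and voting ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes each classifier's output is a set of predicted mentions encoded as (sentence_id, start, end) tuples. That encoding, the function names, and the toy data are all hypothetical and chosen only for illustration. The two-layer stacking method is not shown, since it requires training a second-level learner on the first-level outputs.

```python
from collections import Counter

# Assumption: a gene mention is a (sentence_id, start, end) tuple and each
# classifier's output is a set of such tuples. Names here are hypothetical.

def combine_union(predictions):
    """Set operation (union): keep a mention if any classifier proposes it."""
    return set().union(*predictions)

def combine_intersection(predictions):
    """Set operation (intersection): keep a mention only if all agree."""
    return set.intersection(*map(set, predictions))

def combine_vote(predictions, min_votes):
    """Voting: keep a mention proposed by at least min_votes classifiers."""
    counts = Counter(m for pred in predictions for m in set(pred))
    return {m for m, c in counts.items() if c >= min_votes}

def f_measure(predicted, gold):
    """Balanced F-measure using exact-match mentions."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy data (made up for illustration, not taken from BioCreative II).
clf_a = {("s1", 0, 4), ("s1", 10, 15)}
clf_b = {("s1", 0, 4), ("s2", 3, 8)}
clf_c = {("s1", 0, 4), ("s1", 10, 15), ("s2", 20, 25)}
gold = {("s1", 0, 4), ("s1", 10, 15), ("s2", 3, 8)}

outputs = [clf_a, clf_b, clf_c]
union = combine_union(outputs)
common = combine_intersection(outputs)
voted = combine_vote(outputs, min_votes=2)

for name, result in [("union", union), ("intersection", common), ("vote", voted)]:
    print(name, round(f_measure(result, gold), 3))
```

On this toy data the union favours recall and the intersection favours precision, while voting sits between the two. This is the usual trade-off when combining taggers that make different errors.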
