Analysis and Enhancement of Conditional Random Fields Gene Mention Taggers in BioCreative II Challenge Evaluation

Background: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In the BioCreative 2 challenge, the conditional random fields (CRF) model was the most prevalent method in the gene mention task. In this paper, we analyze the two best-performing CRF-based systems in BioCreative 2. We examine their key claims and propose enhancements based on the analysis results.

Results: We implemented their systems in MALLET as specified in their report, and also in CRF++, a different CRF package, to analyze their claims empirically. We found that their feature set is effective for models trained with MALLET, but that a smaller set works better for models trained with CRF++. We confirmed the effectiveness of pairing parentheses as a post-processing step. We found that backward parsing is not always superior to forward parsing; the benefit of bidirectional parsing is that it creates a wider variety of complementary models. We elaborated the notion of divergent models by relating it to the difference between the increments of true positives and false positives of the union model.

Conclusions: To further enhance performance, more models can be integrated, using the elaborated notion of divergent models to minimize the number of models required.
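The increments of true positives and false positives of a union model, mentioned in the Results, can be illustrated with a minimal sketch. The abstract does not give the exact criterion, so the function name `union_gain`, the span representation `(sentence_id, start, end)`, and the toy data below are all hypothetical; this shows only how such increments might be counted when a second model's predictions are added to a first.

```python
def union_gain(base, other, gold):
    """Return (delta_tp, delta_fp) obtained by adding `other`'s
    predicted mentions to `base`'s predictions (set union)."""
    base, other, gold = set(base), set(other), set(gold)
    new = other - base            # mentions only the second model proposes
    delta_tp = len(new & gold)    # additional true positives
    delta_fp = len(new - gold)    # additional false positives
    return delta_tp, delta_fp


# Toy data (hypothetical): each mention is a (sentence_id, start, end) span.
gold = {(0, 0, 2), (0, 5, 6), (1, 3, 4)}
forward = {(0, 0, 2), (1, 7, 8)}                # e.g. a forward-parsing model
backward = {(0, 0, 2), (0, 5, 6), (1, 9, 9)}    # e.g. a backward-parsing model

delta_tp, delta_fp = union_gain(forward, backward, gold)
print(delta_tp, delta_fp)
```

Under this reading, a second model is worth adding to the union when it contributes noticeably more new true positives than new false positives; how the paper weighs the two is not stated in the abstract.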
