LingPipe for 99.99% Recall of Gene Mentions

Text data mining over biomedical research literature is a needle-in-a-haystack problem. We contend that first-best methods performing at 90% F-measure are insufficient, especially given that performance is much worse for “unseen” phrases. In this paper, we recast the problem as one of n-best search rather than first-best database population. We describe LingPipe’s chunkers, which are based on HMMs and character language models and which extract mentions of genes in unseen MEDLINE abstracts at 99.99% recall with greater than 50% mean average precision. We provide evaluation results in terms of received precision-recall curves on unseen data.
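
To make the evaluation measures concrete, the following minimal sketch computes recall at a cutoff and average precision for a single ranked n-best list of candidate mentions. It is not LingPipe code and not the evaluation code behind the reported results: the function names `recall_at_n` and `average_precision`, and the toy candidate list, are assumptions made for illustration. The sketch also assumes each candidate appears at most once in the ranking. Mean average precision would then be the mean of `average_precision` over all evaluated lists.

```python
from typing import List, Set


def recall_at_n(ranked: List[str], gold: Set[str], n: int) -> float:
    """Fraction of gold mentions found among the top-n ranked candidates."""
    if not gold:
        return 0.0
    return len(set(ranked[:n]) & gold) / len(gold)


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Average of the precision values taken at each rank holding a gold mention.

    Gold mentions that never appear in the ranking contribute zero,
    because the sum is divided by the total number of gold mentions.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold)


# Toy data (hypothetical): an n-best list ordered from most to least confident.
ranked = ["p53", "the", "BRCA1", "cell", "insulin"]
gold = {"p53", "BRCA1", "insulin"}

first_best_recall = recall_at_n(ranked, gold, 1)  # only the top candidate
n_best_recall = recall_at_n(ranked, gold, 5)      # the full n-best list
avg_precision = average_precision(ranked, gold)
```

On this toy list the first-best output recovers one of the three gold mentions, while the full n-best list recovers all of them at the cost of lower precision further down the ranking, which is the trade-off the n-best formulation exposes.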