Thresholding Strategies for Text Classifiers: TREC 2005 Biomedical Triage Task Experiments

We participated in the biomedical document triage task of the TREC 2005 Genomics Track. In this paper we describe the methods we developed for the four triage subtasks. Logistic regression and support vector machine classifiers were first trained to produce ranked lists of the test documents. A subset of the test documents was then labeled positive by selecting the top k documents from each ranked list. Choosing a suitable value for k requires a good thresholding strategy. We first describe two thresholding strategies, one based on logistic regression and one based on support vector machines. We then describe a joint thresholding strategy that combines the outputs of the logistic regression and support vector machine classifiers.
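As an illustration of top-k thresholding over a ranked list, the following is a minimal sketch and not the procedure used in our experiments. It assumes that classifier scores are already available, that k is tuned on held-out labeled data, and that F1 is the selection criterion; the function names (`top_k`, `f1`, `best_k`) and the toy data are hypothetical.

```python
from typing import Callable, Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring documents."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def f1(predicted: Set[int], labels: Sequence[int]) -> float:
    """F1 of a predicted positive set against 0/1 labels (assumed criterion)."""
    true_pos = sum(1 for i in predicted if labels[i] == 1)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / sum(labels)
    return 2 * precision * recall / (precision + recall)


def best_k(
    scores: Sequence[float],
    labels: Sequence[int],
    metric: Callable[[Set[int], Sequence[int]], float] = f1,
) -> int:
    """Pick the cutoff k that maximizes the metric on held-out labeled data."""
    candidates = range(1, len(scores) + 1)
    return max(candidates, key=lambda k: metric(top_k(scores, k), labels))


# Toy held-out data (hypothetical): classifier scores and gold labels.
val_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
val_labels = [1, 1, 0, 1, 0]

k = best_k(val_scores, val_labels)
positives = top_k(val_scores, k)
```

Any other evaluation measure can be passed as `metric` without changing the selection loop, which is why the criterion is kept as a parameter.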