Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup

MOTIVATION: The biological literature is a major repository of knowledge, and many biological databases draw much of their content from careful curation of this literature. However, as the volume of literature grows, so does the burden of curation. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful.

RESULTS: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 journal articles curated in FlyBase, along with the associated lists of genes and gene products and the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles. The 18 participating groups provided systems that flagged articles for curation based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top-performing groups.
