Parsing the Medline Corpus

For the development of PHASAR, an experimental system for literature mining in the biosciences that uses dependency triples as search terms, we have parsed a snapshot of the Medline collection of biomedical abstracts (18 million short documents, 17 GB of text) using the EP4IR dependency parser for English. The resulting dependency trees were unnested into triples, and indices from words and from triples to documents were constructed. We describe the linguistic resources, the parsing technique used (best-only top-down chart parsing), and the unnesting and indexing processes, and we report the results of some performance measurements.
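The unnesting and indexing steps can be illustrated with a minimal Python sketch. It is not the PHASAR or EP4IR implementation: the `Node` class, the relation labels (`SUBJ`, `OBJ`, `ATTR`), the (head, relation, dependent) triple layout, and the in-memory dictionaries are all assumptions made for illustration, since the abstract does not specify the actual tree or index formats.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy dependency-tree node: a head word plus labeled dependents."""
    word: str
    children: list = field(default_factory=list)  # list of (relation, Node)


def unnest(node):
    """Flatten a tree into (head, relation, dependent) triples, depth-first."""
    triples = []
    for relation, child in node.children:
        triples.append((node.word, relation, child.word))
        triples.extend(unnest(child))
    return triples


def build_indices(parsed_docs):
    """Build inverted indices from words and from triples to document ids.

    parsed_docs maps a document id to the list of trees parsed from it.
    Only words that occur in at least one triple are indexed here.
    """
    word_index = defaultdict(set)
    triple_index = defaultdict(set)
    for doc_id, trees in parsed_docs.items():
        for tree in trees:
            for triple in unnest(tree):
                head, _relation, dependent = triple
                triple_index[triple].add(doc_id)
                word_index[head].add(doc_id)
                word_index[dependent].add(doc_id)
    return word_index, triple_index


# Hand-built tree for "aspirin inhibits platelet aggregation" (hypothetical labels).
tree = Node("inhibit", [
    ("SUBJ", Node("aspirin")),
    ("OBJ", Node("aggregation", [("ATTR", Node("platelet"))])),
])
other = Node("inhibit", [("SUBJ", Node("ibuprofen"))])

word_index, triple_index = build_indices({1: [tree], 2: [other]})
```

With indices of this shape, a triple used as a search term is a single dictionary lookup, for example `triple_index[("inhibit", "SUBJ", "aspirin")]`. A collection of 18 million documents would need a disk-based index rather than Python dictionaries.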
