Parsing the Medline Corpus

For the development of PHASAR, an experimental literature-mining system for the biosciences that uses dependency triples as search terms, we parsed a snapshot of the Medline collection of biomedical abstracts (18 million short documents, 17 Gbytes of text) with the EP4IR dependency parser for English. The resulting dependency trees were unnested into triples, and indices from words and from triples to documents were constructed. We describe the linguistic resources, the parsing technique (best-only top-down chart parsing), and the unnesting and indexing processes, and we report the results of some performance measurements.
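
To make the unnesting and indexing steps concrete, the following minimal Python sketch flattens a nested dependency tree into (head, relation, dependent) triples and builds in-memory indices from words and from triples to document identifiers. It is an illustration under stated assumptions, not the PHASAR or EP4IR implementation: the tree representation, the relation labels (SUBJ, OBJ, ATTR), the function names `unnest` and `build_indices`, and the document identifier are all hypothetical, and a collection of Medline's size would call for a disk-based index rather than Python dictionaries.

```python
from collections import defaultdict

# Assumed (hypothetical) tree format: (head_word, [(relation, subtree), ...]).
# This is NOT the actual EP4IR output format; it only illustrates the idea.


def unnest(tree):
    """Flatten a nested dependency tree into (head, relation, dependent) triples."""
    head, children = tree
    triples = []
    for relation, subtree in children:
        # One triple links this head to the head word of the child subtree.
        triples.append((head, relation, subtree[0]))
        # Triples nested inside the child subtree follow.
        triples.extend(unnest(subtree))
    return triples


def build_indices(parsed_docs):
    """Build word -> documents and triple -> documents indices.

    parsed_docs maps a document id to the list of dependency trees
    produced for that document (one tree per parsed sentence).
    """
    word_index = defaultdict(set)
    triple_index = defaultdict(set)
    for doc_id, trees in parsed_docs.items():
        for tree in trees:
            for triple in unnest(tree):
                head, _relation, dependent = triple
                triple_index[triple].add(doc_id)
                word_index[head].add(doc_id)
                word_index[dependent].add(doc_id)
    return word_index, triple_index


# Hypothetical parse of "aspirin inhibits prostaglandin synthesis".
example_tree = (
    "inhibits",
    [
        ("SUBJ", ("aspirin", [])),
        ("OBJ", ("synthesis", [("ATTR", ("prostaglandin", []))])),
    ],
)

word_index, triple_index = build_indices({"doc1": [example_tree]})
```

With indices of this shape, a triple such as `("inhibits", "SUBJ", "aspirin")` can serve directly as a search term: looking it up in `triple_index` returns the set of documents whose parses contain that relation.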