Extracting Protein Sub-cellular Localizations from Literature

Protein sub-cellular localization (PSL) prediction is an important task for predicting protein function. Most previous work has taken a sequence-based approach that predicts locations for a given protein, and so it fails to provide useful information when a single protein is localized in several different sub-cellular locations depending on its state. This case is difficult for the sequence-based approach, but it can be tackled by a text-based approach. This article presents a novel text-based approach that uses Natural Language Processing techniques to extract PSL relations from the literature together with their evidence sentences. We conducted experiments to see how well our system identifies evidence sentences and which linguistic features of sentences contribute significantly to the task. Evidence sentences provide indispensable information that the sequence-based approach cannot supply.