Paper information - Extraction of semantic relations from bioscience text

Extraction of semantic relations from bioscience text

A crucial area of Natural Language Processing is semantic analysis, the study of the meaning of linguistic utterances. This thesis proposes algorithms that extract semantics from bioscience text using statistical machine learning techniques. In particular, this thesis is concerned with the identification of concepts of interest ("entities", "roles") and the identification of the relationships that hold between them. This thesis describes three projects along these lines. First, I tackle the problem of classifying the semantic relations between nouns in noun compounds, to characterize, for example, the "treatment-for-disease" relationship between the words of "migraine treatment" versus the "method-of-treatment" relationship between the words of "sumatriptan treatment". Noun compounds are frequent in technical text, and any language understanding program needs to be able to interpret them. The task is especially difficult due to the lack of syntactic clues. I propose two approaches to this problem. Second, extending the work to the sentence level, I examine the problem of distinguishing among seven relation types that can occur between the entities "treatment" and "disease" and the problem of identifying such entities. I compare five generative graphical models and a neural network, using lexical, syntactic, and semantic features. Finally, I tackle the problem of identifying the interactions between proteins, proposing the use of an existing curated database to address the problem of the lack of appropriately labeled data. In each of these cases, I propose, design, and implement state-of-the-art machine learning algorithms. The results obtained represent first steps on the way to a comprehensive strategy of exploiting machine learning algorithms for the analysis of bioscience text.

Barbara Rosario | Marti A. Hearst
