Extraction of semantic relations from bioscience text

A crucial area of Natural Language Processing is semantic analysis, the study of the meaning of linguistic utterances. This thesis proposes algorithms that extract semantics from bioscience text using statistical machine learning techniques. In particular, it is concerned with the identification of concepts of interest ("entities", "roles") and of the relationships that hold between them. The thesis describes three projects along these lines. First, I tackle the problem of classifying the semantic relations between nouns in noun compounds, to characterize, for example, the "treatment-for-disease" relationship between the words of "migraine treatment" versus the "method-of-treatment" relationship between the words of "sumatriptan treatment". Noun compounds are frequent in technical text, and any language understanding program needs to be able to interpret them. The task is especially difficult due to the lack of syntactic clues. I propose two approaches to this problem. Second, extending the work to the sentence level, I examine the problem of distinguishing among seven relation types that can occur between the entities "treatment" and "disease", and the problem of identifying such entities. I compare five generative graphical models and a neural network, using lexical, syntactic, and semantic features. Finally, I tackle the problem of identifying the interactions between proteins, proposing the use of an existing curated database to address the lack of appropriately labeled data. In each of these cases, I propose, design, and implement state-of-the-art machine learning algorithms. The results obtained represent first steps toward a comprehensive strategy of exploiting machine learning algorithms for the analysis of bioscience text.
