Semantic feature extraction from technical texts with limited human intervention

Natural Language Processing (NLP) and message understanding systems often use semantic information to perform lexical and syntactic disambiguation and to assist them in "understanding" the text. Such information is domain-specific and hence difficult to acquire automatically, which causes a problem whenever an NLP system is moved from one domain to another. The portability of an NLP system can be improved if these semantic features can be acquired with limited human intervention. The semantic information needed by an NLP system may take several different forms. This dissertation focuses on two such semantic features: the semantic classes present in a given domain, and the lexico-semantic patterns that hold between content words in the domain. It describes the techniques used to extract these features from a domain with limited human intervention. Semantic classes are discovered by clustering objects on the basis of the lexico-syntactic environments in which they appear in the corpus. Results are presented for experiments that augment the noun semantic classes with class information obtained from WordNet, along with a methodology for formally evaluating the extracted semantic classes against classes provided by experts. Once the semantic classes have been obtained, they are used to generate the lexico-semantic patterns that are prevalent in the domain. A noteworthy feature of this research is that the techniques used to acquire the semantic features require very limited human intervention. Combining distributional and taxonomic techniques to obtain a set of semantic classes for a given domain also proved useful.
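The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's actual algorithm: the context labels (such as "obj_of:replace"), the cosine similarity measure, the average-link merging rule, the threshold value, and the toy data are all assumptions chosen for illustration. The sketch only shows the general idea that words occurring in similar lexico-syntactic environments end up in the same class.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(pairs):
    """Map each word to a Counter of the contexts it was observed in."""
    vectors = defaultdict(Counter)
    for word, context in pairs:
        vectors[word][context] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Average-link agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the
    best remaining pair falls below `threshold` (an assumed cutoff).
    """
    clusters = [[w] for w in sorted(vectors)]

    def sim(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, i, j = max(
            (sim(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical (word, lexico-syntactic context) observations from a parsed corpus.
pairs = [
    ("pump", "obj_of:replace"), ("pump", "subj_of:fail"),
    ("valve", "obj_of:replace"), ("valve", "subj_of:fail"),
    ("engineer", "subj_of:replace"), ("engineer", "subj_of:inspect"),
    ("technician", "subj_of:replace"), ("technician", "subj_of:inspect"),
]

classes = cluster(context_vectors(pairs))
print(classes)
```

On this toy input the equipment nouns and the person nouns fall into separate classes, because the two groups share no contexts with each other. A real system would obtain the contexts from a tagger or parser and would need a principled way to choose the stopping criterion.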
