Sepia: Semantic Parsing for Named Entities

People's names, dates, locations, organizations, and various numeric expressions, collectively called Named Entities, are used to convey specific meanings to humans in the same way that identifiers and constants convey meaning to a computer language interpreter. Natural Language Question Answering can benefit from understanding the meaning of these expressions because answers in a text are often phrased differently from questions and from each other. For example, "9/11" might mean the same as "September 11th", and "Mayor Rudy Giuliani" might be the same person as "Rudolph Giuliani". Sepia, the system presented here, uses a lexicon of lambda expressions and a mildly context-sensitive parser to create a data structure for each named entity. The parser and grammar design are inspired by Combinatory Categorial Grammar. The data structures are designed to capture semantic dependencies using common syntactic forms. Sepia differs from other natural language parsers in that it does not use a pipeline architecture. As yet there is no statistical component in the architecture. To evaluate Sepia, I use examples to illustrate its qualitative differences from other named entity systems, I measure component performance on Automatic Content Extraction (ACE) competition held-out training data, and I assess end-to-end performance in the Infolab's TREC-12 Question Answering competition entry. Sepia will compete in the ACE Entity Detection and Tracking track at the end of September.

Thesis Supervisor: Boris Katz
Title: Principal Research Scientist
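To make the idea of a lexicon of lambda expressions concrete, the following is a minimal sketch, not Sepia's actual implementation. It assumes a toy lexicon in which a month name is a function waiting for a day (loosely analogous to forward application in Combinatory Categorial Grammar), and it assumes US month/day order for slash dates. The names `Date`, `MONTHS`, `lookup`, and `parse` are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Date:
    """Toy semantic structure for a date entity (hypothetical, not Sepia's)."""
    month: int
    day: int


# Hypothetical mini-lexicon of month names.
MONTHS = {"september": 9, "october": 10}

SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})$")      # e.g. "9/11"
DAY = re.compile(r"^(\d{1,2})(?:st|nd|rd|th)?$")       # e.g. "11th" or "11"


def lookup(token):
    """Map one token to a semantic value: a Date, a day number,
    a function awaiting a day, or None for unknown tokens."""
    t = token.strip(".,").lower()
    m = SLASH_DATE.match(t)
    if m:
        # Assumption: month/day order, as in US usage.
        return Date(int(m.group(1)), int(m.group(2)))
    if t in MONTHS:
        month = MONTHS[t]
        # The month is a lambda that still needs its day argument.
        return lambda day: Date(month, day)
    m = DAY.match(t)
    if m:
        return int(m.group(1))
    return None


def parse(text):
    """Scan left to right, applying a pending function to the next day."""
    results = []
    pending = None  # a lexicon function waiting for its argument
    for token in text.split():
        value = lookup(token)
        if callable(value):
            pending = value
        elif pending is not None and isinstance(value, int):
            results.append(pending(value))
            pending = None
        elif isinstance(value, Date):
            results.append(value)
            pending = None
        else:
            pending = None
    return results
```

Under these assumptions, `parse("September 11th")` and `parse("9/11")` build the same `Date` structure, which is the kind of equivalence the abstract says a question answering system can exploit.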
