Question Answering on a Case Insensitive Corpus

Most question answering (QA) systems rely on both keyword indexing and Named Entity (NE) tagging. The corpus from which a QA system attempts to retrieve answers is usually mixed-case text. However, many corpora consist of case-insensitive documents, e.g. speech recognition output. This paper presents a successful approach to QA on a case-insensitive corpus, in which a preprocessing module restores the case-sensitive form of the text. The case-restored document pool is then fed to the QA system, which remains unchanged. The case restoration module is implemented as a Hidden Markov Model trained on a large raw corpus of case-sensitive documents. This approach is shown to cause only a small degradation in QA benchmark performance (2.8%), mainly because the underlying information extraction support itself degrades little.
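
To make the case restoration idea concrete, the following is a minimal sketch rather than the system described above. The abstract does not specify the structure of the Hidden Markov Model, so the sketch assumes a simple one: hidden states are cased word forms, each state deterministically emits its lowercased form, and transition probabilities are interpolated bigram and unigram estimates counted from case-sensitive text. The class name `CaseRestorer`, the interpolation weight `lam`, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start state


class CaseRestorer:
    """Toy bigram HMM: states are cased forms, observations are lowercased tokens."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam                  # assumed bigram/unigram interpolation weight
        self.uni = Counter()            # counts of cased tokens
        self.bi = Counter()             # counts of (previous, current) cased tokens
        self.forms = defaultdict(set)   # lowercased token -> cased forms seen
        for sent in sentences:
            prev = START
            self.uni[START] += 1
            for tok in sent:
                self.uni[tok] += 1
                self.bi[(prev, tok)] += 1
                self.forms[tok.lower()].add(tok)
                prev = tok
        self.total = sum(self.uni.values())

    def _logp(self, prev, tok):
        # Add-one smoothed unigram estimate keeps unseen tokens above zero.
        p_uni = (self.uni[tok] + 1) / (self.total + len(self.uni) + 1)
        p_bi = self.bi[(prev, tok)] / self.uni[prev] if self.uni[prev] else 0.0
        return math.log(self.lam * p_bi + (1 - self.lam) * p_uni)

    def restore(self, tokens):
        """Viterbi search for the most probable cased sequence."""
        best = {START: (0.0, [])}  # state -> (log score, best path ending there)
        for tok in tokens:
            low = tok.lower()
            # Unknown words have a single candidate: the lowercased form itself.
            candidates = self.forms.get(low, {low})
            nxt = {}
            for cand in candidates:
                score, path = max(
                    ((s + self._logp(prev, cand), p) for prev, (s, p) in best.items()),
                    key=lambda item: item[0],
                )
                nxt[cand] = (score, path + [cand])
            best = nxt
        return max(best.values(), key=lambda item: item[0])[1]


training = [
    "Bill Clinton signed the bill".split(),
    "The bill was signed by Bill Clinton".split(),
    "The president met Bill Clinton".split(),
]
model = CaseRestorer(training)
print(model.restore("the bill was signed by bill clinton".split()))
# ['The', 'bill', 'was', 'signed', 'by', 'Bill', 'Clinton']
```

In this toy example the two occurrences of "bill" receive different casing because the surrounding context differs, which is the kind of disambiguation a downstream NE tagger depends on. A realistic model would be trained on a far larger corpus and would need a better treatment of unknown words.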
