Using a Reference Corpus as a User Model for Focused Information Retrieval

We propose a method for ranking short information nuggets extracted from a text corpus, using a second, reliable reference corpus as a user model. We argue that such additional corpora are commonly available and used in a number of IR tasks, and we apply the method to answering a form of definition question. The proposed ranking method substantially improves the performance of our system.
