Personalized text snippet extraction using statistical language models

In knowledge discovery over a text database, extracting and returning the subset of information most relevant to a user's query is a critical task. More broadly, this amounts to identifying personalized patterns, a capability that drives applications such as Web search engine construction, customized text summarization, and automated question answering. The related problem of text snippet extraction has been studied previously in information retrieval. Common strategies in those studies for extracting and presenting text snippets either process document fragments that have been delimited a priori or use a fixed-size sliding window to highlight the results. In this work, we argue that text snippet extraction can be generalized if the user's intention is better utilized. Our approach overcomes the rigidity of existing methods by dynamically returning more flexible start and end positions for text snippets, which are also semantically more coherent. This is achieved by constructing and using statistical language models that effectively capture the commonalities between a document and the user's intention. Experiments indicate that the proposed solutions provide effective personalized information extraction services.
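To make the contrast with fixed-size windows concrete, the following is a minimal sketch of variable-length snippet selection. It is not the model used in this work: as a stand-in, it assumes a unigram query model smoothed with the document's own unigram distribution, scores each token by the log-ratio of the two, and picks the contiguous span with the highest total score. The function names (`token_scores`, `best_span`, `extract_snippet`) and the mixing weight `lam` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def token_scores(doc_tokens, query_tokens, lam=0.3):
    """Score each document token by log P_query(tok) / P_doc(tok).

    P_doc is the document's maximum-likelihood unigram model. P_query mixes
    the query's unigram model (weight lam) with P_doc (weight 1 - lam), so
    tokens absent from the query get the negative score log(1 - lam).
    """
    doc_counts = Counter(doc_tokens)
    query_counts = Counter(query_tokens)
    n_doc, n_query = len(doc_tokens), len(query_tokens)
    scores = []
    for tok in doc_tokens:
        p_doc = doc_counts[tok] / n_doc
        p_query = lam * query_counts[tok] / n_query + (1 - lam) * p_doc
        scores.append(math.log(p_query / p_doc))
    return scores


def best_span(scores):
    """Return (start, end, total) of the maximum-sum contiguous span.

    Standard maximum-subarray scan; end is exclusive. Start and end are
    chosen by the scores rather than by a fixed window size.
    """
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = s, i  # start a new candidate span here
        else:
            cur_sum += s  # extend the current candidate span
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_snippet(doc_tokens, query_tokens, lam=0.3):
    """Return the tokens of the best-scoring span, or [] for empty input."""
    if not doc_tokens or not query_tokens:
        return []
    start, end, _ = best_span(token_scores(doc_tokens, query_tokens, lam))
    return doc_tokens[start:end]


doc = ("the weather was mild . language models rank text snippets "
       "for search . lunch was late").split()
query = "language models snippets".split()
print(" ".join(extract_snippet(doc, query)))
```

In this toy setting the selected span bridges the two non-query tokens between the matched terms, because their combined penalty is smaller than the gain from including the third match. A larger `lam` penalizes non-query tokens more heavily and tends to produce shorter snippets.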
