World Wide Web site summarization

Summaries of Web sites help Web users get an idea of a site's contents without having to spend time browsing it. Manually constructed summaries written by volunteer experts are currently available, for example in the DMOZ Open Directory Project. This research aims to automate the Web site summarization task. To this end, an approach that applies machine learning and natural language processing techniques is developed to summarize a Web site automatically. In a formal evaluation involving human subjects, the information content of the automatically generated summaries is compared with that of DMOZ summaries, home page browsing, and time-limited site browsing, for a number of academic and commercial Web sites. Statistical evaluation of the subjects' scores on a list of questions about the sites shows that the automatically generated summaries convey the same information to the reader as DMOZ summaries do, and more information than either of the two browsing options.
