The information that search engines display in their results is important for finding the desired information quickly. In particular, the summary of each Web page in the search results helps the user judge both the content of the page and how the input search term is used in it, namely, the relation between the search term and the Web page. However, the summaries in conventional search engines have problems: some extract only the opening text and do not contain the search term, while others contain the search term but truncate the sentence midway, so that neither the context of the term nor the content of the Web page can be determined. A summary in sentence units is therefore desirable. However, HTML text includes many nonsentence items that contain no punctuation, and if these are left unprocessed, it is difficult for a key sentence extraction system that treats sentences as units to produce a summary. In this paper, we therefore propose an HTML text segmentation system that divides the source text of each Web page into meaningfully connected groups of text corresponding to sentences. We also verify experimentally that the text generated by this system can be used effectively in Web page summarization. © 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(7): 26–36, 2006; Published online in Wiley InterScience. DOI 10.1002/scj.20416