Automated Query-biased and Structure-preserving Text Summarization on Web Documents

Automatic summarization has recently become an important application owing to the growing amount of information available on the Web. Summarization techniques can be very useful in improving the effectiveness of Web search. However, current search engines, such as Google, display only short extracts under each search result, typically two lines of text fragments consisting of the query words and their surrounding text. In this paper, we investigate novel summarization techniques to improve the effectiveness of search engines. The proposed system incorporates the structure of the documents, namely the sectional hierarchy, into the output summaries. Unlike previous work, both the structural information and the content to be displayed in the summary are selected in a query-biased way. The system also applies natural language processing techniques to summarization, such as the identification of phrases, which are better content carriers than single words.
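
As an illustration of what a query-biased, structure-preserving summary might look like, the following is a minimal sketch and not the system described in this paper. It assumes a document already parsed into a section tree and uses plain word overlap with the query as the selection criterion; the names `Section`, `tokens`, `score`, and `summarize` are hypothetical, and phrase identification is not modeled.

```python
from dataclasses import dataclass, field

PUNCT = ".,;:!?()\"'"


@dataclass
class Section:
    """Hypothetical node of a document's sectional hierarchy."""
    heading: str
    sentences: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def tokens(text: str) -> list[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def score(text: str, query_terms: set[str]) -> int:
    """Assumed relevance measure: number of tokens that are query terms."""
    return sum(1 for t in tokens(text) if t in query_terms)


def summarize(section: Section, query: str, depth: int = 0) -> list[str]:
    """Return indented summary lines, or [] if nothing matches the query.

    A heading is kept whenever it, one of its sentences, or any
    descendant section matches, so the path from the root to every
    selected sentence is preserved in the output.
    """
    query_terms = set(tokens(query))
    picked = [s for s in section.sentences if score(s, query_terms) > 0]
    child_lines: list[str] = []
    for child in section.children:
        child_lines.extend(summarize(child, query, depth + 1))

    if not (picked or child_lines or score(section.heading, query_terms) > 0):
        return []
    lines = ["  " * depth + section.heading]
    lines.extend("  " * (depth + 1) + s for s in picked)
    lines.extend(child_lines)
    return lines


doc = Section(
    "Jaguar",
    ["The jaguar is a large cat."],
    [
        Section("Habitat", ["Jaguars live in rainforests.",
                            "They prefer dense cover."]),
        Section("Diet", ["They hunt deer and fish."]),
    ],
)
print("\n".join(summarize(doc, "rainforests cover")))
```

In this sketch the "Diet" section is dropped because neither its heading nor its sentence shares a word with the query, while the "Jaguar" heading is retained only to preserve the hierarchy above "Habitat".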