Blog Post and Comment Extraction Using Information Quantity of Web Format

As research on the blogosphere develops, extracting the post and its comments from a blog page has become increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine visual information with effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, this method achieves good extraction performance and improves the performance of blog search.
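The second stage can be illustrated with a minimal sketch. The abstract does not define "information quantity", so the sketch below assumes it is the self-information I(s) = -log2 p(s) of a separator s, where p(s) is the relative frequency of s among the separators in the main text. It further assumes that the post/comment boundary lies at the first separator with the highest information quantity, on the intuition that separators between comments repeat while the one before the comment area is rare. The function names and the input representation (a list of text blocks plus the separators between them) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def separator_information(separators):
    """Return the assumed information quantity -log2(p) of each separator type."""
    counts = Counter(separators)
    total = len(separators)
    return {sep: -math.log2(count / total) for sep, count in counts.items()}


def split_post_and_comments(blocks, separators):
    """Split text blocks into (post, comments).

    separators[i] is the separator between blocks[i] and blocks[i + 1].
    The boundary is assumed to be the first separator whose type has the
    highest information quantity.
    """
    if len(separators) != len(blocks) - 1:
        raise ValueError("expected exactly one separator between adjacent blocks")
    if not separators:
        # A single block: treat everything as the post.
        return list(blocks), []
    info = separator_information(separators)
    # max() returns the first index that attains the maximum value.
    boundary = max(range(len(separators)), key=lambda i: info[separators[i]])
    return blocks[: boundary + 1], blocks[boundary + 1 :]
```

For example, with blocks `["Post part 1", "Post part 2", "Comment 1", "Comment 2", "Comment 3"]` and separators `["br", "h3", "br", "br"]`, the rare `h3` separator carries the most information under this assumption, so the first two blocks are returned as the post and the remaining three as comments. A real page would need the first stage (locating the main text) to run beforehand, and ties between equally rare separators are resolved here simply by taking the earliest one.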
