Information Extraction versus Text Segmentation for Web Content Mining

The information explosion of the Web aggravates the problem of effective information retrieval. Although various approaches in the literature aim to enhance retrieval, they remain insufficient because they make poor use of a page's actual content with respect to specific semantic content. This paper extends an existing method for automatic semantic segmentation. The existing method first partitions a web page into blocks based on its visual layout and a set of heuristics. A subsequent step partitions the page based on the appearance of specific types of named entities, with the help of a machine learning algorithm. Our work extends the initial method in several directions. First, it examines alternative named entities as features in the learning step. Second, it extends the initial corpus. Third, it evaluates and compares the initial method using metrics from text segmentation. Fourth, it incorporates the result of text segmentation as a feature in the learning process. Finally, it applies two text segmentation algorithms to evaluate the effectiveness of manual annotation. The reported results show that the synergy of semantic-based and text segmentation algorithms strongly depends on the predefined semantic model used for text segmentation.
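
The abstract does not say which text segmentation metrics are used for the evaluation. As an illustration only, the sketch below computes WindowDiff, a widely used measure that slides a window over a reference and a hypothesized segmentation and counts the windows in which the two disagree on the number of boundaries. The function name, the 0/1 boundary encoding, and the normalization by the number of windows are assumptions made for this example; other variants of the metric exist, and this is not necessarily the formulation used in the paper.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Illustrative WindowDiff between two segmentations.

    Assumed encoding: each sequence has one entry per text unit
    (e.g. a sentence or a page block); 1 means a segment boundary
    follows that unit, 0 means no boundary.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of units")
    n = len(reference)
    if not 0 < k <= n:
        raise ValueError("window size k must satisfy 0 < k <= number of units")

    windows = n - k + 1
    errors = 0
    for i in range(windows):
        # Count boundaries inside the current window for both segmentations.
        ref_count = sum(reference[i:i + k])
        hyp_count = sum(hypothesis[i:i + k])
        # A window is penalized once if the boundary counts differ.
        if ref_count != hyp_count:
            errors += 1
    return errors / windows


if __name__ == "__main__":
    gold = [0, 0, 1, 0, 0, 1, 0, 0]       # hypothetical manual annotation
    predicted = [0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical system output
    print(window_diff(gold, predicted, k=3))
```

A score of 0 means the two segmentations agree in every window, and higher scores mean more disagreement. A common choice for `k` is about half the average reference segment length, although that choice is also an assumption here.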
