The Data Records Extraction from Web Pages

Copyright © 2019 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)

ABSTRACT

No other medium has taken a more meaningful place in our lives in such a short time than the world's largest data network, the World Wide Web. However, when searching for information on the Web, the user is constantly exposed to an ever-growing flood of information, which is both a blessing and a curse. The explosive growth and popularity of the World Wide Web have resulted in a huge number of information sources on the Internet. As web sites become more complicated, constructing web information extraction systems becomes more difficult and time-consuming, so scalable automatic Web Information Extraction (WIE) is in increasingly high demand. Information can be extracted from the World Wide Web at four levels: free-text level, record level, page level and site level. In this paper, the target extraction task is record-level extraction.
