Paper Information - A Semantic DOM Approach for Webpage Information Extraction

A Semantic DOM Approach for Webpage Information Extraction

With the development of electronic technology and e-commerce, web page technology has attracted considerable research effort and has recently become one of the hottest research topics. This paper proposes a semantic DOM (SDOM) approach for extracting information from e-commerce web pages. By combining content and structure information, the approach achieves good precision and recall, as shown in our experiments on the listpage and tablepage data sets.

Zongwei Luo | Yun Xu | Yulian Fei | Winston Zhang
