A Semantic DOM Approach for Webpage Information Extraction

With the growth of electronic technology and e-commerce, technology for processing web pages has attracted considerable research effort and has recently become an active research topic. This paper proposes a semantic DOM (SDOM) approach for extracting information from e-commerce web pages. By combining content and structure information, the approach achieves good precision and recall in our experiments on the listpage and tablepage data sets.
