Embarking on a Web Information Extraction project

Web Information Extraction (WIE) is a popular research topic; however, we have yet to find a fully operational implementation of WIE, especially in the training-course domain. This paper surveys the technologies that can be used for this kind of project and introduces some of the issues we have encountered. Our aim is to offer a different view of WIE that can serve as a reference model for future projects.
