An information extraction system for heterogeneous Web source

Information extraction is the task of identifying information in text and converting it into a predefined format. In this paper, we build an information integration system that focuses on computer science teachers at Chinese universities. The system aims to automatically extract useful information from heterogeneous sources and reorganize it into a structured format. It comprises four main modules: web page retrieval, web page structure classification, information extraction, and information updating. We have applied the system to 107 universities in China, and the results demonstrate its effectiveness.
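The four-module flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the retrieval strategy, the structure classes, the extraction method, or the extracted fields, so everything here is an assumption made for illustration. The names `TeacherRecord`, `retrieve_pages`, `classify_structure`, `extract_records`, `update_store`, and `run_pipeline` are hypothetical, retrieval is stubbed with in-memory HTML, structure classification is a simple tag check, and extraction is a regular expression for e-mail addresses.

```python
import re
from dataclasses import dataclass


@dataclass
class TeacherRecord:
    # Hypothetical target schema; the abstract does not list the real fields.
    name: str
    email: str
    university: str


def retrieve_pages(university, sample_pages):
    """Module 1 (stub): return the raw HTML pages for one university.

    A real system would crawl the web; here we look the pages up in a dict.
    """
    return sample_pages.get(university, [])


def classify_structure(html):
    """Module 2 (assumed heuristic): label a page by its dominant layout."""
    if re.search(r"<table\b", html, re.I):
        return "table"
    if re.search(r"<(ul|ol)\b", html, re.I):
        return "list"
    return "free_text"


EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_records(html, structure, university):
    """Module 3 (assumed rule): one record per row or item containing an e-mail."""
    if structure == "table":
        chunks = re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.I | re.S)
    elif structure == "list":
        chunks = re.findall(r"<li\b[^>]*>(.*?)</li>", html, re.I | re.S)
    else:
        chunks = [html]
    records = []
    for chunk in chunks:
        text = re.sub(r"<[^>]+>", " ", chunk)  # drop remaining tags
        match = EMAIL.search(text)
        if match is None:
            continue
        # Treat the text before the e-mail address as the teacher's name.
        name = " ".join(text[: match.start()].split())
        records.append(TeacherRecord(name, match.group(), university))
    return records


def update_store(store, records):
    """Module 4: insert new or changed records; return how many changed."""
    changed = 0
    for record in records:
        key = (record.university, record.email)
        if store.get(key) != record:
            store[key] = record
            changed += 1
    return changed


def run_pipeline(universities, sample_pages, store):
    """Chain the four modules and return the number of updated records."""
    changed = 0
    for university in universities:
        for html in retrieve_pages(university, sample_pages):
            structure = classify_structure(html)
            records = extract_records(html, structure, university)
            changed += update_store(store, records)
    return changed


# Made-up sample data for illustration only.
SAMPLE_PAGES = {
    "Example University": [
        "<table><tr><td>Li Wei</td><td>liwei@example.edu</td></tr></table>",
        "<ul><li>Zhang Min zhangmin@example.edu</li></ul>",
    ]
}
```

Running `run_pipeline(["Example University"], SAMPLE_PAGES, {})` fills the store with one record per sample page. Running it again on the same store changes nothing, which is the behavior one would expect from an updating module that only writes new or modified entries.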
