JobOlize - Headhunting by Information Extraction in the Era of Web 2.0

E-recruitment is one of the most successful e-business applications, supporting both headhunters and job seekers. The explosive growth of online job offers makes it necessary to use information extraction techniques to build, e.g., job portals in a semi-automatic way. Existing approaches, however, hardly cope with the heterogeneous and semi-structured nature of job offers, nor do they consider the potential offered by Web 2.0 technologies. This paper proposes an information extraction system called “JobOlize”, realized for arbitrarily structured IT job offers. To improve extraction quality, a hybrid approach is employed that combines existing NLP techniques with a new form of context-driven extraction incorporating layout, structure, and content information. To allow users to adapt the extraction results while preserving the look and feel of the original Web pages, a rich client interface is provided. The improvements in extraction quality are demonstrated on the basis of a case study, and the experiences gained are generalized and critically reflected upon in a discussion of lessons learned.
