A DOCUMENT ENGINEERING APPROACH TO AUTOMATIC EXTRACTION OF SHALLOW METADATA FROM SCIENTIFIC PUBLICATIONS

Semantic metadata can be considered one of the foundational building blocks of the Semantic Web and the Semantic Desktop. This report describes a solution for automatically extracting metadata from scientific publications published as PDF documents. The proposed algorithms follow a low-level document engineering approach, mining and analyzing the publications' text based on its formatting style and font information. We evaluate the algorithms and compare their performance with that of similar approaches. In addition, we present a sample application that represents the use case for the metadata extraction algorithms.
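To illustrate what extraction driven by font information can look like, the following is a minimal sketch of one heuristic: treating the first run of largest-font text on the first page as the title. It is not the algorithm described in this report. The `Span` type, its fields, and the `guess_title` function are hypothetical names introduced for illustration, and the sketch assumes the text spans and their font sizes have already been read from the PDF by some other tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical text span with the font size it was rendered in."""
    text: str
    font_size: float
    page: int


def guess_title(spans: List[Span]) -> Optional[str]:
    """Return the first run of largest-font text on page 1, or None."""
    # Only non-empty spans on the first page are candidates.
    first_page = [s for s in spans if s.page == 1 and s.text.strip()]
    if not first_page:
        return None

    largest = max(s.font_size for s in first_page)

    # Collect the first consecutive run of spans set in the largest font,
    # so that a multi-line title is joined into a single string.
    parts: List[str] = []
    for s in first_page:
        if s.font_size == largest:
            parts.append(s.text.strip())
        elif parts:
            break
    return " ".join(parts)


spans = [
    Span("Workshop Proceedings", 9.0, 1),
    Span("A Document Engineering Approach", 18.0, 1),
    Span("to Metadata Extraction", 18.0, 1),
    Span("Jane Doe", 12.0, 1),
    Span("Abstract", 18.0, 2),
]
print(guess_title(spans))
```

A real extractor would likely combine several such cues (position on the page, font family, style changes) rather than rely on size alone; this sketch shows only the single-cue case.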
