Automated document metadata extraction

Web documents are available in various forms, most of which do not carry additional semantics. This paper presents a model for general document metadata extraction. The model, which combines segmentation by keywords with pattern matching techniques, was implemented using PHP, MySQL, JavaScript and HTML. The system was tested with 40 randomly selected PDF documents, mainly theses. It was evaluated using standard measures, namely precision, recall, accuracy and F-measure. The results show that the model is relatively effective for the task of metadata extraction, especially for theses and dissertations. Combining machine learning with these rule-based methods will be explored in future work to improve the results.
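As an illustration of the evaluation measures named above, the following is a minimal sketch in Python of how precision, recall, accuracy and F-measure are conventionally computed from per-field extraction counts. The names `Counts` and `evaluate` and the example counts are hypothetical, chosen only for illustration; they are not taken from the system described here, which was implemented in PHP, and they do not reproduce its reported results.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tally of extraction outcomes for one metadata field."""
    tp: int  # field present and extracted correctly
    fp: int  # value extracted but wrong or spurious
    fn: int  # field present but missed
    tn: int  # field absent and correctly left empty


def evaluate(c: Counts) -> dict:
    """Compute the four standard measures, returning 0.0 for empty denominators."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    accuracy = (c.tp + c.tn) / total if total else 0.0
    # F-measure here is the balanced F1: the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(evaluate(Counts(tp=8, fp=2, fn=2, tn=8)))
```

The sketch assumes the balanced form of the F-measure; a weighted variant would change only the last formula.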
