Automatic document metadata extraction using support vector machines

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine (SVM) classification-based method for extracting metadata from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure then improves the line classification by using the class labels predicted for each line's neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries within each line. We found that discovering and using the structural patterns of the data, together with domain-based word clustering, can improve metadata extraction performance. Appropriate feature normalization also greatly improves classification performance. Our method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.
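The iterative line classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_classifier` is a hypothetical rule-based stand-in for the trained SVM classifiers, the label names are illustrative rather than the paper's 15 classes, and each line receives a single label instead of one or more. Only the control flow is meant to match the description: an independent first pass, followed by rounds in which each line is reclassified using its neighbors' labels from the previous round until the labels stop changing.

```python
from typing import Callable, List

# A classifier maps (line, previous neighbor label, next neighbor label) to a label.
Classifier = Callable[[str, str, str], str]

NONE = "<none>"  # context placeholder for the first pass and for header boundaries


def toy_classifier(line: str, prev_label: str, next_label: str) -> str:
    """Hypothetical stand-in for the trained SVM line classifiers."""
    text = line.lower()
    if "@" in text:
        return "email"
    if any(word in text for word in ("university", "department", "institute")):
        return "affiliation"
    # Context rule: a short line directly above an affiliation or email
    # is treated as an author line.
    if next_label in ("affiliation", "email") and len(line.split()) <= 4:
        return "author"
    if prev_label in (NONE, "title"):
        return "title"
    return "other"


def classify_header(lines: List[str], classify: Classifier,
                    max_rounds: int = 10) -> List[str]:
    """Label header lines, then refine using neighbor labels until stable."""
    # First pass: every line is classified without any neighbor context.
    labels = [classify(line, NONE, NONE) for line in lines]
    for _ in range(max_rounds):
        new_labels = []
        for i, line in enumerate(lines):
            # Neighbor labels always come from the previous round.
            prev_label = labels[i - 1] if i > 0 else NONE
            next_label = labels[i + 1] if i + 1 < len(lines) else NONE
            new_labels.append(classify(line, prev_label, next_label))
        if new_labels == labels:  # converged: no label changed this round
            break
        labels = new_labels
    return labels


header = [
    "A Study of Header Line Classification",
    "Jane Doe",
    "Department of Computing, Example University",
    "jane.doe@example.org",
]
print(classify_header(header, toy_classifier))
# ['title', 'author', 'affiliation', 'email']
```

In this made-up header, the first pass labels "Jane Doe" as a title line because no context is available. The second round corrects it to an author line once the affiliation label of the following line is known, which is the kind of improvement the neighbor-label rounds are meant to provide.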

[1]  Mitchell P. Marcus,et al.  Text Chunking using Transformation-Based Learning , 1995, VLC@ACL.

[2]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[3]  Susan T. Dumais,et al.  Inductive learning algorithms and representations for text categorization , 1998, CIKM '98.

[4]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[5]  Kevin Chen-Chuan Chang,et al.  Interoperability for digital libraries worldwide , 1998, CACM.

[6]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[7]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[8]  Catherine C. Marshall,et al.  Making metadata: a study of metadata creation for a mixed physical-digital collection , 1998, DL '98.

[9]  C. Lee Giles,et al.  Digital Libraries and Autonomous Citation Indexing , 1999, Computer.

[10]  Roni Rosenfeld,et al.  Learning Hidden Markov Model Structure for Information Extraction , 1999 .

[11]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[12]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[13]  Yuji Matsumoto,et al.  Use of Support Vector Learning for Chunk Identification , 2000, CoNLL/LLL.

[14]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[15]  Daniel Jurafsky,et al.  Knowledge-Free Induction of Inflectional Morphologies , 2001, NAACL.

[17]  Thorsten Joachims,et al.  A statistical learning model of text classification for support vector machines , 2001, SIGIR '01.

[18]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[19]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[20]  Hongyuan Zha,et al.  Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering , 2002, SIGIR '02.

[21]  Nigel Collier,et al.  Use of Support Vector Machines in Extended Named Entity Recognition , 2002, CoNLL.

[22]  Hwee Tou Ng,et al.  A maximum entropy approach to information extraction from semi-structured and free text , 2002, AAAI/IAAI.

[23]  James Mayfield,et al.  Entity Extraction without Language-Specific Resources , 2002, CoNLL.

[24]  Kurt Maly,et al.  Federating heterogeneous digital libraries by metadata harvesting , 2002 .

[25]  Deborah Knox.  CITIDEL: making resources available , 2002, ITiCSE '02.

[26]  Proceedings of the 2003 Joint Conference on Digital Libraries , 2003.

[27]  Hui Han,et al.  eBizSearch: an OAI-compliant digital library for ebusiness , 2003, 2003 Joint Conference on Digital Libraries.

[28]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[29]  Andrew McCallum,et al.  Automating the Construction of Internet Portals with Machine Learning , 2000, Information Retrieval.

[30]  Stuart Weibel,et al.  The Dublin Core: A Simple Content Description Model for Electronic Resources , 2005 .

[31]  Wang Jun.  Open Archives Initiative Protocol for Metadata Harvesting , 2005 .