SVM+BiHMM: A Hybrid Statistic Model for Metadata Extraction

This paper proposes SVM+BiHMM, a hybrid statistical model for metadata extraction that combines an SVM (support vector machine) with a BiHMM (bigram hidden Markov model). The BiHMM extends the HMM with both the bigram sequential relation and the position information of words by distinguishing the beginning emission probability from the inner emission probability. Extraction proceeds in three steps. First, a rule-based extractor segments documents into line-blocks. Second, the SVM classifier tags the blocks as metadata elements. Finally, the SVM+BiHMM model is built on the BiHMM, with the emission probability adjusted by a sigmoid function of the SVM score and the transition probability trained by the bigram HMM. The SVM classifier benefits from the structural patterns of document line data, while the bigram HMM considers both the bigram sequential relation and the position information of words. The two are therefore complementary, and SVM+BiHMM outperforms the HMM, BiHMM, and SVM methods in experiments on the same task.
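The combination step can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the emission probability of a line-block is simply the sigmoid of its SVM score for each label, and it leaves out the bigram word emissions of the BiHMM. The labels, scores, probabilities, and the scaling parameters `a` and `b` are all hypothetical values chosen for illustration.

```python
import math

def sigmoid(score, a=1.0, b=0.0):
    """Map a raw SVM score to (0, 1); a and b are assumed scaling parameters."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def viterbi(svm_scores, init, trans):
    """Return the most likely label sequence for a list of line-blocks.

    svm_scores: list of dicts, one per block, mapping label -> SVM score
    init:       dict label -> initial probability (must be > 0)
    trans:      dict (prev_label, label) -> transition probability (must be > 0)
    """
    labels = list(init)
    # Log-probability of each label for the first block.
    best = {s: math.log(init[s]) + math.log(sigmoid(svm_scores[0][s]))
            for s in labels}
    back = []  # one dict of back-pointers per later block
    for scores in svm_scores[1:]:
        step, pointers = {}, {}
        for s in labels:
            # Best predecessor for label s under the transition model.
            prev = max(labels, key=lambda p: best[p] + math.log(trans[(p, s)]))
            step[s] = (best[prev] + math.log(trans[(prev, s)])
                       + math.log(sigmoid(scores[s])))
            pointers[s] = prev
        best = step
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical example: three line-blocks, two metadata labels.
init = {"title": 0.9, "author": 0.1}
trans = {("title", "title"): 0.3, ("title", "author"): 0.7,
         ("author", "title"): 0.1, ("author", "author"): 0.9}
svm_scores = [{"title": 2.0, "author": -2.0},
              {"title": 0.2, "author": 0.0},   # SVM alone slightly prefers "title"
              {"title": -2.0, "author": 2.0}]
path = viterbi(svm_scores, init, trans)
print(path)
```

In this toy input the SVM score alone would label the second block as a title, but the transition probabilities make an author label more likely for the sequence as a whole, which is the kind of complementarity the abstract describes.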
